Closed egorgrachev closed 2 years ago
Ah, damn. Sorry! I've created that issue mostly as a reminder for myself.
I've addressed it in https://github.com/greyblake/whatlang-rs/pull/116, I will take a look at your PR when I have time and fresh, but likely I will have to bounce it, because the whole idea of that refactoring was actually to make cyrillic::alphabet_calculate_scores
to improve performance by leveraging optimization that was implemented for latin langs.
No problem, I didn't notice #116 I see that you pre-compute lang_map, I thought about the same when I mentioned that "It's probably possible to optimise"
Most likely no need to merge my PR then :)
Also for my understanding - how important the performance for you in this code?
The algo code is not very easy to understand, but it may be done way cleaner with HashSet / HashMap (or BTreeSet / TreeMap) for char-langs, char_scores and lang_scores.
I profiled the code a bit and for me most of the time were spent in get_all_chars_in_langs and get_alphabet_lang_map. I can take a look at refactoring to simplify / improve readability if you think it make sense to do so
@egorgrachev Thank you for understanding.
Also for my understanding - how important the performance for you in this code? It's crucial. There were number of optimization from different people since the project started in 2016. The performance is crucial, because whatlang kind of competes with https://github.com/pemistahl/lingua-rs. Due to the model lingua uses it has better accuracy, so what whatlang can offer is:
- Much better performance (last time I did benchmarking Whatlang was 11X faster then Lingua). I guess today it's still 10x-20x faster.
- Very small memory and binary footprint: the endtire WASM file for https://whatlang.org is about 500Kb, while Lingua compiled has 90MB WASM.
Generally I would prefer readability over performance and avoid premature optimizations. But this project is a rare project of mine, where I'd like to squeeze the maxium from hardware. That said I'd like to keep the codebase well documented and understandable.
Whatlang is also used by Sonic search engine. Guys from meilisearch also recently contributed so I'd assume they're using Whatlang already or considering using it for Melisearch.
I am closing this PR in favor of https://github.com/greyblake/whatlang-rs/pull/116
@egorgrachev If you're interested in further contributions, let's stay in touch.
111
it degrades performance a bit, mostly because we re-compute alphabet lang map instead of having static variable. It's probably possible to optimise
on my pc cargo bench went from 5kk to 8kk ns