greyblake / whatlang-rs

Natural language detection library for Rust. Try demo online: https://whatlang.org/
https://whatlang.org/
MIT License
965 stars 108 forks source link

Using generic function to compute alphabet score for Cyrillic and latin #115

Closed egorgrachev closed 2 years ago

egorgrachev commented 2 years ago

111

it degrades performance a bit, mostly because we re-compute alphabet lang map instead of having static variable. It's probably possible to optimise

on my pc cargo bench went from 5kk to 8kk ns

greyblake commented 2 years ago

Ah, damn. Sorry! I've created that issue mostly as a reminder for myself. I've addressed it in https://github.com/greyblake/whatlang-rs/pull/116, I will take a look at your PR when I have time and fresh, but likely I will have to bounce it, because the whole idea of that refactoring was actually to make cyrillic::alphabet_calculate_scores to improve performance by leveraging optimization that was implemented for latin langs.

egorgrachev commented 2 years ago

No problem, I didn't notice #116 I see that you pre-compute lang_map, I thought about the same when I mentioned that "It's probably possible to optimise"

Most likely no need to merge my PR then :)

Also for my understanding - how important the performance for you in this code?
The algo code is not very easy to understand, but it may be done way cleaner with HashSet / HashMap (or BTreeSet / TreeMap) for char-langs, char_scores and lang_scores. I profiled the code a bit and for me most of the time were spent in get_all_chars_in_langs and get_alphabet_lang_map. I can take a look at refactoring to simplify / improve readability if you think it make sense to do so

greyblake commented 2 years ago

@egorgrachev Thank you for understanding.

Also for my understanding - how important the performance for you in this code? It's crucial. There were number of optimization from different people since the project started in 2016. The performance is crucial, because whatlang kind of competes with https://github.com/pemistahl/lingua-rs. Due to the model lingua uses it has better accuracy, so what whatlang can offer is:

Generally I would prefer readability over performance and avoid premature optimizations. But this project is a rare project of mine, where I'd like to squeeze the maxium from hardware. That said I'd like to keep the codebase well documented and understandable.

Whatlang is also used by Sonic search engine. Guys from meilisearch also recently contributed so I'd assume they're using Whatlang already or considering using it for Melisearch.

I am closing this PR in favor of https://github.com/greyblake/whatlang-rs/pull/116

greyblake commented 2 years ago

@egorgrachev If you're interested in further contributions, let's stay in touch.