Closed ManyTheFish closed 2 years ago
@ManyTheFish @Kerollmops Thank you guys! I can't promise to review this soon, because there is a lot of shit going on with my relatives and friends in Ukraine, and helping them and Ukraine is a much higher priority for me at the moment.
@ManyTheFish Just to let you know, I haven't forgotten about this PR.
@ManyTheFish @Kerollmops Thank you guys! This is a truly outstanding PR! I've learned something new today :) Would you like to add further improvements? If you want, we can try to arrange a call, and I can explain how trigrams work :)
FYI: the optimization is released in 0.14.0.
Hey @greyblake! I'm pleased to see this PR merged.
I'll probably come back with a new PR between the 2nd and the 5th of May if I have a bit of time to work on it.
Summary
Optimize the alphabet_calculate_scores function used during Latin language detection.
Compute the score in two steps:
This avoids the nested loops that make the computational complexity quadratic.
For now, I haven't touched the trigram part, since its behavior is harder to understand. But I will probably try to optimize it in another PR.
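To illustrate the idea of scoring in two passes instead of nested loops, here is a minimal sketch. It is not the code from this PR: the `Lang` enum, the tiny `langs_using` table, and the function names `count_chars` and `score_langs` are all hypothetical stand-ins, and the real alphabet tables are much larger. The first pass counts characters, the second pass walks each distinct character once and adds its count to every language that uses it.

```rust
use std::collections::HashMap;

// Hypothetical three-language subset, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Lang {
    Eng,
    Fra,
    Deu,
}

// Hypothetical inverted mapping: for a char, which languages use it.
// (Assumption: the real tables are larger and stored differently.)
fn langs_using(c: char) -> &'static [Lang] {
    match c {
        'a' | 'e' => &[Lang::Eng, Lang::Fra, Lang::Deu],
        'ß' => &[Lang::Deu],
        'ç' | 'é' => &[Lang::Fra],
        _ => &[],
    }
}

// Step 1: count how often each lowercased char occurs in the text.
fn count_chars(text: &str) -> HashMap<char, u32> {
    let mut counts = HashMap::new();
    for c in text.chars().flat_map(char::to_lowercase) {
        *counts.entry(c).or_insert(0) += 1;
    }
    counts
}

// Step 2: visit each distinct char once and add its count to the
// score of every language that uses it. Scores are indexed by
// `Lang as usize`, so each update is an O(1) array access.
fn score_langs(counts: &HashMap<char, u32>) -> [u32; 3] {
    let mut scores = [0u32; 3];
    for (&c, &n) in counts {
        for &lang in langs_using(c) {
            scores[lang as usize] += n;
        }
    }
    scores
}
```

Calling `score_langs(&count_chars(text))` returns one raw score per language. The text is scanned only once, and the second pass depends on the number of distinct characters rather than on the text length times the alphabet size.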
Whatlang benchmarks
main branch
Commits
Replace sort_by
Use inverted mapping between char and Lang
Clamp score in normalization loop instead of creating intermediate vec
Increment a common score when a common char is found
Use binary search instead of iter find
Fix returned raw score
Use intermediate char score
Count Max score in char score Loop
Make lang score access O(1) when iterating over char scores