ManyTheFish commented 2 years ago

Summary

Optimize alphabet_calculate_scores function used during Latin Language detection.

Compute the score in two steps:

iterate over the text and scores character based on their frequency
then sum the character's scores in Language's scores

This avoids imbricated loops that make the compute complexity quadratic.

For now, I didn't do anything on the trigrams part, the behavior is more complicated to understand. 😅 But, I will probably try to optimize it in another PR.

Whatlang benchmarks

main branch

test bench_detect        ... bench:   8,120,987 ns/iter (+/- 119,807)
test bench_detect_script ... bench:     102,829 ns/iter (+/- 1,600)

test latin alphabet only ... bench:   3,533,305 ns/iter (+/- 102,068)

Commits

Replace sort_by

test bench_detect        ... bench:   7,984,651 ns/iter (+/- 82,686)
test bench_detect_script ... bench:      93,662 ns/iter (+/- 2,249)

test latin alphabet only ... bench:   3,542,116 ns/iter (+/- 45,714)

Use inverted mapping between char and Lang

test bench_detect        ... bench:   7,642,256 ns/iter (+/- 91,700)
test bench_detect_script ... bench:      94,344 ns/iter (+/- 2,501)

test latin alphabet only ... bench:   3,221,397 ns/iter (+/- 32,611)

Clamp score in normalization loop instead of creating intermediate vec

test bench_detect        ... bench:   7,594,489 ns/iter (+/- 95,791)
test bench_detect_script ... bench:      98,816 ns/iter (+/- 2,039)

test latin alphabet only ... bench:   3,333,266 ns/iter (+/- 29,036)

Increment a common score when a common char is found

test bench_detect        ... bench:   5,190,589 ns/iter (+/- 73,054)
test bench_detect_script ... bench:      98,731 ns/iter (+/- 1,395)

test latin alphabet only ... bench:     846,625 ns/iter (+/- 13,893)

Use binary search instead of iter find

test bench_detect        ... bench:   4,992,537 ns/iter (+/- 77,899)
test bench_detect_script ... bench:      99,535 ns/iter (+/- 5,190)

test latin alphabet only ... bench:     575,347 ns/iter (+/- 7,259)

Fix returned raw score

test bench_detect        ... bench:   4,999,493 ns/iter (+/- 69,303)
test bench_detect_script ... bench:      93,847 ns/iter (+/- 1,339)

test latin alphabet only ... bench:     580,022 ns/iter (+/- 7,469)

Use intermediate char score

test bench_detect        ... bench:   4,802,885 ns/iter (+/- 73,186)
test bench_detect_script ... bench:      93,860 ns/iter (+/- 2,912)

test latin alphabet only ... bench:     388,278 ns/iter (+/- 4,427)

Count Max score in char score Loop

test bench_detect        ... bench:   4,837,791 ns/iter (+/- 71,111)
test bench_detect_script ... bench:      94,130 ns/iter (+/- 715)

test latin alphabet only ... bench:     354,250 ns/iter (+/- 11,778)

Make lang score access O(1) when iterating over char scores

test bench_detect        ... bench:   4,769,341 ns/iter (+/- 78,060)
test bench_detect_script ... bench:      93,748 ns/iter (+/- 706)

test latin alphabet only ... bench:     308,811 ns/iter (+/- 10,769)

greyblake commented 2 years ago

@ManyTheFish @Kerollmops Thank you guys! I do not promise to review this soon, because there is a lot of shit going on with my relatives and friends in Ukraine, and helping them and Ukraine is much higher priority for me at the moment.

greyblake commented 2 years ago

@ManyTheFish Just to let you know, I haven't forgotten about this PR.

greyblake commented 2 years ago

@ManyTheFish @Kerollmops Than you guys! This is very outstanding PR! I've learned something new today :) Would you like to add further improvements? If you want, we can try to arrange a call, I can explain how trigrams work :)

greyblake commented 2 years ago

FYI: the optimization is released in 0.14.0.

ManyTheFish commented 2 years ago

Hey @greyblake! I'm pleased to see this PR merged. 😊

I'll probably come back with a new PR between the 2 and the 5 of May if I have a bit of time to work on it. 🙂

greyblake / whatlang-rs

Optimize latin languages detection #108

Summary

Whatlang benchmarks

main branch

Commits

Replace sort_by

Use inverted mapping between char and Lang

Clamp score in normalization loop instead of creating intermediate vec

Increment a common score when a common char is found

Use binary search instead of iter find

Fix returned raw score

Use intermediate char score

Count Max score in char score Loop

Make lang score access O(1) when iterating over char scores