greyblake / whatlang-rs

Natural language detection library for Rust. Try demo online: https://whatlang.org/
https://whatlang.org/
MIT License
969 stars 109 forks source link

Optimize latin languages detection #108

Closed ManyTheFish closed 2 years ago

ManyTheFish commented 2 years ago

Summary

Optimize alphabet_calculate_scores function used during Latin Language detection.

Compute the score in two steps:

This avoids imbricated loops that make the compute complexity quadratic.

For now, I didn't do anything on the trigrams part, the behavior is more complicated to understand. πŸ˜… But, I will probably try to optimize it in another PR.

Whatlang benchmarks

main branch

test bench_detect        ... bench:   8,120,987 ns/iter (+/- 119,807)
test bench_detect_script ... bench:     102,829 ns/iter (+/- 1,600)

test latin alphabet only ... bench:   3,533,305 ns/iter (+/- 102,068)

Commits

Replace sort_by

test bench_detect        ... bench:   7,984,651 ns/iter (+/- 82,686)
test bench_detect_script ... bench:      93,662 ns/iter (+/- 2,249)

test latin alphabet only ... bench:   3,542,116 ns/iter (+/- 45,714)

Use inverted mapping between char and Lang

test bench_detect        ... bench:   7,642,256 ns/iter (+/- 91,700)
test bench_detect_script ... bench:      94,344 ns/iter (+/- 2,501)

test latin alphabet only ... bench:   3,221,397 ns/iter (+/- 32,611)

Clamp score in normalization loop instead of creating intermediate vec

test bench_detect        ... bench:   7,594,489 ns/iter (+/- 95,791)
test bench_detect_script ... bench:      98,816 ns/iter (+/- 2,039)

test latin alphabet only ... bench:   3,333,266 ns/iter (+/- 29,036)

Increment a common score when a common char is found

test bench_detect        ... bench:   5,190,589 ns/iter (+/- 73,054)
test bench_detect_script ... bench:      98,731 ns/iter (+/- 1,395)

test latin alphabet only ... bench:     846,625 ns/iter (+/- 13,893)

Use binary search instead of iter find

test bench_detect        ... bench:   4,992,537 ns/iter (+/- 77,899)
test bench_detect_script ... bench:      99,535 ns/iter (+/- 5,190)

test latin alphabet only ... bench:     575,347 ns/iter (+/- 7,259)

Fix returned raw score

test bench_detect        ... bench:   4,999,493 ns/iter (+/- 69,303)
test bench_detect_script ... bench:      93,847 ns/iter (+/- 1,339)

test latin alphabet only ... bench:     580,022 ns/iter (+/- 7,469)

Use intermediate char score

test bench_detect        ... bench:   4,802,885 ns/iter (+/- 73,186)
test bench_detect_script ... bench:      93,860 ns/iter (+/- 2,912)

test latin alphabet only ... bench:     388,278 ns/iter (+/- 4,427)

Count Max score in char score Loop

test bench_detect        ... bench:   4,837,791 ns/iter (+/- 71,111)
test bench_detect_script ... bench:      94,130 ns/iter (+/- 715)

test latin alphabet only ... bench:     354,250 ns/iter (+/- 11,778)

Make lang score access O(1) when iterating over char scores

test bench_detect        ... bench:   4,769,341 ns/iter (+/- 78,060)
test bench_detect_script ... bench:      93,748 ns/iter (+/- 706)

test latin alphabet only ... bench:     308,811 ns/iter (+/- 10,769)
greyblake commented 2 years ago

@ManyTheFish @Kerollmops Thank you guys! I do not promise to review this soon, because there is a lot of shit going on with my relatives and friends in Ukraine, and helping them and Ukraine is much higher priority for me at the moment.

greyblake commented 2 years ago

@ManyTheFish Just to let you know, I haven't forgotten about this PR.

greyblake commented 2 years ago

@ManyTheFish @Kerollmops Than you guys! This is very outstanding PR! I've learned something new today :) Would you like to add further improvements? If you want, we can try to arrange a call, I can explain how trigrams work :)

greyblake commented 2 years ago

FYI: the optimization is released in 0.14.0.

ManyTheFish commented 2 years ago

Hey @greyblake! I'm pleased to see this PR merged. 😊

I'll probably come back with a new PR between the 2 and the 5 of May if I have a bit of time to work on it. πŸ™‚