unicode-rs / unicode-segmentation

Grapheme Cluster and Word boundaries according to UAX#29 rules
https://unicode-rs.github.io/unicode-segmentation

Benchmark other methods mentioned in README #97

Closed timClicks closed 3 years ago

timClicks commented 3 years ago

The word boundary extensions to &str/String perform similarly, but not identically, to .graphemes(). For example, Mandarin is fast(ish) on .graphemes() but slow(ish) on the word boundary methods, whereas languages with whitespace-delimited words tend to have similar performance characteristics across all of these methods.

As the library develops, it would be worthwhile to monitor the speed of the rest of the documented API.
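
As a rough illustration of what such monitoring involves, here is a minimal std-only timing sketch. It is not the repository's benchmark harness: `time_per_iter` and `mb_per_s` are hypothetical helper names, and `split_whitespace` is only a stand-in workload so that the example needs no external crate. In practice the closure would call one of the crate's methods instead, for example `s.graphemes(true).count()`.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `f` over `text` `iters` times and returns the mean time per iteration.
/// `iters` must be non-zero. (Hypothetical helper, not part of any crate.)
fn time_per_iter<F: FnMut(&str) -> usize>(text: &str, iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from discarding the work.
        black_box(f(black_box(text)));
    }
    start.elapsed() / iters
}

/// Throughput in MB/s (1 MB = 10^6 bytes) for `bytes` processed in `ns` nanoseconds.
/// bytes/ns is GB/s, so multiplying by 1000 gives MB/s.
fn mb_per_s(bytes: u64, ns: u64) -> u64 {
    bytes * 1000 / ns.max(1)
}

fn main() {
    // Assumed sample input; real benchmarks would load per-language text files.
    let text = "The quick brown fox jumps over the lazy dog. ".repeat(1_000);

    // Stand-in workload: swap in the segmentation method under test.
    let per_iter = time_per_iter(&text, 100, |s| s.split_whitespace().count());

    let ns = per_iter.as_nanos() as u64;
    println!("{} ns/iter = {} MB/s", ns, mb_per_s(text.len() as u64, ns));
}
```

A single mean like this is much noisier than a proper benchmark harness, so it is only useful for spotting large regressions.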

Out of interest, here are the results of local benchmarking:

     Running unittests (target/release/deps/graphemes-564b84453b2889b6)

running 8 tests
test graphemes_arabic      ... bench:     569,134 ns/iter (+/- 67,321) = 88 MB/s
test graphemes_english     ... bench:     772,797 ns/iter (+/- 72,533) = 64 MB/s
test graphemes_hindi       ... bench:     557,920 ns/iter (+/- 62,334) = 88 MB/s
test graphemes_japanese    ... bench:     592,961 ns/iter (+/- 98,301) = 85 MB/s
test graphemes_korean      ... bench:   1,069,377 ns/iter (+/- 152,167) = 46 MB/s
test graphemes_mandarin    ... bench:     418,928 ns/iter (+/- 47,041) = 120 MB/s
test graphemes_russian     ... bench:     582,695 ns/iter (+/- 64,659) = 87 MB/s
test graphemes_source_code ... bench:     837,820 ns/iter (+/- 103,583) = 59 MB/s

test result: ok. 0 passed; 0 failed; 0 ignored; 8 measured

     Running unittests (target/release/deps/unicode_words-ae63859a9debc323)

running 8 tests
test unicode_words_arabic      ... bench:     654,021 ns/iter (+/- 84,369) = 76 MB/s
test unicode_words_english     ... bench:   1,123,919 ns/iter (+/- 126,888) = 44 MB/s
test unicode_words_hindi       ... bench:     529,154 ns/iter (+/- 123,820) = 93 MB/s
test unicode_words_japanese    ... bench:   1,225,327 ns/iter (+/- 97,295) = 41 MB/s
test unicode_words_korean      ... bench:     620,752 ns/iter (+/- 44,833) = 80 MB/s
test unicode_words_mandarin    ... bench:   1,166,284 ns/iter (+/- 81,349) = 43 MB/s
test unicode_words_russian     ... bench:     700,773 ns/iter (+/- 72,376) = 72 MB/s
test unicode_words_source_code ... bench:   1,212,000 ns/iter (+/- 61,977) = 41 MB/s

test result: ok. 0 passed; 0 failed; 0 ignored; 8 measured

     Running unittests (target/release/deps/word_bounds-e8efa40319028a56)

running 8 tests
test word_bounds_arabic      ... bench:     526,613 ns/iter (+/- 167,663) = 95 MB/s
test word_bounds_english     ... bench:     986,120 ns/iter (+/- 150,942) = 50 MB/s
test word_bounds_hindi       ... bench:     405,190 ns/iter (+/- 75,288) = 122 MB/s
test word_bounds_japanese    ... bench:     819,973 ns/iter (+/- 104,399) = 61 MB/s
test word_bounds_korean      ... bench:     475,308 ns/iter (+/- 50,833) = 105 MB/s
test word_bounds_mandarin    ... bench:     735,141 ns/iter (+/- 111,634) = 68 MB/s
test word_bounds_russian     ... bench:     544,748 ns/iter (+/- 113,632) = 93 MB/s
test word_bounds_source_code ... bench:   1,129,101 ns/iter (+/- 361,817) = 44 MB/s

test result: ok. 0 passed; 0 failed; 0 ignored; 8 measured
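
To make the per-language differences easier to read, the sketch below divides each word-method time by the corresponding `graphemes` time. The ns/iter figures are copied from the results above. The `RESULTS` table and the `relative` helper are illustrative names, and the comparison assumes each language uses the same input text in all three benchmark files.

```rust
/// ns/iter copied from the results above:
/// (language, graphemes, unicode_words, word_bounds).
const RESULTS: [(&str, u64, u64, u64); 8] = [
    ("arabic", 569_134, 654_021, 526_613),
    ("english", 772_797, 1_123_919, 986_120),
    ("hindi", 557_920, 529_154, 405_190),
    ("japanese", 592_961, 1_225_327, 819_973),
    ("korean", 1_069_377, 620_752, 475_308),
    ("mandarin", 418_928, 1_166_284, 735_141),
    ("russian", 582_695, 700_773, 544_748),
    ("source_code", 837_820, 1_212_000, 1_129_101),
];

/// Ratio of a method's time to the baseline time; above 1.0 means slower.
fn relative(ns: u64, baseline_ns: u64) -> f64 {
    ns as f64 / baseline_ns as f64
}

fn main() {
    for &(lang, graphemes, words, bounds) in RESULTS.iter() {
        println!(
            "{:<12} unicode_words x{:.2}  word_bounds x{:.2}",
            lang,
            relative(words, graphemes),
            relative(bounds, graphemes)
        );
    }
}
```

On these numbers, Mandarin's `unicode_words` run takes roughly 2.8 times as long as `graphemes`, while Korean's takes roughly 0.58 times as long.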