This PR changes the Latin camelCase segmentation to address the following cases:
openSSL should be segmented as ['open', 'SSL']
openSSLError should be segmented as ['open', 'SSL', 'Error']
This is done by introducing a helper iteration for the main iterator to "peek" the next char, and this is needed in the currently solution with StrGroupBy::linear_group_by afaik since the segmentation sometimes depends on the existence of a lowercase letter after the one currently being analyzed ( like openSSLError )
Benchmarks
the benchmarks ran once WITH the changes, and once more after commenting the changes out. The output here is from after removing the changes
cargo bench
And what I believe is the relevant output is the following
segment/132/Latin/Eng time: [2.5651 µs 2.5714 µs 2.5786 µs]
change: [+0.6549% +0.8593% +1.0487%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low mild
2 (2.00%) high mild
3 (3.00%) high severe
segment/132/Latin/Fra time: [2.3521 µs 2.3547 µs 2.3572 µs]
change: [-0.3826% -0.0443% +0.2662%] (p = 0.79 > 0.05)
No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) low mild
segment/131/Latin/Deu time: [2.1346 µs 2.1372 µs 2.1403 µs]
change: [+0.4418% +0.5896% +0.7584%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe
segment/363/Latin/Eng time: [6.6601 µs 6.6749 µs 6.6911 µs]
change: [+3.1214% +3.4133% +3.7267%] (p = 0.00 < 0.05)
Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
8 (8.00%) low mild
3 (3.00%) high mild
7 (7.00%) high severe
segment/363/Latin/Fra time: [6.5405 µs 6.5481 µs 6.5566 µs]
change: [-0.4942% -0.2417% +0.0139%] (p = 0.07 > 0.05)
No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
5 (5.00%) high mild
1 (1.00%) high severe
segment/365/Latin/Vie time: [5.6049 µs 5.6109 µs 5.6170 µs]
change: [-4.2406% -3.9811% -3.7043%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low mild
4 (4.00%) high mild
tokenize/132/Latin/Eng time: [6.1291 µs 6.1349 µs 6.1406 µs]
change: [-1.4925% -1.3344% -1.1740%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
tokenize/132/Latin/Fra time: [6.9543 µs 6.9617 µs 6.9696 µs]
change: [-1.5821% -1.3067% -1.0380%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
3 (3.00%) low mild
4 (4.00%) high mild
1 (1.00%) high severe
tokenize/131/Latin/Deu time: [6.4793 µs 6.4916 µs 6.5050 µs]
change: [-3.1861% -2.9351% -2.6852%] (p = 0.00 < 0.05)
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) low mild
tokenize/363/Latin/Eng time: [15.510 µs 15.523 µs 15.535 µs]
change: [-8.2776% -7.9652% -7.6477%] (p = 0.00 < 0.05)
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
9 (9.00%) high mild
tokenize/363/Latin/Fra time: [18.026 µs 18.044 µs 18.064 µs]
change: [-2.4157% -1.5634% -1.0380%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
2 (2.00%) low mild
3 (3.00%) high mild
tokenize/365/Latin/Vie time: [27.604 µs 27.626 µs 27.651 µs]
change: [-2.5536% -1.9732% -1.4007%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe
tokenize/354/Latin/Deu time: [17.874 µs 17.904 µs 17.936 µs]
change: [+2.2411% +2.4528% +2.6719%] (p = 0.00 < 0.05)
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
PR checklist
Please check if your PR fulfills the following requirements:
[x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
[x] Have you read the contributing guidelines?
[x] Have you made sure that the title is accurate and descriptive of the changes?
Pull Request
All interesting changes are in charabia/src/segmenter/latin/camel_case.rs other changes are import reordering caused by cargo fmt.
This change definitely looks like a small speed regression, but I will be opening for review nonetheless.
First time contributor to any meilisearch repo.
Related issue
Fixes #289
What does this PR do?
This PR changes the Latin camelCase segmentation to address the following cases:
This is done by introducing a helper iteration for the main iterator to "peek" the next char, and this is needed in the currently solution with
StrGroupBy::linear_group_by
afaik since the segmentation sometimes depends on the existence of a lowercase letter after the one currently being analyzed ( like openSSLError )Benchmarks
the benchmarks ran once WITH the changes, and once more after commenting the changes out. The output here is from after removing the changes
And what I believe is the relevant output is the following
PR checklist
Please check if your PR fulfills the following requirements: