jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0

Add a "contains" fast-path to `like_utf8_scalar` #1582

Closed RyanMarcus closed 1 year ago

RyanMarcus commented 1 year ago

This PR uses `memchr` to add a fast path to `like_utf8_scalar` for patterns that can be processed as a "contains" query. For example, if the pattern is `%ABBA%`, we can check whether each string contains `ABBA` instead of building a regular expression.
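As a rough illustration of the idea (not the code in this PR), the fast path boils down to two steps: decide whether the pattern is a plain `%needle%`, and if so run a substring search on every value. The sketch below is dependency-free, so it uses `str::contains` from the standard library as a stand-in for the `memchr` substring search. The helper names `contains_needle` and `like_contains` are hypothetical, and treating a backslash as disqualifying is a conservative assumption about escape handling.

```rust
/// Returns the literal needle if `pattern` has the form `%needle%` with no
/// other wildcards, i.e. if it can be answered as a plain "contains" query.
/// Hypothetical helper for illustration only.
fn contains_needle(pattern: &str) -> Option<&str> {
    // The pattern must start and end with `%` (two distinct characters).
    let inner = pattern.strip_prefix('%')?.strip_suffix('%')?;
    // Any remaining wildcard (`%`, `_`) or escape character (assumed to be
    // `\`) means the pattern needs the general regex path.
    if inner.contains(|c: char| matches!(c, '%' | '_' | '\\')) {
        return None;
    }
    Some(inner)
}

/// Evaluates `value LIKE pattern` for every value when the fast path applies.
/// Returns `None` when the caller should fall back to the regex path.
fn like_contains(values: &[&str], pattern: &str) -> Option<Vec<bool>> {
    let needle = contains_needle(pattern)?;
    // `str::contains` stands in for the memchr-based substring search here.
    Some(values.iter().map(|v| v.contains(needle)).collect())
}
```

A caller would try `like_contains` first and only build the regular expression when it returns `None`, so patterns such as `AB_A%` keep their existing behavior.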

To measure the performance improvement from this fast path, I added a benchmark. Here are the results on my machine:

| Length | regex | memchr |
|---|---|---|
| 2^16 | 63.5 µs | 0.88 µs |
| 2^17 | 68.1 µs | 1.04 µs |
| 2^18 | 72.4 µs | 1.05 µs |
| 2^19 | 76.7 µs | 1.11 µs |
| 2^20 | 81.3 µs | 1.13 µs |

Since `memchr` uses state-of-the-art SIMD tricks (as far as I know), this technique should be faster for "contains" queries than even the glob-matching suggestion in #1295.

codecov[bot] commented 1 year ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (9a26422) 83.38% compared to head (b8fbe12) 83.39%.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main    #1582   +/-   ##
=======================================
  Coverage   83.38%   83.39%
=======================================
  Files         391      391
  Lines       42983    42993   +10
=======================================
+ Hits        35841    35853   +12
+ Misses       7142     7140    -2
```

| [Files](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1582?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao) | Coverage Δ | |
|---|---|---|
| [src/compute/like.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1582?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2NvbXB1dGUvbGlrZS5ycw==) | `64.73% <100.00%> (+1.95%)` | :arrow_up: |

... and [6 files with indirect coverage changes](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1582/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao)
