Closed RustyYato closed 7 months ago
regex_syntax
has a ClassUniciode::is_ascii
method and ClassUnicode::maximum_len
, which seem ideal for this use-case.
I think it would be better to swap to those here, but chose the smaller code-change for now.
If you think updating to regex_sytnax
's versions is OK, then I can make that update as well.
Looks great, thanks! Sadly, benchmarks show a dramatic decrease in performances :/
So, while this is a nice fix, I would be happy not to make Logos much slower ^^' I don't know if that is an unavoidable performance loss, or if we can do better (maybe trying what you said above).
https://github.com/maciejhirsz/logos/actions/runs/7925237413?pr=376
group before changes
----- ------ -------
count_ok/identifiers 1.00 837.5±4.71ns 887.1 MB/sec 1.02 855.6±7.21ns 868.3 MB/sec
count_ok/keywords_operators_and_punctators 1.00 2.4±0.02µs 834.2 MB/sec 1.05 2.6±0.04µs 791.0 MB/sec
count_ok/strings 1.00 516.9±25.99ns 1607.0 MB/sec 2.66 1375.7±239.67ns 603.8 MB/sec
iterate/identifiers 1.00 781.6±48.86ns 950.4 MB/sec 1.08 845.1±4.53ns 879.0 MB/sec
iterate/keywords_operators_and_punctators 1.00 2.5±0.04µs 818.5 MB/sec 1.00 2.5±0.04µs 820.8 MB/sec
iterate/strings 1.00 539.2±5.50ns 1540.4 MB/sec 2.78 1501.5±14.53ns 553.2 MB/sec
In general there is an issue between repeating matches and non-repeating matches, as well as shared logic between the dot .
and exclusion ranges like [^"]
.
Consider these two:
[^"]
- this will take one byte of input and will produce UB same as .
[^"]+
- is currently safe as it will eat all bytes that are not ascii "
, including all multibyte codes.+
is kind of nonsensical in Logos so it's not a problem here.
Now what this PR does is fall back on full unicode checking in both cases. If you can make the check for tail end of range being 0x0010_FFFF
conditional on whether or not the pattern is repeating I believe the performance of the string benchmarks would be unaffected.
@RustyYato I did try your suggestion. On local benchmarks, the impact on performances was much less important, so let's see.
@RustyYato I did try your suggestion. On local benchmarks, the impact on performances was much less important, so let's see.
Just speculating here, but unicode checking expands into a lot of code (more than it needs to and if I were a better programmer it would be smarter), so this is likely to swing a lot based on your icache.
@maciejhirsz indeed...
Weirdly enough, the benchmarks ran with much closer timings on my computer than in GitHub CI. My previous commit does not seem to have changed the benchmarks (or maybe I missed something).
On my computer:
group changes before
----- ------- ------
count_ok/identifiers 1.05 474.3±25.25ns 1566.2 MB/sec 1.00 452.5±18.12ns 1641.9 MB/sec
count_ok/keywords_operators_and_punctators 1.02 1284.0±14.11ns 1582.8 MB/sec 1.00 1253.0±17.62ns 1621.9 MB/sec
count_ok/strings 1.23 620.5±3.93ns 1338.8 MB/sec 1.00 503.4±7.15ns 1649.9 MB/sec
iterate/identifiers 1.05 468.6±16.73ns 1585.5 MB/sec 1.00 448.1±7.56ns 1657.8 MB/sec
iterate/keywords_operators_and_punctators 1.04 1315.6±35.99ns 1544.7 MB/sec 1.00 1259.5±38.74ns 1613.6 MB/sec
iterate/strings 1.27 635.5±21.96ns 1307.0 MB/sec 1.00 501.7±17.36ns 1655.6 MB/sec
In CI:
group before changes
----- ------ -------
count_ok/identifiers 1.02 841.2±5.39ns 883.2 MB/sec 1.00 822.4±53.44ns 903.4 MB/sec
count_ok/keywords_operators_and_punctators 1.00 2.4±0.01µs 834.5 MB/sec 1.03 2.5±0.05µs 806.5 MB/sec
count_ok/strings 1.00 543.7±20.57ns 1527.8 MB/sec 2.74 1490.1±5.38ns 557.4 MB/sec
iterate/identifiers 1.12 838.7±16.82ns 885.8 MB/sec 1.00 748.1±5.80ns 993.1 MB/sec
iterate/keywords_operators_and_punctators 1.00 2.5±0.02µs 814.3 MB/sec 1.05 2.6±0.02µs 777.1 MB/sec
iterate/strings 1.00 522.0±27.49ns 1591.4 MB/sec 2.97 1552.2±15.09ns 535.2 MB/sec
Now what this PR does is fall back on full unicode checking in both cases. If you can make the check for tail end of range being
0x0010_FFFF
conditional on whether or not the pattern is repeating I believe the performance of the string benchmarks would be unaffected.
I'll try to do that :-)
Looks like I managed (?) to fix the performances issues, while keeping the fix :-)
group before changes
----- ------ -------
count_ok/identifiers 1.00 839.1±14.04ns 885.3 MB/sec 1.01 848.7±5.73ns 875.4 MB/sec
count_ok/keywords_operators_and_punctators 1.00 2.5±0.06µs 822.6 MB/sec 1.02 2.5±0.04µs 805.2 MB/sec
count_ok/strings 1.00 517.8±24.07ns 1604.1 MB/sec 1.00 520.0±25.26ns 1597.5 MB/sec
iterate/identifiers 1.00 834.2±18.57ns 890.5 MB/sec 1.01 844.3±7.16ns 879.9 MB/sec
iterate/keywords_operators_and_punctators 1.00 2.5±0.02µs 818.9 MB/sec 1.01 2.5±0.05µs 813.6 MB/sec
iterate/strings 1.03 548.3±15.68ns 1514.9 MB/sec 1.00 533.1±24.37ns 1558.0 MB/sec
Maybe it would be worth to double-check that, @RustyYato or @maciejhirsz.
I think I might also want to put the tests to another file
LGTM aside from the
start
of the range needing to be checked as per comment.
Just made necessary changes in my last commits :-)
Merging this, thank you, @RustyYato and @maciejhirsz for your help!
Thanks! My suggestion wouldn't have affected the runtime performance of Logos, but it looks like you managed to fix that anyways.
Fixes #375
Even when using
str
,regex(".")
cannot use the ASCII fast path because the input string may contain multibyte codepoints. Using the ASCII fast path would split the bytes of these codepoints up, which leads to UB (as soon as the illegal string is witnessed). This UB can be seen in several ways, from capturing the strings (like in #375), usingLexer::remainder
, etc. The only real fix is not to take the ASCII fast path.I've added tests to ensure this isn't hit again in the refactor :slightly_smiling_face:.