Fix unicode dot - Githubissues

RustyYato commented 7 months ago

Fixes #375

Even when using str, regex(".") cannot use the ASCII fast path because the input string may contain multibyte codepoints. Using the ASCII fast path would split the bytes of these codepoints up, which leads to UB (as soon as the illegal string is witnessed). This UB can be seen in several ways, from capturing the strings (like in #375), using Lexer::remainder, etc. The only real fix is not to take the ASCII fast path.

I've added tests to ensure this isn't hit again in the refactor :slightly_smiling_face:.

RustyYato commented 7 months ago

regex_syntax has a ClassUniciode::is_ascii method and ClassUnicode::maximum_len, which seem ideal for this use-case. I think it would be better to swap to those here, but chose the smaller code-change for now. If you think updating to regex_sytnax's versions is OK, then I can make that update as well.

jeertmans commented 7 months ago

Looks great, thanks! Sadly, benchmarks show a dramatic decrease in performances :/

So, while this is a nice fix, I would be happy not to make Logos much slower ^^' I don't know if that is an unavoidable performance loss, or if we can do better (maybe trying what you said above).

https://github.com/maciejhirsz/logos/actions/runs/7925237413?pr=376

group                                         before                                 changes
-----                                         ------                                 -------
count_ok/identifiers                          1.00    837.5±4.71ns   887.1 MB/sec    1.02    855.6±7.21ns   868.3 MB/sec
count_ok/keywords_operators_and_punctators    1.00      2.4±0.02µs   834.2 MB/sec    1.05      2.6±0.04µs   791.0 MB/sec
count_ok/strings                              1.00   516.9±25.99ns  1607.0 MB/sec    2.66  1375.7±239.67ns   603.8 MB/sec
iterate/identifiers                           1.00   781.6±48.86ns   950.4 MB/sec    1.08    845.1±4.53ns   879.0 MB/sec
iterate/keywords_operators_and_punctators     1.00      2.5±0.04µs   818.5 MB/sec    1.00      2.5±0.04µs   820.8 MB/sec
iterate/strings                               1.00    539.2±5.50ns  1540.4 MB/sec    2.78  1501.5±14.53ns   553.2 MB/sec

maciejhirsz commented 7 months ago

In general there is an issue between repeating matches and non-repeating matches, as well as shared logic between the dot . and exclusion ranges like [^"].

Consider these two:

[^"] - this will take one byte of input and will produce UB same as .
[^"]+ - is currently safe as it will eat all bytes that are not ascii ", including all multibyte codes

.+ is kind of nonsensical in Logos so it's not a problem here.

Now what this PR does is fall back on full unicode checking in both cases. If you can make the check for tail end of range being 0x0010_FFFF conditional on whether or not the pattern is repeating I believe the performance of the string benchmarks would be unaffected.

jeertmans commented 7 months ago

@RustyYato I did try your suggestion. On local benchmarks, the impact on performances was much less important, so let's see.

maciejhirsz commented 7 months ago

@RustyYato I did try your suggestion. On local benchmarks, the impact on performances was much less important, so let's see.

Just speculating here, but unicode checking expands into a lot of code (more than it needs to and if I were a better programmer it would be smarter), so this is likely to swing a lot based on your icache.

jeertmans commented 7 months ago

@maciejhirsz indeed...

Weirdly enough, the benchmarks ran with much closer timings on my computer than in GitHub CI. My previous commit does not seem to have changed the benchmarks (or maybe I missed something).

On my computer:

group                                         changes                                before                                                                                                                
-----                                         -------                                ------                                                                                                                
count_ok/identifiers                          1.05   474.3±25.25ns  1566.2 MB/sec    1.00   452.5±18.12ns  1641.9 MB/sec                                                                                   
count_ok/keywords_operators_and_punctators    1.02  1284.0±14.11ns  1582.8 MB/sec    1.00  1253.0±17.62ns  1621.9 MB/sec                                                                                   
count_ok/strings                              1.23    620.5±3.93ns  1338.8 MB/sec    1.00    503.4±7.15ns  1649.9 MB/sec                                                                                   
iterate/identifiers                           1.05   468.6±16.73ns  1585.5 MB/sec    1.00    448.1±7.56ns  1657.8 MB/sec                                                                                   
iterate/keywords_operators_and_punctators     1.04  1315.6±35.99ns  1544.7 MB/sec    1.00  1259.5±38.74ns  1613.6 MB/sec                                                                                   
iterate/strings                               1.27   635.5±21.96ns  1307.0 MB/sec    1.00   501.7±17.36ns  1655.6 MB/sec

In CI:

group                                         before                                 changes
-----                                         ------                                 -------
count_ok/identifiers                          1.02    841.2±5.39ns   883.2 MB/sec    1.00   822.4±53.44ns   903.4 MB/sec
count_ok/keywords_operators_and_punctators    1.00      2.4±0.01µs   834.5 MB/sec    1.03      2.5±0.05µs   806.5 MB/sec
count_ok/strings                              1.00   543.7±20.57ns  1527.8 MB/sec    2.74   1490.1±5.38ns   557.4 MB/sec
iterate/identifiers                           1.12   838.7±16.82ns   885.8 MB/sec    1.00    748.1±5.80ns   993.1 MB/sec
iterate/keywords_operators_and_punctators     1.00      2.5±0.02µs   814.3 MB/sec    1.05      2.6±0.02µs   777.1 MB/sec
iterate/strings                               1.00   522.0±27.49ns  1591.4 MB/sec    2.97  1552.2±15.09ns   535.2 MB/sec

jeertmans commented 7 months ago

Now what this PR does is fall back on full unicode checking in both cases. If you can make the check for tail end of range being 0x0010_FFFF conditional on whether or not the pattern is repeating I believe the performance of the string benchmarks would be unaffected.

I'll try to do that :-)

jeertmans commented 7 months ago

Looks like I managed (?) to fix the performances issues, while keeping the fix :-)

group                                         before                                 changes
-----                                         ------                                 -------
count_ok/identifiers                          1.00   839.1±14.04ns   885.3 MB/sec    1.01    848.7±5.73ns   875.4 MB/sec
count_ok/keywords_operators_and_punctators    1.00      2.5±0.06µs   822.6 MB/sec    1.02      2.5±0.04µs   805.2 MB/sec
count_ok/strings                              1.00   517.8±24.07ns  1604.1 MB/sec    1.00   520.0±25.26ns  1597.5 MB/sec
iterate/identifiers                           1.00   834.2±18.57ns   890.5 MB/sec    1.01    844.3±7.16ns   879.9 MB/sec
iterate/keywords_operators_and_punctators     1.00      2.5±0.02µs   818.9 MB/sec    1.01      2.5±0.05µs   813.6 MB/sec
iterate/strings                               1.03   548.3±15.68ns  1514.9 MB/sec    1.00   533.1±24.37ns  1558.0 MB/sec

Maybe it would be worth to double-check that, @RustyYato or @maciejhirsz.

I think I might also want to put the tests to another file

jeertmans commented 7 months ago

LGTM aside from the start of the range needing to be checked as per comment.

Just made necessary changes in my last commits :-)

jeertmans commented 7 months ago

Merging this, thank you, @RustyYato and @maciejhirsz for your help!

RustyYato commented 7 months ago

Thanks! My suggestion wouldn't have affected the runtime performance of Logos, but it looks like you managed to fix that anyways.

maciejhirsz / logos

Fix unicode dot #376