jflex-de / jflex

The fast scanner generator for Java™ with full Unicode support
http://jflex.de

Whitespaces negation in group not working as expected #1065

Closed hurricup closed 1 year ago

hurricup commented 1 year ago

I'm currently migrating from 1.7.0 to 1.9.0, and my test uncovered that in 1.7.0 [^\n\-\s%]+ did not match spaces, but in 1.9.0 it does. Now I need to write it as [^\n\- \t%]+, which feels wrong.

https://regex101.com/r/eOBDPX/1

lsf37 commented 1 year ago

Hm, that does sound wrong, [^\s] should not match space. Thanks for reporting that, will investigate.

lsf37 commented 1 year ago

So, with just a single [^\s]+ everything appears to work as expected. With [^\n\-\s%]+, however, if a . default rule is also present in the spec, I get an error at generation time, which is definitely a problem.

Not sure yet what is going on, but this is a minimal failing example:

%%
%%

[^\n\s]  {  }
.        {  }

The presence of all three of \n, \s, and . seems to be important: removing any one of them leads to correct output.

Generator error looks like we're trying to access/emit a char class that doesn't exist:

Index -1 out of bounds for length 4
java.lang.IndexOutOfBoundsException: Index -1 out of bounds for length 4
    at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:100)
    at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106)
    at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302)
    at java.base/java.util.Objects.checkIndex(Objects.java:385)
    at java.base/java.util.ArrayList.get(ArrayList.java:427)
    at jflex.core.unicode.CharClasses.getIntervals(CharClasses.java:397)
    at jflex.core.unicode.CharClasses.computeTables(CharClasses.java:414)
    at jflex.core.unicode.CharClasses.getTables(CharClasses.java:466)
    at jflex.generator.Emitter.emitCharMapTables(Emitter.java:596)
    at jflex.generator.Emitter.emit(Emitter.java:1384)
    at jflex.generator.LexGenerator.generate(LexGenerator.java:111)
    at jflex.Main.generate(Main.java:341)
    at jflex.Main.main(Main.java:357)

lsf37 commented 1 year ago

Char class generation seems to have a problem. In class 2, code point 10 occurs twice (in [10-13] and again as [10]), which should be impossible:

CharClasses:
class 0:
{ [0-8][14-31]['!'-132][134-159][161-5759][5761-8191][8203-8231][8234-8238][8240-8286][8288-12287][12289-55295][57344-1114111] }
class 1:
{ [9][' '][160][5760][8192-8202][8239][8287][12288] }
class 2:
{ [10-13][10][133][8232-8233] }
class 3:
{ [14-9][11-13][55296-57343] }
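
The inconsistency in the dump above can also be detected mechanically. The following is only a sketch: CharClassCheckSketch and problems are hypothetical names, not part of JFlex, and the check looks at one class at a time (it does not compare intervals across classes). It flags intervals whose start is greater than their end and intervals that overlap an earlier one in the same class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not JFlex code.
class CharClassCheckSketch {

    /** Describes each inverted or overlapping interval within a single char class. */
    static List<String> problems(int[][] intervals) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < intervals.length; i++) {
            int[] cur = intervals[i];
            if (cur[0] > cur[1]) {
                // An interval must satisfy start <= end.
                found.add("inverted [" + cur[0] + "-" + cur[1] + "]");
                continue;
            }
            for (int j = 0; j < i; j++) {
                int[] prev = intervals[j];
                boolean prevValid = prev[0] <= prev[1];
                // Two closed intervals overlap if each starts before the other ends.
                if (prevValid && cur[0] <= prev[1] && prev[0] <= cur[1]) {
                    found.add("overlap [" + prev[0] + "-" + prev[1] + "] / ["
                            + cur[0] + "-" + cur[1] + "]");
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Intervals copied from class 2 and class 3 of the dump.
        int[][] class2 = {{10, 13}, {10, 10}, {133, 133}, {8232, 8233}};
        int[][] class3 = {{14, 9}, {11, 13}, {55296, 57343}};
        System.out.println(problems(class2)); // prints [overlap [10-13] / [10-10]]
        System.out.println(problems(class3)); // prints [inverted [14-9]]
    }
}
```

Besides the duplicate 10 in class 2, the same check reports the interval [14-9] in class 3, whose start is greater than its end.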

lsf37 commented 1 year ago

The bug is in the (new) code for normalisation of character classes. It uses a version of set subtraction (a - b) that is only safe when b is contained in a, which is not always the case.

In particular, the bug triggers when we negate a character class whose elements overlap. In the failing test case, \n is contained in \s, so the code first removes \n and then removes \s, which tries to remove \n a second time. The operation itself succeeds, but it leaves the set in an inconsistent state. That state only produces a visible error in the next set operation that relies on the set invariants, which is why the second rule containing . is needed to trigger anything visible (. has a non-empty intersection with [^\n\s]).

This also explains why replacing \s with just space fixes the problem, because the content elements of the negated character class don't intersect any more.

The fix should be fairly straightforward.
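
For illustration, here is a minimal sketch of an interval-set subtraction (a - b) that does not assume b is contained in a. It is not JFlex's actual code or the actual fix: IntervalSetSketch, subtract, and format are hypothetical names, sets are modelled as sorted lists of disjoint closed intervals, and the demo restricts itself to ASCII with an assumed whitespace range of 9 to 13 plus the space character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch for illustration; not JFlex code.
class IntervalSetSketch {

    /**
     * Returns a - b. Both inputs are sorted lists of disjoint closed
     * intervals {start, end}. b does not have to be contained in a.
     */
    static List<int[]> subtract(List<int[]> a, List<int[]> b) {
        List<int[]> result = new ArrayList<>();
        for (int[] x : a) {
            int lo = x[0]; // first code point of x not yet handled
            for (int[] y : b) {
                if (y[1] < lo) continue;  // y lies entirely before the remainder
                if (y[0] > x[1]) break;   // y and all later intervals lie after x
                if (y[0] > lo) result.add(new int[] {lo, y[0] - 1});
                lo = y[1] + 1;            // skip past the removed part
                if (lo > x[1]) break;     // nothing of x is left
            }
            if (lo <= x[1]) result.add(new int[] {lo, x[1]});
        }
        return result;
    }

    /** Formats a set as e.g. "[0-8][14-31]". */
    static String format(List<int[]> set) {
        StringBuilder sb = new StringBuilder();
        for (int[] iv : set) {
            sb.append('[').append(iv[0]).append('-').append(iv[1]).append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<int[]> ascii = List.of(new int[] {0, 127});
        List<int[]> newline = List.of(new int[] {10, 10});
        // Assumed ASCII part of \s for this demo: 9-13 and space (32).
        List<int[]> whitespace = List.of(new int[] {9, 13}, new int[] {32, 32});

        List<int[]> step1 = subtract(ascii, newline);
        // whitespace still contains 10, which step1 no longer does.
        List<int[]> step2 = subtract(step1, whitespace);

        System.out.println(format(step1)); // prints [0-9][11-127]
        System.out.println(format(step2)); // prints [0-8][14-31][33-127]
    }
}
```

The second call removes a set that is not contained in its first argument, which mirrors the \n then \s order described above, and the result is still a sorted list of disjoint intervals.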