Closed hurricup closed 1 year ago
Hm, that does sound wrong, [^\s]
should not match space. Thanks for reporting that, will investigate.
So, with just a single [^\s]+
everything appears to be working as expected, but with [^\n\-\s%]+
if there is also a .
default rule present in the spec, I get an error at generation time, which is definitely a problem.
Not sure yet what is going on, but this is a minimal failing example:
%%
%%
[^\n\s] { }
. { }
The presence of all of \n
, \s
, .
seems to be important, removing any of them leads to correct output.
Generator error looks like we're trying to access/emit a char class that doesn't exist:
Index -1 out of bounds for length 4
java.lang.IndexOutOfBoundsException: Index -1 out of bounds for length 4
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:100)
at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106)
at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302)
at java.base/java.util.Objects.checkIndex(Objects.java:385)
at java.base/java.util.ArrayList.get(ArrayList.java:427)
at jflex.core.unicode.CharClasses.getIntervals(CharClasses.java:397)
at jflex.core.unicode.CharClasses.computeTables(CharClasses.java:414)
at jflex.core.unicode.CharClasses.getTables(CharClasses.java:466)
at jflex.generator.Emitter.emitCharMapTables(Emitter.java:596)
at jflex.generator.Emitter.emit(Emitter.java:1384)
at jflex.generator.LexGenerator.generate(LexGenerator.java:111)
at jflex.Main.generate(Main.java:341)
at jflex.Main.main(Main.java:357)
Char class generation seems to have a problem -- in class 2 10
occurs twice, which should be impossible:
CharClasses:
class 0:
{ [0-8][14-31]['!'-132][134-159][161-5759][5761-8191][8203-8231][8234-8238][8240-8286][8288-12287][12289-55295][57344-1114111] }
class 1:
{ [9][' '][160][5760][8192-8202][8239][8287][12288] }
class 2:
{ [10-13][10][133][8232-8233] }
class 3:
{ [14-9][11-13][55296-57343] }
The bug is in the (new) code for normalisation of character classes. It uses a version of set subtraction (a - b) that is only safe when b is contained in a, which is not always the case.
In particular, the bug will trigger when we negate a character class that has overlapping contents, i.e. the \n
is contained in \s
in the failing test case, which means that the code first removes \n
, and then is attempting to remove \s
which is attempting to remove \n
again. The operation as such succeeds, but leaves the set in an inconsistent state. That state only triggers a visible error in the next set operation that relies on the set invariants, which is why we need the second rule containing .
to trigger anything visible (.
having a non-empty intersection with [^\n\s]
).
This also explains why replacing \s
with just space fixes the problem, because the content elements of the negated character class don't intersect any more.
The fix should be fairly straightforward.
I'm currently migrating from the 1.7.0 to 1.9.0 and my test uncovered that in 1.7.0
[^\n\-\s%]+
did not matched spaces, but in 1.9.0 it does. And now I need to write it like[^\n\- \t%]+
And this feels wrong.https://regex101.com/r/eOBDPX/1