osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro
MIT License
63 stars 7 forks source link

Overlapping ranges break lexing #2

Closed osa1 closed 3 years ago

osa1 commented 3 years ago

Here's the test:

lexer! {
    Lexer -> usize;

    ['a'-'b'] '1' = 1,
    ['a'-'c'] '2' => 2,
}

let mut lexer = Lexer::new("a1");
assert_eq!(ignore_pos(lexer.next()), Some(Ok(1)));
assert_eq!(lexer.next(), None);

let mut lexer = Lexer::new("a2");
assert_eq!(ignore_pos(lexer.next()), Some(Ok(2)));
assert_eq!(lexer.next(), None);

Lexing a1 works fine but we can't lex a2. I don't remember the code for adding ranges now and I didn't check yet, but I think bug is in handling of ranges: we implicitly assume character ranges don't overlap. In the example above, when adding the range 'a'-'c', we need to consider existing ranges at the same state ('a'-'b'), and make sure that the overlapping parts of the ranges will non-deterministically switch to the states of all of the overlapping ranges. So when 'a'-'b' matches we need to go to states for both '1' and '2'.

osa1 commented 3 years ago

A simple way to fix this would be to "desugar" ranges into character sets. So if I have ['a'-'c'] that would be desugared to ['a' 'b' 'c']. Now we don't have to deal with ranges and range overlap checks anymore.

If we do this we will have to be smarter in the backend, as a naive implementation will crate a branch in the generated code for each character in the range. So if I have ['a'-z'], assuming no overlaps, what would previously be one range test (char >= 'a' && char <= 'z') will now be 26 tests. Also, the intermediate representation of the NFAs and DFAs will be much larger for the same reason.

osa1 commented 3 years ago

The issue is fixed in the range_set branch. I implemented a simple "range map" that maps ranges into values. Insertion allows overlaps, so if I have a range (10, 20) mapped to x and add another range (15, 25) to y, I get three ranges:

Using this type instead of the previous FxHashMap<(char, char), StateSet> automatically fixes the issue as the new type handles overlapping ranges by taking union of the overlapping range's values as the value of the overlap.

Only concern is the code is fairly complicated so I'm not that sure about the correctness. I have a few unit tests and they all pass, but ideally I'd like some kind of proptest tests.