softdevteam / grmtools

Rust grammar tool libraries and binaries

lrlex: match beginning of line #449

Closed (liluyue closed this issue 6 months ago)

liluyue commented 6 months ago

I am working on a Markdown parser. Many of its tags, such as "#", only match at the beginning of a line. ^ does not work in lrlex, and (?m)^ does not work either.

ltratt commented 6 months ago

lrlex is, intentionally, fairly simplistic. If you want to lex a whitespace-sensitive language (e.g. one that requires "start of line"), particularly one with some unusual rules such as Markdown's, you'll want to write a hand-written lexer. The good news is that lrpar can happily work with a hand-written lexer: see https://softdevteam.github.io/grmtools/master/book/manuallexer.html.
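
As a rough illustration of the hand-written approach, here is a minimal sketch of a line-aware lexer that only treats "#" as a heading marker at the start of a line. The names `Tok` and `lex` are hypothetical and not part of grmtools, and the sketch does not implement the lexer interface that lrpar expects (the linked book page covers that). It also simplifies Markdown's real heading rules, e.g. it does not require a space after the marker.

```rust
/// Hypothetical token kinds for this sketch (not part of grmtools).
#[derive(Debug, PartialEq)]
enum Tok {
    /// A run of 1 to 6 `#` characters at the start of a line; holds the run length.
    Heading(usize),
    /// The remaining text on a line (everything that is not a heading marker).
    Text(String),
    /// A line break.
    Newline,
}

/// Lex `input` line by line, so "start of line" is known by construction.
fn lex(input: &str) -> Vec<Tok> {
    let mut toks = Vec::new();
    for (n, line) in input.split('\n').enumerate() {
        if n > 0 {
            toks.push(Tok::Newline);
        }
        // Tolerate CRLF input by dropping a trailing '\r'.
        let line = line.strip_suffix('\r').unwrap_or(line);
        // Count leading '#' characters; '#' is one byte, so slicing below is safe.
        let hashes = line.bytes().take_while(|&b| b == b'#').count();
        let rest = if (1..=6).contains(&hashes) {
            toks.push(Tok::Heading(hashes));
            &line[hashes..]
        } else {
            line
        };
        if !rest.is_empty() {
            toks.push(Tok::Text(rest.to_string()));
        }
    }
    toks
}

fn main() {
    // The '#' inside "foo # bar" is mid-line, so it stays part of the text.
    for tok in lex("# Title\nfoo # bar") {
        println!("{:?}", tok);
    }
}
```

Because the lexer walks the input one line at a time, a "#" in the middle of a line is never mistaken for a heading marker.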

ratmice commented 6 months ago

I haven't tried to see what causes it not to work, but I'm a little bit surprised, because the default for RegexBuilder::multi_line within lrlex is true. But I wanted to note that perhaps there are other regex options that affect the behavior.

It is worth noting that since 1.9.x the regex crate has had RegexBuilder::crlf, so I'm curious whether you are perhaps testing with CRLF data? Perhaps we could add support for that option in CTLexerBuilder.

Anyhow, besides a manual lexer, perhaps there are options to RegexBuilder which could be changed to make this work?

liluyue commented 6 months ago

It doesn't work because the string truncation causes the line information to be lost:

[Screenshot 2024-05-13 13 07 56] [Screenshot 2024-05-13 13 08 22]

ratmice commented 6 months ago

Ahh, indeed: seeing the \A in your screenshot, and beginning to recall how the algorithm used in lrlex behaves, it now makes sense why this doesn't just work already.

So indeed it seems like a manual lexer might be the only way to achieve this.
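
To make the truncation problem concrete, here is a small sketch. It assumes, as the screenshots and the comment above suggest, that each rule is matched against only the remaining input (a slice starting at the current offset). The helper `at_line_start` is hypothetical and not part of lrlex. The point is that two different inputs can leave identical remaining slices, so anything that sees only the slice cannot tell a real line start from a mid-line position.

```rust
/// Hypothetical helper: true if byte offset `pos` in `full` is at the start of a line.
/// It needs the full input, not just the remaining slice.
fn at_line_start(full: &str, pos: usize) -> bool {
    pos == 0 || full.as_bytes()[pos - 1] == b'\n'
}

fn main() {
    let mid_line = "x#y"; // '#' at offset 1, in the middle of a line
    let line_start = "x\n#y"; // '#' at offset 2, at the start of a line

    // Once truncated to the current offset, both inputs look the same.
    assert_eq!(&mid_line[1..], &line_start[2..]);

    // Only the full input plus the offset distinguishes the two cases.
    assert!(!at_line_start(mid_line, 1));
    assert!(at_line_start(line_start, 2));
    println!("identical slices, different line context");
}
```

A manual lexer can keep the full input and the current offset around, which is why it can answer the "start of line" question reliably.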