lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

Mandatory ignored tokens and zero-width ignored tokens #1210

Closed pharmpy-dev-123 closed 1 year ago

pharmpy-dev-123 commented 1 year ago

Consider the following two scenarios:

  1. One wants to force the presence of a generally ignored token. For instance, a mandatory space between tokens A and B, where the characters of B cannot be the suffix of A.
  2. One wants to handle "line continuation" character sequences that can be present in the middle of any token. This is not exactly the same as ignoring tokens, since the resulting tokens would no longer satisfy the invariant token == source[token.start_pos:token.end_pos].

Are there any known workarounds for these problems? Both seem to require some sort of custom lexer, but I am asking to make sure I am not overlooking some simple solution.

erezsh commented 1 year ago

Sorry I missed this question.

  1. You can include ignored tokens in your grammar, and then they would be forced to match. Like A _WS B. Maybe you can also use regexp lookaheads ((?=) and (?!) syntax).
  2. I don't think there's a simple solution. You'll have to include those characters in the definition of every token. Later you can filter them out when transforming. In this case, yes, it's probably preferable to write a custom lexer.
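Both suggestions can be illustrated with Python's standard `re` module, whose lookahead syntax is what the answer refers to. This is only a rough sketch, not Lark code: the terminal shapes (A as digits, NAME as lowercase letters), the backslash-newline continuation sequence, and the helper names `match_a` and `lex_name` are all assumptions made for illustration.

```python
import re

# Hypothetical terminal A: a run of digits.
# The positive lookahead (?=\s) demands whitespace right after A
# without consuming it, so the space stays available to the next token.
A_THEN_SPACE = re.compile(r"\d+(?=\s)")


def match_a(text):
    """Return the text matched by A at position 0, or None."""
    m = A_THEN_SPACE.match(text)
    return m.group(0) if m else None


# Assumed continuation sequence: a backslash followed by a newline.
CONT = "\\\n"

# Hypothetical NAME terminal that tolerates the continuation sequence
# between its characters, as suggested in point 2.
NAME = re.compile(r"[a-z](?:\\\n|[a-z])*")


def lex_name(source, pos=0):
    """Match NAME at pos; return (raw_text, cleaned_value, end_pos) or None."""
    m = NAME.match(source, pos)
    if m is None:
        return None
    raw = m.group(0)
    # The "filter it out later" step: drop the continuation sequences.
    return raw, raw.replace(CONT, ""), m.end()
```

Here `match_a("12 ab")` matches `"12"` while `match_a("12ab")` matches nothing. For `lex_name`, the cleaned value differs from the raw source slice, which is exactly the broken invariant mentioned in the question.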

pharmpy-dev-123 commented 1 year ago

@erezsh Thanks for your answer! Will the solution to 1 also work if we have both A _WS B and %ignore WS?

erezsh commented 1 year ago

Yes, iirc it should work.

erezsh commented 1 year ago

(but it should be %ignore _WS)