skvadrik / re2c

Lexer generator for C, C++, Go and Rust.
https://re2c.org

How to support Python style indentation? #397

Closed: lijunchen closed this issue 1 year ago

lijunchen commented 2 years ago

Are there any examples of handling indentation?

For example: test.py

if a:
    if b:
      if c:
         if d:
             pass
         pass
      pass
    pass

$ python3 -m tokenize -e ./test.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,2:            NAME           'if'
1,3-1,4:            NAME           'a'
1,4-1,5:            COLON          ':'
1,5-1,6:            NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
2,4-2,6:            NAME           'if'
2,7-2,8:            NAME           'b'
2,8-2,9:            COLON          ':'
2,9-2,10:           NEWLINE        '\n'
3,0-3,6:            INDENT         '      '
3,6-3,8:            NAME           'if'
3,9-3,10:           NAME           'c'
3,10-3,11:          COLON          ':'
3,11-3,12:          NEWLINE        '\n'
4,0-4,9:            INDENT         '         '
4,9-4,11:           NAME           'if'
4,12-4,13:          NAME           'd'
4,13-4,14:          COLON          ':'
4,14-4,15:          NEWLINE        '\n'
5,0-5,13:           INDENT         '             '
5,13-5,17:          NAME           'pass'
5,17-5,18:          NEWLINE        '\n'
6,9-6,9:            DEDENT         ''
6,9-6,13:           NAME           'pass'
6,13-6,14:          NEWLINE        '\n'
7,6-7,6:            DEDENT         ''
7,6-7,10:           NAME           'pass'
7,10-7,11:          NEWLINE        '\n'
8,4-8,4:            DEDENT         ''
8,4-8,8:            NAME           'pass'
8,8-8,9:            NEWLINE        '\n'
9,0-9,1:            NL             '\n'
10,0-10,0:          DEDENT         ''
10,0-10,0:          ENDMARKER      ''

skvadrik commented 2 years ago

There is no automatic indentation or location handling. You can have a rule with tags surrounding indentation, like this:

    @x space* @y something { indent = (y - x) / 4; ... }
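
To illustrate the idea of measuring indentation as the distance between two positions, here is a minimal Python sketch. It is an analogue of the tagged rule above, not re2c syntax: the regex group boundaries stand in for the tags `@x` and `@y`, and the names `LEADING` and `indent_width` are hypothetical. It assumes indentation consists of spaces only.

```python
import re

# Hypothetical pattern: group 1 captures the run of leading spaces.
LEADING = re.compile(r"( *)")

def indent_width(line: str) -> int:
    """Return the number of leading spaces, computed as y - x."""
    m = LEADING.match(line)
    x, y = m.start(1), m.end(1)  # play the role of the tags @x and @y
    return y - x

# The lines of test.py from the question, up to the innermost statement.
sample = [
    "if a:",
    "    if b:",
    "      if c:",
    "         if d:",
    "             pass",
]
widths = [indent_width(line) for line in sample]
print(widths)  # [0, 4, 6, 9, 13]
```

Note that the widths in the question's sample (4, 6, 9, 13) are not multiples of a fixed step, so dividing by 4 would not recover the nesting depth for that input; the raw width likely has to be compared against previously seen widths instead.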

pmetzger commented 2 years ago

Many parsers handle this by keeping track of the indentation levels in a state variable and emitting synthetic indent and dedent tokens whenever the whitespace at the start of a line differs from the previous indentation level.
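
The approach described above can be sketched in Python as follows. This is an illustration under stated assumptions, not re2c code and not the implementation used by any particular parser: the function name `indent_tokens` and the `LINE` token kind are hypothetical, only spaces count as indentation, blank lines are skipped, and comments, tabs, and bracketed continuation lines are not handled.

```python
def indent_tokens(lines):
    """Yield (kind, text) pairs; INDENT and DEDENT are synthetic tokens."""
    stack = [0]  # indentation widths of the currently open blocks
    for line in lines:
        body = line.lstrip(" ")
        if not body.strip():
            continue  # assumption: blank lines never change indentation
        width = len(line) - len(body)
        if width > stack[-1]:
            # Deeper than the current block: open a new level.
            stack.append(width)
            yield ("INDENT", line[:width])
        else:
            # Shallower: close blocks until a matching level is found.
            while width < stack[-1]:
                stack.pop()
                yield ("DEDENT", "")
            if width != stack[-1]:
                raise ValueError("dedent does not match any outer level")
        yield ("LINE", body.rstrip("\n"))
    # End of input closes every block that is still open.
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT", "")

source = (
    "if a:\n"
    "    if b:\n"
    "      if c:\n"
    "         if d:\n"
    "             pass\n"
    "         pass\n"
    "      pass\n"
    "    pass\n"
    "\n"
)
kinds = [kind for kind, _ in indent_tokens(source.splitlines(keepends=True))]
print(kinds.count("INDENT"), kinds.count("DEDENT"))  # 4 4
```

Keeping a stack of widths rather than a single counter is what lets uneven steps such as 4, 6, 9, 13 work: each dedent only has to match some width already on the stack.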

lijunchen commented 2 years ago

Thanks, all. I found a solution using an indent stack and tags. It works, but there are many corner cases. I will update this issue once I can fully tokenize Python 3.10.

My current solution: https://github.com/lijunchen/pyser/blob/ead8f46a2847905d4757ed194c604d0ca493c2f0/src/tokenizer.re2c

Indent stack: https://matt.might.net/articles/standalone-lexers-with-lex/