lcompilers / lpython

Python compiler
https://lpython.org/

Write a re2c+Bison parser for Python #298

Open certik opened 2 years ago

certik commented 2 years ago

The re2c tokenizer should track the indentation level and insert INDENT / DEDENT tokens. Each INDENT will have a matching DEDENT.
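A rough sketch of the indentation tracking described above, written in Python rather than re2c so it is easy to run. It is not the LPython implementation: the function name `indent_tokens` and the token tuples are hypothetical, and it assumes space-only indentation with no line continuations or open brackets.

```python
def indent_tokens(lines):
    """Yield ("INDENT" | "DEDENT" | "LINE", payload) tuples for the given lines.

    Simplified sketch: spaces only, no tabs, no continuation lines,
    no bracket nesting.
    """
    stack = [0]  # indentation widths of the currently open blocks
    for line in lines:
        stripped = line.lstrip(" ")
        if not stripped.strip() or stripped.startswith("#"):
            continue  # blank and comment-only lines do not change indentation
        width = len(line) - len(stripped)
        if width > stack[-1]:
            # Deeper than the current block: open exactly one new level.
            stack.append(width)
            yield ("INDENT", width)
        while width < stack[-1]:
            # Shallower: close blocks until we are back at a known level.
            stack.pop()
            yield ("DEDENT", width)
        if width != stack[-1]:
            raise IndentationError("unindent does not match any outer level")
        yield ("LINE", stripped.rstrip("\n"))
    # Close every block still open at end of input, so that each INDENT
    # has a matching DEDENT.
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT", 0)


if __name__ == "__main__":
    source = ["def f():\n", "    return 5\n", "\n", "f()\n"]
    for token in indent_tokens(source):
        print(token)
```

The stack is what guarantees the pairing: every push emits one INDENT and every pop emits one DEDENT, and the final loop drains whatever is left at end of input.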

It should return exactly the same tokens as CPython does:

Links:

Thirumalai-Shaktivel commented 2 years ago

https://github.com/yuangao2/Python_Interpreter/blob/master/includes/scan.l
https://github.com/python/cpython/blob/main/Parser/token.c
https://github.com/python/cpython/tree/main/Parser

Thirumalai-Shaktivel commented 2 years ago

https://github.com/python/cpython/blob/main/Include/token.h

certik commented 2 years ago

After #310 is merged, here are the next steps:

It doesn't have to be perfect, it won't be perfect, but I would definitely do the above three things. Then we can move to the parser and start parsing it. And then we'll iterate on the design of the tokenizer as needed.

certik commented 2 years ago

Here is how to access Python's tokenization:

$ cat a.py 
def f():
    "aabcd"
    return 5

f()
$ python -m tokenize a.py   
0,0-0,0:            ENCODING       'utf-8'        
1,0-1,3:            NAME           'def'          
1,4-1,5:            NAME           'f'            
1,5-1,6:            OP             '('            
1,6-1,7:            OP             ')'            
1,7-1,8:            OP             ':'            
1,8-1,9:            NEWLINE        '\n'           
2,0-2,4:            INDENT         '    '         
2,4-2,11:           STRING         '"aabcd"'      
2,11-2,12:          NEWLINE        '\n'           
3,4-3,10:           NAME           'return'       
3,11-3,12:          NUMBER         '5'            
3,12-3,13:          NEWLINE        '\n'           
4,0-4,1:            NL             '\n'           
5,0-5,0:            DEDENT         ''             
5,0-5,1:            NAME           'f'            
5,1-5,2:            OP             '('            
5,2-5,3:            OP             ')'            
5,3-5,4:            NEWLINE        '\n'           
6,0-6,0:            ENDMARKER      ''             
$ lpython --show-tokens a.py
(KEYWORD "def") 0:2
(TOKEN "identifier" f) 4:4
(TOKEN "(") 5:5
(TOKEN ")") 6:6
(TOKEN ":") 7:7
(NEWLINE) 8:8
(TOKEN "string" "aabcd") 13:19
(NEWLINE) 20:20
(KEYWORD "return") 25:30
(TOKEN "integer" 5) 32:32
(NEWLINE) 33:33
(NEWLINE) 34:34
(TOKEN "identifier" f) 35:35
(TOKEN "(") 36:36
(TOKEN ")") 37:37
(NEWLINE) 38:38
(EOF) 39:39
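
The same CPython token stream can also be obtained from a script through the standard `tokenize` module, which may be handier for automated comparison than the command line. A minimal sketch; the helper name `cpython_tokens` is made up for illustration. Note that `tokenize.generate_tokens` reads text rather than bytes, so it does not produce the ENCODING token shown in the command-line output.

```python
import io
import tokenize

SOURCE = 'def f():\n    "aabcd"\n    return 5\n\nf()\n'


def cpython_tokens(source):
    """Return (type name, text, start, end) tuples from CPython's tokenizer."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string, tok.start, tok.end)
        for tok in tokenize.generate_tokens(readline)
    ]


if __name__ == "__main__":
    for name, text, start, end in cpython_tokens(SOURCE):
        print(f"{name:10} {text!r:12} {start}-{end}")
```

Here `start` and `end` are `(row, column)` pairs with 1-based rows and 0-based columns, matching the `1,0-1,3` style positions printed by `python -m tokenize`.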

Thirumalai-Shaktivel commented 2 years ago

Thanks for this, I'm now working on indent and dedent. After recognising both, I'll move on to the token comparison of Python and LPython.
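One thing a token comparison would need to handle is the position format. Judging only from the output shown above, `lpython --show-tokens` appears to print zero-based, inclusive character offsets into the file (for example `0:2` for `def`), while CPython reports `(row, column)` pairs with an exclusive end. The sketch below converts CPython positions into that inferred format; the helper name `absolute_spans` is hypothetical and the offset convention is an assumption, not something confirmed by the LPython sources.

```python
import io
import tokenize


def absolute_spans(source):
    """Map CPython tokens to (type name, text, first offset, last offset).

    Offsets are zero-based and inclusive. Zero-width tokens such as
    DEDENT and ENDMARKER are skipped because they cover no characters.
    """
    # Offset of the first character of each line; tokenize rows are 1-based.
    line_starts = [0]
    for line in source.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))

    spans = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.start == tok.end:
            continue  # zero-width token, nothing to locate
        first = line_starts[tok.start[0] - 1] + tok.start[1]
        last = line_starts[tok.end[0] - 1] + tok.end[1] - 1  # make end inclusive
        spans.append((tokenize.tok_name[tok.type], tok.string, first, last))
    return spans


if __name__ == "__main__":
    source = 'def f():\n    "aabcd"\n    return 5\n\nf()\n'
    for span in absolute_spans(source):
        print(span)
```

For the `a.py` example this gives `13` and `19` for the string literal and `25` and `30` for `return`, which agree with the LPython output above.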

Thirumalai-Shaktivel commented 2 years ago

http://re2c.org/examples/c/real_world/example_cxx98.html

certik commented 2 years ago

https://docs.python.org/3/reference/lexical_analysis.html

certik commented 2 years ago

Two ideas that we can explore later:

certik commented 2 years ago

#337 adds an initial Bison-based parser. #338 hooks it up for --show-ast.

Thirumalai-Shaktivel commented 2 years ago

Python Grammar: https://docs.python.org/3/reference/grammar.html

Thirumalai-Shaktivel commented 2 years ago

We're making great progress with our new parser. All the parser issues found in the NumPy repo have been reported (the good part is that only a few issues remain to be fixed). Next, we're moving on to testing the SymPy repo.

certik commented 2 years ago

Great job @akshanshbhatt and @Thirumalai-Shaktivel !!