certik commented 2 years ago

The re2c tokenizer should track the indentation level and insert INDENT / DEDENT tokens. Each INDENT must have a matching DEDENT.

It should return exactly the same tokens as CPython does:

https://github.com/python/cpython/blob/755be9b1505af591b9f2ee424a6525b6c2b65ce9/Grammar/Tokens
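
A minimal sketch of the indentation tracking described above, written in Python rather than re2c for brevity. It assumes space-only indentation and ignores line continuations, open brackets and multi-line strings. `indent_tokens` and the `LINE` placeholder kind are hypothetical names used for illustration, not part of LPython or CPython.

```python
def indent_tokens(source):
    """Yield (kind, text) pairs: INDENT, DEDENT, NEWLINE, or LINE.

    LINE is a placeholder for the ordinary tokens of one logical line.
    Assumption: indentation uses spaces only.
    """
    stack = [0]  # indentation widths of the blocks that are currently open
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        if not stripped or stripped.startswith("#"):
            continue  # blank and comment-only lines never change the level
        width = len(line) - len(stripped)
        if width > stack[-1]:
            # Deeper than the current block: open exactly one new block.
            stack.append(width)
            yield ("INDENT", line[:width])
        while width < stack[-1]:
            # Shallower: close blocks until a matching level is found.
            stack.pop()
            yield ("DEDENT", "")
        if width != stack[-1]:
            raise IndentationError("unindent does not match any outer level")
        yield ("LINE", stripped)
        yield ("NEWLINE", "\n")
    # Close every block still open at end of input so INDENT and DEDENT pair up.
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT", "")
```

Keeping a stack of widths, rather than a single counter, is what lets one line close several blocks at once while still guaranteeing that every INDENT gets its DEDENT.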

Links:

Thirumalai-Shaktivel commented 2 years ago

https://github.com/yuangao2/Python_Interpreter/blob/master/includes/scan.l
https://github.com/python/cpython/blob/main/Parser/token.c
https://github.com/python/cpython/tree/main/Parser

Thirumalai-Shaktivel commented 2 years ago

https://github.com/python/cpython/blob/main/Include/token.h

certik commented 2 years ago

After #310 is merged, here are the next steps:

[x] Emit indent and dedent tokens properly (i.e., handle indentation in the tokenizer) (#313)
[x] Find a way to access CPython's tokens, then compare against them to check that we produce exactly those tokens, no more and no less.
[x] Remove all the Fortran stuff from the tokenizer (#314)

It doesn't have to be perfect (it won't be), but I would definitely do the above three things. Then we can move on to the parser and start parsing the token stream, and we'll iterate on the design of the tokenizer as needed.

certik commented 2 years ago

Here is how to access Python's tokenization:

$ cat a.py 
def f():
    "aabcd"
    return 5

f()
$ python -m tokenize a.py   
0,0-0,0:            ENCODING       'utf-8'        
1,0-1,3:            NAME           'def'          
1,4-1,5:            NAME           'f'            
1,5-1,6:            OP             '('            
1,6-1,7:            OP             ')'            
1,7-1,8:            OP             ':'            
1,8-1,9:            NEWLINE        '\n'           
2,0-2,4:            INDENT         '    '         
2,4-2,11:           STRING         '"aabcd"'      
2,11-2,12:          NEWLINE        '\n'           
3,4-3,10:           NAME           'return'       
3,11-3,12:          NUMBER         '5'            
3,12-3,13:          NEWLINE        '\n'           
4,0-4,1:            NL             '\n'           
5,0-5,0:            DEDENT         ''             
5,0-5,1:            NAME           'f'            
5,1-5,2:            OP             '('            
5,2-5,3:            OP             ')'            
5,3-5,4:            NEWLINE        '\n'           
6,0-6,0:            ENDMARKER      ''             
$ lpython --show-tokens a.py
(KEYWORD "def") 0:2
(TOKEN "identifier" f) 4:4
(TOKEN "(") 5:5
(TOKEN ")") 6:6
(TOKEN ":") 7:7
(NEWLINE) 8:8
(TOKEN "string" "aabcd") 13:19
(NEWLINE) 20:20
(KEYWORD "return") 25:30
(TOKEN "integer" 5) 32:32
(NEWLINE) 33:33
(NEWLINE) 34:34
(TOKEN "identifier" f) 35:35
(TOKEN "(") 36:36
(TOKEN ")") 37:37
(NEWLINE) 38:38
(EOF) 39:39
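
The same CPython token stream can also be obtained from Python code through the standard `tokenize` module, which is what `python -m tokenize` runs. The sketch below uses `exact_type` so that operators are reported with the names from `Grammar/Tokens` (for example `LPAR` instead of the generic `OP`); on the command line, `python -m tokenize -e a.py` shows the same exact names. The helper name `cpython_tokens` and the choice to drop `NL` and `COMMENT` are assumptions made for illustration.

```python
import io
import tokenize

# Kinds that carry no grammatical meaning. Dropping them is an assumption
# about what a parser-facing comparison needs, not something decided here.
SKIPPED = {tokenize.ENCODING, tokenize.NL, tokenize.COMMENT}


def cpython_tokens(source):
    """Return (exact_type_name, text) pairs from CPython's tokenize module."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.exact_type], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in SKIPPED
    ]


if __name__ == "__main__":
    example = 'def f():\n    "aabcd"\n    return 5\n\nf()\n'
    for name, text in cpython_tokens(example):
        print(f"{name:<10} {text!r}")
```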

Thirumalai-Shaktivel commented 2 years ago

Thanks for this, I'm now working on indent and dedent. After recognising both, I'll move on to the token comparison of Python and LPython.

Thirumalai-Shaktivel commented 2 years ago

http://re2c.org/examples/c/real_world/example_cxx98.html

certik commented 2 years ago

https://docs.python.org/3/reference/lexical_analysis.html

certik commented 2 years ago

Two ideas that we can explore later:

Automatic comparison with Python tokens. We should do it later, since exact adherence to Python's tokens may not be what we want (it might turn out we need some slight deviations to make our parser easier).
Possibly generate the Bison parser's actions automatically. In LFortran we use macros; perhaps the names of the macros can be generated automatically, or even the AST nodes. Perhaps not in all cases, but in a lot of them, and the rest we would still do manually.
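
As a rough sketch of the first idea, the comparison could normalize CPython's tokens for any deliberate deviations and then report the first position where the two streams differ. The deviation shown here (reporting keywords as `KEYWORD` rather than `NAME`) is only an example suggested by the `--show-tokens` output earlier in this thread. The names `reference_tokens` and `first_mismatch` are hypothetical, and how LPython's output would be converted into `(kind, text)` pairs is left open.

```python
import io
import keyword
import tokenize
from itertools import zip_longest


def reference_tokens(source):
    """CPython's tokens as (kind, text), adjusted for one allowed deviation."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.NL, tokenize.COMMENT):
            continue  # assumption: these kinds are not compared
        kind = tokenize.tok_name[tok.type]
        if tok.type == tokenize.NAME and keyword.iskeyword(tok.string):
            kind = "KEYWORD"  # example deviation: keywords get their own kind
        out.append((kind, tok.string))
    return out


def first_mismatch(expected, actual):
    """Return (index, expected_item, actual_item) at the first difference.

    Returns None when both sequences are identical. A missing item on
    either side is reported as None.
    """
    for index, (want, got) in enumerate(zip_longest(expected, actual)):
        if want != got:
            return index, want, got
    return None


if __name__ == "__main__":
    expected = reference_tokens("def f():\n    return 5\n")
    # Stand-in for another tokenizer's output: same stream without the DEDENT.
    actual = [t for t in expected if t[0] != "DEDENT"]
    print(first_mismatch(expected, actual))
```

Putting the allowed deviations in one normalization function keeps the comparison itself strict, so any new difference shows up as a mismatch rather than being silently tolerated.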

certik commented 2 years ago

#337 adds an initial Bison-based parser. #338 hooks it up for `--show-ast`.

Thirumalai-Shaktivel commented 2 years ago

Python Grammar: https://docs.python.org/3/reference/grammar.html

Thirumalai-Shaktivel commented 2 years ago

We're making great progress with our new parser. All the parser issues found in the NumPy repo have been reported (the good part is that only a few issues remain to be fixed). Now we're moving on to testing against the SymPy repo.

certik commented 2 years ago

Great job @akshanshbhatt and @Thirumalai-Shaktivel !!

lcompilers / lpython

Write a re2c+Bison parser for Python #298
