certik opened 2 years ago
After #310 is merged, here are the next steps:

- Emit `indent` and `dedent` tokens properly (i.e., handle indentation in the tokenizer) (#313); a sketch of the indentation-stack idea follows below.

It doesn't have to be perfect, it won't be perfect, but I would definitely do the above three things. Then we can move to the parser and start parsing it. And then we'll iterate on the design of the tokenizer as needed.
Here is how to access Python's tokenization:
```console
$ cat a.py
def f():
    "aabcd"
    return 5

f()
$ python -m tokenize a.py
0,0-0,0: ENCODING 'utf-8'
1,0-1,3: NAME 'def'
1,4-1,5: NAME 'f'
1,5-1,6: OP '('
1,6-1,7: OP ')'
1,7-1,8: OP ':'
1,8-1,9: NEWLINE '\n'
2,0-2,4: INDENT '    '
2,4-2,11: STRING '"aabcd"'
2,11-2,12: NEWLINE '\n'
3,4-3,10: NAME 'return'
3,11-3,12: NUMBER '5'
3,12-3,13: NEWLINE '\n'
4,0-4,1: NL '\n'
5,0-5,0: DEDENT ''
5,0-5,1: NAME 'f'
5,1-5,2: OP '('
5,2-5,3: OP ')'
5,3-5,4: NEWLINE '\n'
6,0-6,0: ENDMARKER ''
$ lpython --show-tokens a.py
(KEYWORD "def") 0:2
(TOKEN "identifier" f) 4:4
(TOKEN "(") 5:5
(TOKEN ")") 6:6
(TOKEN ":") 7:7
(NEWLINE) 8:8
(TOKEN "string" "aabcd") 13:19
(NEWLINE) 20:20
(KEYWORD "return") 25:30
(TOKEN "integer" 5) 32:32
(NEWLINE) 33:33
(NEWLINE) 34:34
(TOKEN "identifier" f) 35:35
(TOKEN "(") 36:36
(TOKEN ")") 37:37
(NEWLINE) 38:38
(EOF) 39:39
```
Thanks for this, I'm now working on `indent` and `dedent`. After recognising both, I'll move on to the token comparison of Python and LPython.
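For that comparison, note that the two outputs above differ in shape: CPython reports (row,col) spans plus ENCODING, INDENT, DEDENT, and ENDMARKER tokens, while `lpython --show-tokens` currently prints flat byte offsets and no indent tokens. A small hypothetical helper (not from this thread) using CPython's `tokenize` module could dump the reference stream for diffing:

```python
import sys
import tokenize

def dump_tokens(path):
    """Print CPython's token stream for `path`, one token per line,
    in a form that can be diffed against `lpython --show-tokens`."""
    with open(path, "rb") as f:  # tokenize.tokenize wants a bytes readline
        for tok in tokenize.tokenize(f.readline):
            name = tokenize.tok_name[tok.type]
            print(f"{tok.start[0]},{tok.start[1]}-{tok.end[0]},{tok.end[1]}: "
                  f"{name} {tok.string!r}")

if __name__ == "__main__":
    dump_tokens(sys.argv[1])
```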
Two ideas that we can explore later:

- `--show-ast` (see the sketch below).
- Python Grammar: https://docs.python.org/3/reference/grammar.html
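On the `--show-ast` idea: CPython's reference AST for the same file can be printed with the standard `ast` module, which may help when comparing LPython's AST output later. A sketch under that assumption (requires Python 3.9+ for the `indent` argument):

```python
import ast

# Parse a.py with CPython and pretty-print its AST; this is the
# reference output to compare an `lpython --show-ast` run against.
with open("a.py") as f:
    tree = ast.parse(f.read(), filename="a.py")
print(ast.dump(tree, indent=4))
```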
We're making great progress with our new parser. All the parser issues from the NumPy repo have been reported (the good part is that we only have a few issues left to fix). Now, moving on to testing the SymPy repo.
Great job @akshanshbhatt and @Thirumalai-Shaktivel!!
The re2c tokenizer should track the indentation level and insert INDENT/DEDENT tokens. Each INDENT will have a matching DEDENT.
It should return exactly the same tokens as CPython does (see the `python -m tokenize` output above).
Links: