treykeown opened this issue 4 months ago
I checked out the `pegen` package, and while it's good, I'd like to stick with exactly what's in the CPython tree. I found a decent way to do that by first preprocessing the `python.gram` file:
- Keep only the text between the `# ========================= START OF THE GRAMMAR =========================` and `# ========================= END OF THE GRAMMAR ===========================` marker lines
- Remove all `invalid_*` rules
- Strip the inline grammar actions by deleting matches of `(?<!["'])(?:{([^{}]*)})(?!["'])` (the lookarounds leave quoted braces like `'{'` alone)
- Drop the return-type annotations from rule headers: replace `\[[a-zA-Z_\*]+\]:` with `:` and `\[[a-zA-Z_\*]+\] \(memo\):` with ` (memo):`
- Add a `start: file` rule as the entry point (a scripted sketch of these steps follows the list)
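Roughly, that preprocessing can be scripted like this. The marker slicing and the `invalid_*` removal pass are my own reconstruction of how to apply the substitutions above, so treat it as a sketch:

```python
import re
from pathlib import Path

src = Path("../../Grammar/python.gram").read_text()

# Keep only the text between the START/END marker lines.
start_marker = "# ========================= START OF THE GRAMMAR ========================="
end_marker = "# ========================= END OF THE GRAMMAR ==========================="
src = src[src.index(start_marker) + len(start_marker):src.index(end_marker)]

# Strip inline actions ({...} blocks), leaving quoted braces like '{' alone.
# Repeat the substitution so nested braces inside C actions are handled.
action = re.compile(r"""(?<!["'])(?:{([^{}]*)})(?!["'])""")
while action.search(src):
    src = action.sub("", src)

# Drop the return-type annotations from rule headers.
src = re.sub(r"\[[a-zA-Z_\*]+\] \(memo\):", " (memo):", src)
src = re.sub(r"\[[a-zA-Z_\*]+\]:", ":", src)

# Remove invalid_* rule definitions and any alternatives that reference them.
kept, skipping = [], False
for line in src.splitlines():
    if re.match(r"invalid_\w+", line):      # header of an invalid_* rule
        skipping = True
        continue
    if skipping:
        if line and not line[0].isspace():  # next top-level rule ends the skip
            skipping = False
        else:
            continue
    if "invalid_" in line:                  # alternatives like `| invalid_...`
        continue
    kept.append(line)

# Prepend the start rule and write the cleaned grammar.
Path("../../Grammar/clean-python.gram").write_text("start: file\n" + "\n".join(kept) + "\n")
```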
Then, generate the parser:
```
$ python -m pegen python ../../Grammar/clean-python.gram
```
Then replace the header of the generated `parse.py` with this:
```python
#!/usr/bin/env python
import tokenize
from typing import Any, Optional

from pegen.parser import (
    Parser,
    logger as _logger,
    memoize as _memoize,
    memoize_left_rec as _memoize_left_rec,
)


def tag_result(method_name: str, value: Any) -> Any:
    """Prefix list-valued rule results with the rule name, building a tree of rule applications."""
    if method_name.startswith("_"):
        return value
    if not isinstance(value, list) or not value:  # also guards against empty results
        return value
    if isinstance(value[0], str):
        return value  # already headed by a name; don't tag twice
    return [method_name] + value


def memoize(method):
    wrapped = _memoize(method)

    def wrapper(*args, **kwargs) -> Any:
        return tag_result(method.__name__, wrapped(*args, **kwargs))

    return wrapper


def memoize_left_rec(method):
    wrapped = _memoize_left_rec(method)

    def wrapper(*args, **kwargs) -> Any:
        return tag_result(method.__name__, wrapped(*args, **kwargs))

    return wrapper


def logger(method):
    wrapped = _logger(method)

    def wrapper(*args, **kwargs) -> Any:
        return tag_result(method.__name__, wrapped(*args, **kwargs))

    return wrapper


# noinspection PyUnboundLocalVariable,SpellCheckingInspection,PyArgumentList,PyShadowingBuiltins,PyUnusedLocal
class GeneratedParser(Parser):
    # The FSTRING_* functions are here because they were never added to pegen... that should be fixed!
    # (tokenize.FSTRING_START and friends require Python >= 3.12.)
    @memoize
    def FSTRING_START(self) -> Optional[tokenize.TokenInfo]:
        tok = self._tokenizer.peek()
        if tok.type == tokenize.FSTRING_START:
            return self._tokenizer.getnext()
        return None

    @memoize
    def FSTRING_MIDDLE(self) -> Optional[tokenize.TokenInfo]:
        tok = self._tokenizer.peek()
        if tok.type == tokenize.FSTRING_MIDDLE:
            return self._tokenizer.getnext()
        return None

    @memoize
    def FSTRING_END(self) -> Optional[tokenize.TokenInfo]:
        tok = self._tokenizer.peek()
        if tok.type == tokenize.FSTRING_END:
            return self._tokenizer.getnext()
        return None

    # Leave everything below `class GeneratedParser(Parser):` in the original file here...
```
... and now you have a working Python parser, generated straight from the grammar file.
Important to note: it won't build a detailed AST. It only builds a tree showing which rules matched which tokens. That's plenty of detail for our purposes with our custom XML rules, but it won't work for most Python parsing applications.
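As a quick illustration of the tagging convention (my own sanity check against the `tag_result` shown above, assuming the header was saved as `parse.py`):

```python
from parse import tag_result

# Underscore-prefixed helper rules and non-list results pass through untouched.
assert tag_result("_loop", [1, 2]) == [1, 2]
assert tag_result("expr", "leaf") == "leaf"

# List results get the rule name prepended, which is what builds the rule tree.
assert tag_result("expr", [["a"], ["b"]]) == ["expr", ["a"], ["b"]]

# Lists already headed by a string are left alone (no double tagging).
assert tag_result("expr", ["other", ["a"]]) == ["other", ["a"]]
```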
To use the parser:
```python
from os.path import expanduser
from tokenize import generate_tokens

from pegen.tokenizer import Tokenizer

from parse import GeneratedParser

# open() won't expand "~" itself, so expand the path explicitly.
with open(expanduser("~/test.py")) as f:
    tokengen = generate_tokens(f.readline)
    tok = Tokenizer(tokengen)
    p = GeneratedParser(tok)
    out = p.start()
```
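The `out` value is nested lists headed by rule names, per `tag_result`. A small walker makes it readable; this helper is my own sketch, not part of pegen:

```python
def dump(node, indent: int = 0) -> None:
    """Pretty-print the tagged rule tree produced by the parser above."""
    pad = "  " * indent
    if isinstance(node, list) and node and isinstance(node[0], str):
        # A tagged node: rule name followed by its children.
        print(pad + node[0])
        for child in node[1:]:
            dump(child, indent + 1)
    elif isinstance(node, list):
        # An untagged sequence (e.g. from a helper rule): recurse in place.
        for child in node:
            dump(child, indent)
    else:
        # A leaf: typically a TokenInfo or None.
        print(pad + repr(node))

dump(out)
```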
I'd like to transition to peggy. It's a thin wrapper over the version of pegen distributed with CPython (though that version could use some fixes). Since it can generate a parser from Python's canonical grammar file, it'll probably be the easiest to keep in sync with future language changes.
`pegen` is the parser generator used natively by CPython. It supports generating both C code and Python code. I think it's preferable to use this over `parso`, though both seem well maintained. Worth a look.