pyxy-org / pyxy

HTML in Python
MIT License

Move away from `parso`? #1

Open treykeown opened 4 months ago

treykeown commented 4 months ago

pegen is the PEG parser generator that CPython itself uses to build its parser. It can generate both C code and Python code. I think it's preferable to parso, though both seem well maintained. Worth a look.

treykeown commented 4 months ago

I checked out the pegen package, and while it's good, I'd like to stick with exactly what's in the CPython tree. I found a decent way to do that by first preprocessing the python.gram file:

  1. Clip to start and end:
    • Start: # ========================= START OF THE GRAMMAR =========================
    • End: # ========================= END OF THE GRAMMAR ===========================
  2. Remove references to invalid_* rules
  3. Strip the brace-delimited action blocks: (?<!["'])(?:{([^{}]*)})(?!["'])
  4. Remove type declarations:
    • Normal: \[[a-zA-Z_\*]+\]:
    • Memo: \[[a-zA-Z_\*]+\] \(memo\):
  5. Append a start rule: start: file
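
The steps above could be scripted roughly as follows. This is a minimal sketch, not the script actually used: the name `clean_grammar` is hypothetical, the type declarations are assumed to be replaced by a bare `:` (or ` (memo):`), and references to `invalid_*` rules are assumed to be alternatives that sit on their own line, so any inline references would need extra handling. The toy grammar at the end only exercises the regexes and is not real `python.gram` content.

```python
import re

START_MARK = "# ========================= START OF THE GRAMMAR ========================="
END_MARK = "# ========================= END OF THE GRAMMAR ==========================="

# Regexes copied from steps 3 and 4.
ACTION_RE = re.compile(r"""(?<!["'])(?:{([^{}]*)})(?!["'])""")
TYPE_MEMO_RE = re.compile(r"\[[a-zA-Z_\*]+\] \(memo\):")
TYPE_RE = re.compile(r"\[[a-zA-Z_\*]+\]:")
# Assumption: every invalid_* reference is an alternative on its own line.
INVALID_ALT_RE = re.compile(r"^[ \t]*\|[ \t]*invalid_\w+[ \t]*\n", re.MULTILINE)


def clean_grammar(text: str) -> str:
    """Apply the five preprocessing steps to the text of a grammar file."""
    # Step 1: keep only what lies between the two marker comments.
    start = text.index(START_MARK) + len(START_MARK)
    end = text.index(END_MARK)
    body = text[start:end]
    # Step 2: drop alternatives that only reference an invalid_* rule.
    body = INVALID_ALT_RE.sub("", body)
    # Step 3: strip the brace-delimited actions (quoted '{' and '}' tokens survive).
    body = ACTION_RE.sub("", body)
    # Step 4: remove type declarations, memo form first.
    body = TYPE_MEMO_RE.sub(" (memo):", body)
    body = TYPE_RE.sub(":", body)
    # Step 5: append the start rule.
    return body.strip() + "\n\nstart: file\n"


# Toy input, made up for illustration.
sample = (
    "junk header\n"
    + START_MARK + "\n"
    + "file[mod_ty]: a=[statements] ENDMARKER { _PyPegen_make_module(p, a) }\n"
    + "atom[expr_ty] (memo):\n"
    + "    | invalid_atom\n"
    + "    | '{' NAME '}' { a }\n"
    + END_MARK + "\n"
    + "invalid_atom: NAME\n"
)
cleaned = clean_grammar(sample)
print(cleaned)
```

For the real file, the same function would be called on the contents of python.gram and the result written to clean-python.gram.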

Then, generate the parser:

$ python -m pegen python ../../Grammar/clean-python.gram

Then replace the header of the generated parse.py with this:

#!/usr/bin/env python

import tokenize

from typing import Any, Optional

from pegen.parser import memoize as _memoize, memoize_left_rec as _memoize_left_rec, logger as _logger, Parser

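def tag_result(method_name: str, value: Any):
    # Helper rules (leading underscore) stay untagged.
    if method_name.startswith("_"):
        return value
    # Only lists are tagged; tokens and None pass through unchanged.
    if not isinstance(value, list):
        return value
    # Skip empty lists (value[0] would raise IndexError) and lists already tagged with a rule name.
    if not value or isinstance(value[0], str):
        return value
    return [method_name] + value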

def memoize(method):
    wrapped = _memoize(method)
    def wrapper(*args, **kwargs) -> Any:
        return tag_result(method.__name__, wrapped(*args, **kwargs))
    return wrapper

def memoize_left_rec(method):
    wrapped = _memoize_left_rec(method)
    def wrapper(*args, **kwargs) -> Any:
        return tag_result(method.__name__, wrapped(*args, **kwargs))
    return wrapper

def logger(method):
    wrapped = _logger(method)
    def wrapper(*args, **kwargs) -> Any:
        return tag_result(method.__name__, wrapped(*args, **kwargs))
    return wrapper

# noinspection PyUnboundLocalVariable,SpellCheckingInspection,PyArgumentList,PyShadowingBuiltins,PyUnusedLocal
class GeneratedParser(Parser):
    # The FSTRING_* functions are here because they were never added to pegen... that should be fixed!
    # The token constants are looked up on `tokenize` (Python 3.12+); a bare FSTRING_START is not defined at module level.
    @memoize
    def FSTRING_START(self) -> Optional[tokenize.TokenInfo]:
        tok = self._tokenizer.peek()
        if tok.type == tokenize.FSTRING_START:
            return self._tokenizer.getnext()
        return None

    @memoize
    def FSTRING_MIDDLE(self) -> Optional[tokenize.TokenInfo]:
        tok = self._tokenizer.peek()
        if tok.type == tokenize.FSTRING_MIDDLE:
            return self._tokenizer.getnext()
        return None

    @memoize
    def FSTRING_END(self) -> Optional[tokenize.TokenInfo]:
        tok = self._tokenizer.peek()
        if tok.type == tokenize.FSTRING_END:
            return self._tokenizer.getnext()
        return None

# Leave everything below `class GeneratedParser(Parser):` in the original file here...

... and now you have a working Python parser, generated straight from the grammar file.

Important to note: it won't build a detailed AST. It only builds a tree showing which rules matched which tokens. That's detailed enough for our purposes with our custom XML rules, but it won't work for most Python parsing applications.
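
To make the shape of that tree concrete, here is a small sketch that prints one. It assumes the output is what `tag_result` suggests: nested lists whose first element is a rule name, with `tokenize.TokenInfo` leaves and `None` for optional parts that did not match. The name `iter_tagged` and the sample tree are made up for illustration and are not real parser output.

```python
import token
import tokenize
from typing import Any, Iterator


def iter_tagged(node: Any, depth: int = 0) -> Iterator[str]:
    """Yield one indented line per rule name or token in a tagged tree."""
    indent = "  " * depth
    if isinstance(node, tokenize.TokenInfo):
        # Leaf: show the token type and its text.
        yield f"{indent}{token.tok_name[node.type]} {node.string!r}"
    elif isinstance(node, list):
        children = node
        if node and isinstance(node[0], str):
            # Tagged list: the first element is the rule name.
            yield f"{indent}{node[0]}"
            children = node[1:]
            depth += 1
        for child in children:
            yield from iter_tagged(child, depth)
    elif node is not None:
        # Anything else is shown as-is; None (no match) is skipped.
        yield f"{indent}{node!r}"


# Hand-built sample tree, not produced by the generated parser.
name_tok = tokenize.TokenInfo(tokenize.NAME, "x", (1, 0), (1, 1), "x\n")
sample_tree = ["file", ["statements", ["atom", name_tok]], None]
lines = list(iter_tagged(sample_tree))
print("\n".join(lines))
```

Untagged lists (from helper rules whose names start with an underscore) are flattened into their parent's level here, which keeps the printout focused on named rules.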

To use the parser:

import os
from tokenize import generate_tokens
from pegen.tokenizer import Tokenizer
from parse import GeneratedParser

# open() does not expand "~", so resolve the home directory explicitly.
with open(os.path.expanduser("~/test.py")) as f:
    tokengen = generate_tokens(f.readline)
    tok = Tokenizer(tokengen)
    p = GeneratedParser(tok)
    out = p.start()

treykeown commented 4 months ago

I'd like to transition to peggy. It's a thin wrapper over the version of pegen distributed with CPython (though that version could use some fixes). Since it can generate a parser from Python's canonical grammar file, it'll probably be the easiest to keep in sync with future language changes.