lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

Is it possible to parse parts of the input? #1424

Open jjalowie opened 2 weeks ago

jjalowie commented 2 weeks ago

What is your question?

I'm interested in writing a transformer for certain parts of input programs, one that wouldn't require a full language grammar. So let's say I have the sample C++ code below:

void function(int a, int b) {
  // body
}

I want to parse just the function signatures and transform them, e.g. by adding a '_' prefix to all argument names. I.e. the output for the above input program should be:

void function(int _a, int _b) {
  // body (unchanged)
}

How could I possibly achieve this using Lark?

MegaIng commented 2 weeks ago

See https://github.com/lark-parser/lark/issues/1371

It's not exactly easy, and especially for C-like syntax, which is already tricky to parse with Lark, it might be quite annoying.

jjalowie commented 2 weeks ago

Wow, thanks for the quick response. I'm thinking about moving forward with Earley using the start: /.*/ rule /.*/ approach (my inputs aren't very big, so it should be fine performance-wise). I ran into the following problem, though, when experimenting:

import lark
grammar = f"""
    start: /.*/ expr /.*/
    expr: "qwerty"
    %import common.WS
    %ignore WS
"""
parser = lark.Lark(grammar, start='start', parser='earley', lexer='dynamic_complete')

ast = parser.parse("""asdf qwerty poiuy""")
print(ast.pretty())

This gives the following error:

[...]
  File "/home/jjalowie/.bin/mlir_parser/venv/lib/python3.10/site-packages/lark/parser_frontends.py", line 215, in create_earley_parser
    return f(lexer_conf, parser_conf, resolve_ambiguity=resolve_ambiguity,
  File "/home/jjalowie/.bin/mlir_parser/venv/lib/python3.10/site-packages/lark/parser_frontends.py", line 192, in create_earley_parser__dynamic
    earley_matcher = EarleyRegexpMatcher(lexer_conf)
  File "/home/jjalowie/.bin/mlir_parser/venv/lib/python3.10/site-packages/lark/parser_frontends.py", line 178, in __init__
    raise GrammarError("Dynamic Earley doesn't allow zero-width regexps", t)
lark.exceptions.GrammarError: ("Dynamic Earley doesn't allow zero-width regexps", TerminalDef('__ANON_0', '.*'))

I'm using lark 1.1.9.

MegaIng commented 2 weeks ago

As it says in the error message, the terminal can't be zero width. Use /.+/? instead.

jjalowie commented 2 weeks ago

Hm, I can't figure out how to proceed. Here is an example:

import lark
grammar = f"""
    start: /.+/? expr+ /.+/?
    expr: "qwerty" | "asdf"
"""
parser = lark.Lark(grammar, start='start', parser='earley', lexer='dynamic_complete')
ast = parser.parse("""asdf qwerty""")
print(ast.pretty())
# output:
# start
#   expr
#    qwerty

I can't force asdf to be parsed as an expr because it gets consumed greedily by /.+/?. Any tips on how to cope with that?

MegaIng commented 2 weeks ago

Use ambiguity='explicit' and manually select the correct parse, see https://github.com/lark-parser/lark/blob/master/examples/advanced/dynamic_complete.py

Or choose the easier and quicker route and use the scan function I coded up in the other issue.

jjalowie commented 2 weeks ago

From what I see https://github.com/MegaIng/lark/commit/497560879b0f88025b6c772a246dc0490814b061 doesn't expose a way to use transformers on the matched code. Could this be improved so I can also use transformers? What would be the needed steps?

For now I went with the Earley parser. I'm stuck on the code below. How should it behave?

import lark

grammar = f"""
    start: /.+/? expr+ /.+/?
    expr: "asdf" | "qwerty"
"""

parser = lark.Lark(grammar, parser='earley', lexer='dynamic_complete', ambiguity='explicit')
ast = parser.parse("""asdf qwerty""")
print(ast.pretty())

It yields this output:

start
  expr
   qwerty

Shouldn't there be an ambiguity between /.+/? and expr visible in the AST this time, because of ambiguity='explicit'? Isn't this a bug in lark?

MegaIng commented 2 weeks ago

Yeah, that does indeed look like a bug, maybe @erezsh has an idea what is going on.

With regard to scan: You have to do a bit of extra work right now, but you get the start and end positions of the segments that match, which you can then replace in the original text. This would make a good recipe to add to the documentation when scan gets officially added.

MegaIng commented 2 weeks ago

This function should do what you want using the scan method:

from typing import Callable, Optional

import lark


def scan_and_replace(parser: lark.Lark, text: str, replacement: Callable[[lark.ParseTree], str],
                     start: Optional[str] = None) -> str:
    """
    Scans `text` and replaces every match of `parser` with the value returned by `replacement`
    for the corresponding tree.

    `start` is for passing in the start rule if required, not the starting position of the scanning.
    """
    last = 0
    res = ""
    for (start_pos, end_pos), tree in parser.scan(text, start=start):
        res += text[last:start_pos]
        res += replacement(tree)
        last = end_pos
    res += text[last:]  # keep the text after the final match
    return res

jjalowie commented 2 weeks ago

This function should do what you want using the scan method: [...]

That looks very promising. I will pick up from there. Thanks a lot!

erezsh commented 2 weeks ago

For now I went with the Earley parser. I'm stuck on the code below. How should it behave?


import lark

grammar = f"""
    start: /.+/? expr+ /.+/?
    expr: "asdf" | "qwerty"
"""

It's not a bug; you have a space in your text, and only ANY (the /.+/ terminal) can handle it.

When I change to this grammar:

    !start: any? expr+ any?
    any: /.+/
    !expr: "asdf" | "qwerty"

    %ignore " "

I get -

_ambig
  start
    expr        asdf
    any  qwerty
  start
    expr        asdf
    any qwerty
  start
    expr        asdf
    expr        qwerty

(this is without dynamic_complete, which adds more derivations)

MegaIng commented 2 weeks ago

@erezsh No, what actually fixed it is moving /.+/? into a separate rule. These two grammars should have the same behavior, but don't:

grammar = f"""
    start: any? expr+ any?
    any: /.+/
    !expr: "asdf" | "qwerty"
"""
grammar = f"""
    start: /.+/? expr+ /.+/?
    !expr: "asdf" | "qwerty"
"""

The latter doesn't produce any ambiguities with dynamic_complete; the former does.

Similarly for your grammar: with dynamic_complete, it produces fewer derivations when any is inlined.

erezsh commented 1 week ago

I tested @MegaIng's example again on the latest master, and it looks like this bug is fixed!