Add EOF symbol to match end of input

erezsh commented 6 years ago

Can be signified with $

Must be optional (grammars don't have to contain it to be correct)

Need to make sure it works for both LALR and Earley, and with indentation.

isidentical commented 5 years ago

Any update?

erezsh commented 5 years ago

I have a version that should be working for LALR, in branch end_symbol

If anyone wants to test it and let me know how it goes, that would be great!

(that's https://github.com/lark-parser/lark/tree/end_symbol)

@isidentical @jruales @coreyt

jisaacstone commented 4 years ago

I did a rebase of this onto the master branch. There was a conflict in lalr_parser.py that I'm not sure I resolved correctly.

https://github.com/jisaacstone/lark/blob/end_symbol/tests/test_parser.py#L1650-L1660

I but all the tests seem to pass. To be sure I added another test in and it failed

https://github.com/jisaacstone/lark/blob/end_symbol/tests/test_parser.py#L1650-L1660

======================================================================
ERROR: test_end_symbol2 (tests.test_parser._make_parser_test.<locals>._TestParser)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/larpy/lark/tests/test_parser.py", line 1659, in test_end_symbol2
    self.assertEqual(parser.parse('axa'), Tree('start', [Tree('a', []),Tree('a', [])]))
  File "/Users/larpy/lark/lark/lark.py", line 293, in parse
    return self.parser.parse(text, start=start)
  File "/Users/larpy/lark/lark/parser_frontends.py", line 88, in parse
    return self._parse(token_stream, start)
  File "/Users/larpy/lark/lark/parser_frontends.py", line 54, in _parse
    return self.parser.parse(input, start, *args)
  File "/Users/larpy/lark/lark/parsers/lalr_parser.py", line 36, in parse
    return self.parser.parse(*args)
  File "/Users/larpy/lark/lark/parsers/lalr_parser.py", line 110, in parse
    reduce(arg)
  File "/Users/larpy/lark/lark/parsers/lalr_parser.py", line 77, in reduce
    value = self.callbacks[rule](s)
  File "/Users/larpy/lark/lark/parse_tree_builder.py", line 28, in __call__
    res = self.node_builder(children)
  File "/Users/larpy/lark/lark/parse_tree_builder.py", line 121, in __call__
    filtered.append(children[i])
IndexError: list index out of range

----------------------------------------------------------------------

I need to dive deep to get a better understanding of what is going on here exactly, but meanwhile does anything look off about the code I linked or the test I wrote?

erezsh commented 4 years ago

The test seems fine. I think the exception happens because some of the new preprocessing code isn't aware that $ isn't a real terminal (in the sense that it doesn't match any text)

I'll look into it soon, unless you'll manage to work it out sooner.

jisaacstone commented 4 years ago

I've done a bit more digging

2 things that may or may not be relevant

1st: above test fails with the same error on the original branch (before rebase)

2nd: Altering the test slightly produces a different error:

        @unittest.skipIf(PARSER!='lalr', "Using the end symbol currently works for LALR only")
        def test_end_symbol2(self):
            grammar = """
                start: (a|b)+
                a: "a" (E|ND)
                b: "b"
                E: $
                ND: "x"
            """
            parser = _Lark(grammar)

            self.assertEqual(parser.parse('axa'), Tree('start', [Tree('a', []),Tree('a', [])]))
            self.assertRaises(UnexpectedInput, parser.parse, 'ab')

======================================================================
FAIL: test_end_symbol2 (tests.test_parser._make_parser_test.<locals>._TestParser)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/larpy/lark/tests/test_parser.py", line 1659, in test_end_symbol2
    parser = _Lark(grammar)
  File "/Users/larpy/lark/tests/test_parser.py", line 535, in _Lark
    return Lark(grammar, lexer=lexer_class_or_name, parser=PARSER, propagate_positions=True, **kwargs)
  File "/Users/larpy/lark/lark/lark.py", line 206, in __init__
    self.terminals, self.rules, self.ignore_tokens = self.grammar.compile(self.options.start)
  File "/Users/larpy/lark/lark/load_grammar.py", line 494, in compile
    for name, (term_tree, priority) in term_defs if term_tree]
  File "/Users/larpy/lark/lark/load_grammar.py", line 494, in <listcomp>
    for name, (term_tree, priority) in term_defs if term_tree]
  File "/Users/larpy/lark/lark/lexer.py", line 69, in __init__
    assert isinstance(pattern, Pattern), pattern
AssertionError: None

----------------------------------------------------------------------

In the rebase case this fails in a different way, because an assert has been added

https://github.com/lark-parser/lark/blob/master/lark/load_grammar.py#L637

removing the assert and it fails in the same way.

Hope these findings are helpful. About to go vacation so probably won't get back to looking at this for a couple weeks at least

erezsh commented 4 years ago

Hi @jisaacstone,

I just created a new branch, end_symbol3, which merges end_symbol with the most recent master.

I don't get the same exceptions as you. In fact, everything works perfectly for me.

I also added the two tests you outline here (renamed the 2nd version to test_end_symbol3, with a few corrections), and they all seem to pass.

Let me know if you think I missed something. Otherwise, all you have left to do, is to get it to work for Earley.

Enjoy your vacation!

jisaacstone commented 4 years ago

OK Looks like my original thought was correct and this section of code i did not resolve the conflicts correctly

https://github.com/jisaacstone/lark/blob/end_symbol/tests/test_parser.py#L1650-L1660

Should ought to have gone back to that when I hit trouble. Sorry for the misdirection

jnwatson commented 4 years ago

+1 for $ to represent end of string. I just ran into a need for this.

erezsh commented 4 years ago

Just pushed this into master. You can now use $ in the LALR parser. (No Earley support yet)

Here's an example from the tests that shows what it does:

def test_end_symbol2(self):
    grammar = """
        start: (a|b)+
        a: "a" ("x"|$)
        b: "b"
    """
    parser = _Lark(grammar)

   self.assertEqual(parser.parse('axa'), Tree('start', [Tree('a', []),Tree('a', [])]))
   self.assertRaises(UnexpectedInput, parser.parse, 'ab')

kneufeld commented 4 years ago

I've just checked master and I can't find test_end_symbol2 and the example grammar listed above doesn't import.

E           lark.exceptions.GrammarError: Unexpected input at line 2 column 13 in calculator/grammar.lark:
E
E           a: "a" ("x"|$)
E                       ^

lark/lark/load_grammar.py:890: GrammarError

I don't claim to be even remotely be an expert on parsing but not being able to match $ seems to make even the simplest of grammars unparseable.

For instance (simplified):

start: expr+

expr: expr OPER expr
      | SIGNED_NUMBER

OPER: "+" | "-" | "*" | "/"

parse("1-1") # Tree('start', [Tree('statement', [1.0]), Tree('statement', [-1.0])])

whereas start: (expr ("\n"|$))+ would actually pick expr oper expr

kneufeld commented 4 years ago

Looks like you reverted in 3bee21051e8440e506ea13f45b224e2f6d668662 but you don't say why.

MegaIng commented 4 years ago

If I remember correctly, there were some problems with unexpected side effects.

I don't claim to be even remotely be an expert on parsing but not being able to match $ seems to make even the simplest of grammars unparseable.

Maybe, but your example is wrong:

start: expr ("\n" expr?)*

works.

kneufeld commented 4 years ago

So it does... again, not an expert, but that's very non intuitive.

Thanks for the help!

erezsh commented 4 years ago

@kneufeld I reverted it because it was buggy.

I agree it would be a nice feature, but I don't think it's a necessity.

ThatXliner commented 3 years ago

Is there any update? I kind of need this. Or is there a workaround?

erezsh commented 3 years ago

I'll see what I can do

erezsh commented 3 years ago

@ThatXliner See PR #880

lark-parser / lark

Add EOF symbol to match end of input #237