lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

Help with my grammar, getting a rule* to end using negative lookahead works, how else to do it using rules or terminals? #1248

Closed by mbaily 1 year ago

mbaily commented 1 year ago

I have a lark grammar:

lark1_grammar = r"""
start: (item | other_line | item_content_line)*
other_line: LINE? _NL
item.2: header item_content_line* dashed_line_end?
header: HEADER_NAME _NL DASHED_LINE _NL
item_content_line: ITEM_CONTENT_LINE _NL
dashed_line_end: DASHED_LINE _NL

ITEM_CONTENT_LINE: /(?!^-{100}-*$|^[A-Z]+$\n^-{100}-*$)^.+$/m
LINE: /^.+$/m
HEADER_NAME: /^[A-Z]+$/m
DASHED_LINE: /^-{100}-*$/m
%import common.NEWLINE -> _NL
"""

And a test file:

other_line1
WINGETAA
-----------------------------------------------------------------------------------------------------------------------
column1   222  2 2222222   column3     column4    column5        column6 6666   777777 777

content_line_2           column2 Column2   column3  column3
-----------------------------------------------------------------------------------------------------------------------
WINGETBB
-----------------------------------------------------------------------------------------------------------------------
content_line_3
content_line_4
content_line_5
-----------------------------------------------------------------------------------------------------------------------
other_line_1
-----------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
other_line_2
-----------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
WINGETPPPP
-----------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
content_line_6
WINGETCC
-----------------------------------------------------------------------------------------------------------------------
content_line_7
content_line_8
-----------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------

The problem is the rule item.2: header item_content_line* dashed_line_end?, specifically the item_content_line* part, which consumes more lines than it should. For example, it consumes dashed lines, which are supposed to end the item.

I am using a negative lookahead assertion in ITEM_CONTENT_LINE to stop item_content_line from matching any further lines once the item should end. This works.
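To make the effect of the lookahead concrete, here is a minimal sketch that applies the same regular expression with Python's re module, outside of Lark. The helper name classify and the sample text are made up for illustration, and the dashed lines are assumed to be any run of 100 or more dashes. It only shows which line starts the pattern accepts; it does not reproduce how Lark's lexer chooses between terminals.

```python
import re

# Same pattern as the ITEM_CONTENT_LINE terminal, with the /m (MULTILINE) flag.
ITEM_CONTENT_LINE = re.compile(r"(?!^-{100}-*$|^[A-Z]+$\n^-{100}-*$)^.+$", re.M)

DASHES = "-" * 110  # assumption: any dashed line of at least 100 dashes


def classify(text):
    """Return (line, matched) pairs, trying the pattern at the start of each line."""
    results = []
    pos = 0
    for line in text.splitlines(keepends=True):
        matched = ITEM_CONTENT_LINE.match(text, pos) is not None
        results.append((line.rstrip("\n"), matched))
        pos += len(line)
    return results


sample = "\n".join(
    ["content_line_3", DASHES, "WINGETBB", DASHES, "WINGETCC", "content_line_7"]
) + "\n"

for line, matched in classify(sample):
    print(f"{matched!s:5} {line[:20]}")
```

In this sample the dashed lines are rejected, and so is WINGETBB because a dashed line follows it. WINGETCC is accepted because the line after it is not dashed, so it is not treated as a header.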

Is there another way to do it with rules or terminals? The item_content_line is supposed to end with either a dashed line or a header followed by a dashed line:

WINGETAA
-----------------------------------------------------------------------------------------------------------------------

An item consists of a header like 'WINGET' at the start of a line, followed by a dashed line, then the content, and finally a closing dashed line. If the next header comes before the closing dashed line, that header starts a new item instead. Other lines can appear in the file between the items.

But I can't get it to work without using the negative lookahead in ITEM_CONTENT_LINE (at the beginning of ITEM_CONTENT_LINE).

The item_content_line* should end as soon as it hits a dashed line, or a header followed by a dashed line. Then it should go to other_line, dashed_line_end, or another new item.
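The termination rule described here can be written down as a plain line-by-line scan, which may help as a reference for what any grammar-based solution should produce. This is only a sketch of the intended segmentation as I read it, not a Lark grammar and not Lark's output. The names segment, is_header and is_dashed are hypothetical, blank lines are assumed to carry no meaning, and every line outside an item is labelled "other".

```python
import re

HEADER_NAME = re.compile(r"[A-Z]+")
DASHED_LINE = re.compile(r"-{100}-*")


def is_dashed(lines, i):
    """True if line i exists and is a dashed line."""
    return i < len(lines) and DASHED_LINE.fullmatch(lines[i]) is not None


def is_header(lines, i):
    """True if line i is an all-caps name immediately followed by a dashed line."""
    return (
        i < len(lines)
        and HEADER_NAME.fullmatch(lines[i]) is not None
        and is_dashed(lines, i + 1)
    )


def segment(text):
    """Return ("other", line) and ("item", name, content, closed) records."""
    # Assumption: blank lines are dropped before scanning.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    records = []
    i = 0
    while i < len(lines):
        if not is_header(lines, i):
            records.append(("other", lines[i]))
            i += 1
            continue
        name = lines[i]
        i += 2  # skip the header name and the dashed line under it
        content = []
        # Content stops at a dashed line or at the next header.
        while i < len(lines) and not is_dashed(lines, i) and not is_header(lines, i):
            content.append(lines[i])
            i += 1
        closed = is_dashed(lines, i)
        if closed:
            i += 1  # consume the closing dashed line
        records.append(("item", name, content, closed))
    return records
```

The closed flag records whether the item ended with its own dashed line (True) or was cut short by the next header or the end of the input (False).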

But if I replace the ITEM_CONTENT_LINE terminal with the plain LINE terminal, item_content_line* starts consuming the dashed lines, which are supposed to end the item immediately.

This produces the following buggy output:

start
  other_line    other_line1
  item
    header
      WINGETAA
      -----------------------------------------------------------------------------------------------------------------------
    item_content_line   column1   222  2 2222222   column3     column4    column5        column6 6666   777777 777
    item_content_line   content_line_2           column2 Column2   column3  column3
    dashed_line_end     -----------------------------------------------------------------------------------------------------------------------
  item
    header
      WINGETBB
      -----------------------------------------------------------------------------------------------------------------------
    item_content_line   content_line_3
    item_content_line   content_line_4
    item_content_line   content_line_5
    item_content_line   -----------------------------------------------------------------------------------------------------------------------
    item_content_line   other_line_1
    item_content_line   -----------------------------------------------------------------------------------------------------------------------
    item_content_line   -----------------------------------------------------------------------------------------------------------------------
    item_content_line   other_line_2
    item_content_line   -----------------------------------------------------------------------------------------------------------------------
    dashed_line_end     -----------------------------------------------------------------------------------------------------------------------
  item
    header
      WINGETPPPP
      -----------------------------------------------------------------------------------------------------------------------
    item_content_line   -----------------------------------------------------------------------------------------------------------------------
    item_content_line   -----------------------------------------------------------------------------------------------------------------------
    item_content_line   -----------------------------------------------------------------------------------------------------------------------
    item_content_line   -----------------------------------------------------------------------------------------------------------------------
    item_content_line   -----------------------------------------------------------------------------------------------------------------------
    item_content_line   content_line_6
  item
    header
      WINGETCC
      -----------------------------------------------------------------------------------------------------------------------
    item_content_line   content_line_7
    item_content_line   content_line_8
    item_content_line   -----------------------------------------------------------------------------------------------------------------------
    item_content_line   -----------------------------------------------------------------------------------------------------------------------
    dashed_line_end     -----------------------------------------------------------------------------------------------------------------------
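One plausible reading of the output above is that the terminals overlap: the plain LINE pattern also matches every dashed line and every header name, so a rule built on LINE has no reason of its own to stop at them. The sketch below checks that overlap with Python's re module only. The helper matching_terminals is hypothetical, and the check says nothing about how Lark resolves the ambiguity internally.

```python
import re

# The three terminal patterns from the grammar, with the /m (MULTILINE) flag.
TERMINALS = {
    "LINE": re.compile(r"^.+$", re.M),
    "HEADER_NAME": re.compile(r"^[A-Z]+$", re.M),
    "DASHED_LINE": re.compile(r"^-{100}-*$", re.M),
}


def matching_terminals(line):
    """Return the names of all terminals whose pattern matches the whole line."""
    return [name for name, pattern in TERMINALS.items() if pattern.fullmatch(line)]


for text in ("-" * 110, "WINGETBB", "content_line_3"):
    print(f"{text[:20]:22} {matching_terminals(text)}")
```

Only the content line is matched by LINE alone; the other two are each matched by two terminals.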

With the negative lookahead regex in ITEM_CONTENT_LINE, the grammar works in Lark. But I'd prefer to do it with Lark rules or terminals.

The other_line rule may also match a dashed line in the file, as long as that line is not part of an item header or an item's dashed_line_end.

I have also tried adjusting the rule priorities, but I can't get results as precise as with the negative lookahead.

Full test code for reference or running:

from lark import Lark

# Negative lookahead so the line can't be a dashed line or the next header
#ITEM_CONTENT_LINE: /(?!^-{100}-*$|^[A-Z]+$\n^-{100}-*$)^.+$/m

lark1_grammar = r"""
start: (item | other_line | item_content_line)*
other_line: LINE? _NL
item.2: header item_content_line* dashed_line_end?
header: HEADER_NAME _NL DASHED_LINE _NL
item_content_line: ITEM_CONTENT_LINE _NL
dashed_line_end: DASHED_LINE _NL

ITEM_CONTENT_LINE: /(?!^-{100}-*$|^[A-Z]+$\n^-{100}-*$)^.+$/m
LINE: /^.+$/m
HEADER_NAME: /^[A-Z]+$/m
DASHED_LINE: /^-{100}-*$/m
%import common.NEWLINE -> _NL
"""

def test_lark():
    # with open("2023-02-02 Windows Packages Winget Scoop Chocolatey.txt") as file:
    # Read the sample input shown above, saved as test.txt.
    with open("test.txt") as file:
        text = file.read()
    # Build the parser with Lark's default settings and print the parse tree.
    parser = Lark(lark1_grammar)
    tree = parser.parse(text)
    print(tree.pretty())

test_lark()