lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

file name and extension grammar #277

Closed gercinojr closed 5 years ago

gercinojr commented 5 years ago

lark-parser: 0.6.5, Python: 3.6.6. My code is not working. Can anybody help me?

from lark import Lark

grammar = '''
start       : filename
filename    : NAME "." EXTENSION
EXTENSION   : "mp3" | "wav" | "flac" | "wma" | "ogg"
CHAR        : /[a-zA-Z]/
WORD        : CHAR+ 
NAME        : WORD (" " WORD)*
'''

# parser
p = Lark(grammar, parser="earley")

filename1 = "flaming.mp3"         # ok! :-)
#filename2 = "pow r. toc h..mp3"  # error! :-(

# tree
t1 = p.parse( filename1 )
#t2 = p.parse( filename2 )
print( t1 )
#print( t2 )

Even if I change the grammar to

CHAR        : /[a-zA-Z.]/

(adding a dot to the character class), I get errors again :-(

erezsh commented 5 years ago

What are you trying to achieve? Parsing filenames?

What error are you getting, and what output do you expect?

gercinojr commented 5 years ago

Hello. Yes, the goal is to parse a list of file names (approximately 6500).

The error I'm getting is:

File "/home/gercino/.local/lib/python3.6/site-packages/lark/parsers/xearley.py", line 119, in scan
    raise UnexpectedCharacters(stream, i, text_line, text_column, {item.expect for item in to_scan}, set(to_scan))
lark.exceptions.UnexpectedCharacters: No terminal defined for ' ' at line 1 col 7

pow r. toc h..mp3
      ^

Expecting: {Terminal('EXTENSION')}

I would like it to identify the file names (with or without dots in their names) as well as their extensions.

Thanks for your attention.

erezsh commented 5 years ago

Without looking too closely, I'm guessing that the bug is that right now EXTENSION is mandatory.

Try:

filename    : NAME ["." EXTENSION]

erezsh commented 5 years ago

Also, why are r. and h..mp3 legal filenames?

gercinojr commented 5 years ago

Hi. The code has two filenames:

filename1 = "flaming.mp3"  
filename2 = "pow r. toc h..mp3"  # just one filename
# pink floyd/1967 - the piper at the gates of dawn/05 - pow r. toc h..mp3

I changed the grammar to

grammar = '''
start       : filename
filename    : NAME ["." EXTENSION]
EXTENSION   : "mp3" | "wav" | "flac" | "wma" | "ogg"
CHAR        : /[a-zA-Z0-9.]/
WORD        : CHAR+ 
NAME        : WORD (" " WORD)*
'''

and the result was

Tree(start, [Tree(filename, [Token(NAME, 'flaming.mp3')])])
Tree(start, [Tree(filename, [Token(NAME, 'pow r. toc h..mp3')])])

The extension is no longer being recognized.

The first version of the code worked on an earlier version of Lark; I do not remember which one. After I updated Python and Lark, the code stopped working. The problem is that the dot before the extension is being consumed as if it were part of the file name. In the older version the dot was recognized as the separator between the file name and the extension, but not now. :-(

gercinojr commented 5 years ago

It is as if parser="earley" were behaving like parser="lalr".

erezsh commented 5 years ago

@gercinojr Yes, I understand now. This happened because of a change I added to the default Earley behavior, which was intended to fix a performance issue.

However, the old behavior is still available under a special lexer:

grammar = '''
start       : filename
filename    : NAME "." EXTENSION
EXTENSION   : "mp3" | "wav" | "flac" | "wma" | "ogg"
CHAR        : /[a-zA-Z.]/
WORD        : CHAR+
NAME        : WORD (" " WORD)*
'''

# parser
p = Lark(grammar, parser="earley", lexer="dynamic_complete")

This should work.
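
One way to picture the difference between the two lexers is the rough analogy below. It uses plain `re` and is not Lark's actual implementation. It assumes the default dynamic lexer keeps only the longest match for a terminal, while `dynamic_complete` also tries the shorter ones. The function names are hypothetical; the character class is the `[a-zA-Z.]` one from the grammar above.

```python
import re

# NAME and EXTENSION as plain regexps, mirroring the grammar's terminals.
NAME = re.compile(r"[a-zA-Z.]+(?: [a-zA-Z.]+)*")
EXTENSION = re.compile(r"mp3|wav|flac|wma|ogg")


def _ext_after_dot(rest):
    """Return the extension if `rest` is exactly "." EXTENSION, else None."""
    if rest.startswith(".") and EXTENSION.fullmatch(rest[1:]):
        return rest[1:]
    return None


def longest_match_only(text):
    """Keep only the longest NAME match; the separator dot gets swallowed."""
    m = NAME.match(text)
    if m is None:
        return None
    ext = _ext_after_dot(text[m.end():])
    return (m.group(), ext) if ext else None


def every_match_length(text):
    """Try every length NAME could match, longest first."""
    for end in range(len(text), 0, -1):
        if NAME.fullmatch(text, 0, end):
            ext = _ext_after_dot(text[end:])
            if ext:
                return text[:end], ext
    return None


print(longest_match_only("pow r. toc h..mp3"))
print(every_match_length("pow r. toc h..mp3"))
```

Under these assumptions the first function fails because NAME has already consumed the dot that the rule needs as a separator, while the second finds a split that leaves ".mp3" for the rest of the rule.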

I should also note that you can achieve this exact task using only regexps, in case you're inclined to try. Either way, I hope this helps.
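
A minimal sketch of the regexp-only route, assuming the same five extensions as the grammar and that anything before the final ".ext" counts as the name. The pattern and the `split_filename` helper are illustrative names, not part of Lark.

```python
import re

# Greedy ".+" backtracks to the last dot that is followed by a known extension.
FILENAME = re.compile(r"(?P<name>.+)\.(?P<ext>mp3|wav|flac|wma|ogg)")


def split_filename(filename):
    """Return (name, extension); extension is None when no known one is found."""
    m = FILENAME.fullmatch(filename)
    if m:
        return m.group("name"), m.group("ext")
    return filename, None


for f in ["flaming.mp3", "pow r. toc h..mp3", "notes.txt"]:
    print(split_filename(f))
```

Because the name part is matched with `.+`, dots and spaces inside the name need no special handling; only the last dot before a recognized extension acts as the separator.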

gercinojr commented 5 years ago

Hello again! :-) I tried the solution you suggested but another type of error appeared. So I wrote a new file with three lines of code and the error was repeated.

New code to reproduce the error:

from lark import Lark
grammar="""
start: /a-z/
"""
p = Lark(grammar, parser="earley", lexer="dynamic_complete")

error:

Traceback (most recent call last):
  File "test.py", line 5, in <module>
    p = Lark(grammar, parser="earley", lexer="dynamic_complete")
  File "/home/gercino/.local/lib/python3.6/site-packages/lark/lark.py", line 165, in __init__
    self.parser = self._build_parser()
  File "/home/gercino/.local/lib/python3.6/site-packages/lark/lark.py", line 188, in _build_parser
    return self.parser_class(self.lexer_conf, parser_conf, options=self.options)
  File "/home/gercino/.local/lib/python3.6/site-packages/lark/parser_frontends.py", line 122, in __init__
    super(self).__init__(*args, complete_lex=True, **kw)
TypeError: super() argument 1 must be type, not XEarley_CompleteLex

I searched Google for a solution and found this on Stack Overflow:

https://stackoverflow.com/questions/1713038/super-fails-with-error-typeerror-argument-1-must-be-type-not-classobj-when

Your problem is that class B is not declared as a "new-style" class. Change it like so:

class B(object):

and it will work. super() and all subclass/superclass stuff only works with new-style classes. I recommend you get in the habit of always typing that (object) on any class definition to make sure it is a new-style class.

This is what the top-voted answer says.
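
A note on the traceback itself: the failing line is `super(self).__init__(...)`, which passes an instance where `super()` expects a class. On Python 3 every class is already new-style, so the quoted answer (written about Python 2 old-style classes) likely does not apply here. The sketch below reproduces the same kind of TypeError with hypothetical stand-in classes (`Base`, `Broken`, `Fixed`); it is not Lark's code and not necessarily how the fix was made.

```python
class Base:
    def __init__(self, *args, complete_lex=False, **kw):
        self.complete_lex = complete_lex


class Broken(Base):
    def __init__(self, *args, **kw):
        # Bug: super() is given the instance instead of the class.
        super(self).__init__(*args, complete_lex=True, **kw)


class Fixed(Base):
    def __init__(self, *args, **kw):
        # Zero-argument form (Python 3); super(Fixed, self) is equivalent.
        super().__init__(*args, complete_lex=True, **kw)


try:
    Broken()
except TypeError as exc:
    print("Broken:", exc)

print("Fixed:", Fixed().complete_lex)
```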

gercinojr commented 5 years ago

lark-parser: 0.6.5 python: 3.6.6

erezsh commented 5 years ago

You're right, sorry, it's still an issue. I'm planning to release a fix for this soon. Meanwhile, you can try the 0.7b branch, where it should already be working.

gercinojr commented 5 years ago

Thank you for your attention. I'll wait for the new version to be released. :-)

erezsh commented 5 years ago

This should be working now in master. Feel free to re-open if there's still an issue.