Unable to parse Arabic text

lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

MIT License


Unable to parse Arabic text #1405

Closed. jmishra01 closed this issue 2 months ago.

jmishra01 commented 3 months ago

Lark fails to parse Arabic text. The sample Python code below reproduces the issue:

import lark

text_1 =  "'راض جداً'"

grammar = """
start: string
string: "'" ( /([\w\s]|''|"|\\'|\\|[`~&!@#$%\^\*\(\)-_=+])+/| /[u"\u0001-\uFFFF"]/ ) "'"

%import common (CNAME, WS, SIGNED_NUMBER, INT)
%import common.NEWLINE -> _NL
%import common.WS_INLINE
%ignore WS_INLINE
%ignore WS
"""

parser = lark.Lark(grammar = grammar)
parse_text_1 = parser.parse(text_1)
print(parse_text_1)

MegaIng commented 3 months ago

Lark is perfectly capable of parsing Arabic text; your grammar just doesn't match the given text. Depending on what exactly you meant to do, you need to change your definition of string. Most notably, the second alternative you added, /[u"\u0001-\uFFFF"]/, matches (almost) any single character, which probably isn't what you want.

erezsh commented 3 months ago

Your definition of string doesn't include repetition... it can only match a single character.

There are online regexp IDEs that can help you. You can also test regexps directly using Python's re module.
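
As a rough illustration of testing the pieces with Python's re module, the sketch below checks two simplified stand-ins for the alternatives in the original string rule. The names body, word_or_space and any_single_char are made up for this example, and the patterns are deliberately reduced versions of the ones in the grammar, so treat it as a diagnostic aid under those assumptions rather than a statement of how Lark evaluates the rule. It suggests one likely reason the first alternative stops early: the sample ends with a combining mark, which \w does not match.

```python
import re
import unicodedata

# Body of the sample string (quotes removed), written with escapes so each
# code point is explicit. The final code point is U+064B.
body = "\u0631\u0627\u0636 \u062c\u062f\u0627\u064b"

# Simplified stand-in for the first alternative: word or whitespace
# characters, repeated.
word_or_space = re.compile(r"[\w\s]+")
# Simplified stand-in for the second alternative: one character class
# with no repetition operator.
any_single_char = re.compile(r"[\u0001-\uffff]")

# How far does the repeated class get before it has to stop?
prefix = word_or_space.match(body).group(0)
stopped_at = body[len(prefix)]
print(len(prefix), "of", len(body), "characters matched")
print("stopped at:", unicodedata.name(stopped_at), unicodedata.category(stopped_at))

# Without repetition, the class matches one character but never the whole body.
print(any_single_char.fullmatch(body[0]) is not None)
print(any_single_char.fullmatch(body) is None)
```

If the repeated class stops short of the end of the body, the closing quote in the rule cannot match at that position, which is consistent with the parse failure reported above.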

jmishra01 commented 2 months ago

Thanks, @MegaIng and @erezsh, for the quick reply.

The problem is resolved with the grammar below.

# Raw string (r"""), so the doubled backslashes in the regex reach Lark unchanged.
grammar = r"""
start: string
string:  /'([^'\\]*(?:\\.[^'\\]*)*)'/

%import common (CNAME, WS, SIGNED_NUMBER, INT)
%import common.NEWLINE -> _NL
%import common.WS_INLINE
%ignore WS_INLINE
%ignore WS
"""
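
For readers who want to check the regular expression on its own, here is a minimal sketch that exercises it with Python's re module, outside of Lark. The variable names and the extra sample strings are assumptions added for illustration; only the pattern itself and the Arabic sample come from the thread. The pattern is written as a raw string so that the doubled backslashes reach the regex engine as written.

```python
import re

# The pattern from the grammar above, without the surrounding / delimiters.
quoted = re.compile(r"'([^'\\]*(?:\\.[^'\\]*)*)'")

samples = [
    "'\u0631\u0627\u0636 \u062c\u062f\u0627\u064b'",  # the Arabic sample from the issue
    r"'it\'s'",                                        # backslash-escaped quote inside
    "''",                                              # empty string
]

for sample in samples:
    match = quoted.fullmatch(sample)
    # ascii() keeps the output readable on terminals without Arabic fonts.
    print(ascii(sample), "->", match is not None)

# A string with no closing quote is rejected.
unterminated = quoted.fullmatch("'abc")
print(unterminated is None)
```

The pattern accepts any character other than a quote or backslash, plus backslash-escaped pairs, so it does not depend on which script the text is written in.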