Support 'bytes' for input and Tokens, in addition to 'str'

ctrlcctrlv commented 4 years ago

Describe the bug

https://github.com/lark-parser/lark/blob/b5abf2d7afcd70af2be6c57aa454e532c8a99a0b/lark/lexer.py#L99

This merely masks what in 2020 is almost always a bug. If you must do something other than raise an error, decoding with backslashreplace is better. If this is a workaround because you're sometimes allowing Token to be a bytes...just allow its value to be a bytes...
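
To make the difference concrete, here is a minimal sketch using only the standard library. It does not reproduce Lark's code, and the sample bytes are an arbitrary choice; it just compares the two ways of forcing undecodable bytes into a `str`:

```python
# Bytes that are valid UTF-8 but not ASCII: the encoding of "地".
raw = "地".encode("utf-8")  # b'\xe5\x9c\xb0'

# Latin-1 maps every byte to the code point with the same number, so this
# never fails, but the result looks like ordinary (wrong) text.
as_latin1 = raw.decode("latin1")

# backslashreplace also never fails, but it leaves visible escapes behind,
# so the result cannot be mistaken for correctly decoded text.
as_escaped = raw.decode("ascii", "backslashreplace")

print(repr(as_latin1))
print(as_escaped)
```

Either way the bytes survive, but only the second form makes it obvious at a glance that something was not decoded.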

erezsh commented 4 years ago

Can you provide an example in which it poses an issue?

ctrlcctrlv commented 4 years ago

Given how you responded to me in #624, I don't know if I should even bother posting this, but I already wrote it, so...if you don't want my help, just say so, and I'll go away. I gave you a simple example that causes a crash there. There are plenty of projects I can give my time to in a day...doesn't need to be yours... 😄

Anyway...It is quite common, actually, to use LALR parsers to parse binary formats which have regular grammars and don't require more advanced state machines. If I'm not mistaken, BMP files (Windows bitmap) are simple enough to be parsed with an LALR parser.

But I don't have time to implement one. So one other simple binary regular format is, actually, UTF8 itself! Here's a grammar I wrote:

utf8.lark

start: BOM? char*
BOM: "\xef\xbb\xbf"
char: CHAR1 | CHAR2 | CHAR3 | CHAR4
CONTINUATION_BYTE: "\x80" .. "\xbf"
CHAR1: "\x00" .. "\x7f"
CHAR2: "\xc0" .. "\xdf" CONTINUATION_BYTE
CHAR3: "\xe0" .. "\xef" CONTINUATION_BYTE CONTINUATION_BYTE
CHAR4: "\xf0" .. "\xf7" CONTINUATION_BYTE CONTINUATION_BYTE CONTINUATION_BYTE

This works fine, and other than the BOM, doesn't look half bad. Sadly, it is ugly to use, because Latin-1 seems to be the sanctioned Lark "way of doing binary", instead of Python's way, which is bytestrings:

utf8.py

# Example script for UTF8 grammar utf8.lark
# By Fredrick R Brennan, 2020
# This file is released into the public domain. I dedicate the work to the public domain by waiving all of my rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. 

import lark
import regex
import sys
sys.stdout = open(sys.stdout.fileno(), mode="w", encoding="utf8", buffering=1)
import os

# abspath() avoids opening "/utf8.lark" when __file__ has no directory part
with open(os.path.join(os.path.dirname(os.path.abspath(__file__)), "utf8.lark"), "r") as f:
    UTF8_Grammar = f.read()

UTF8_Parser = lark.Lark(UTF8_Grammar, regex=True, parser="lalr")
# Without the decode("latin1"), Lark crashes because it calls str-only functions on the input.
# One quick example: lexer.py:TraditionalLexer.match calls mre.match on the "stream",
# which fails on bytes with either regex=True or regex=False.
tree = UTF8_Parser.parse("🔣 地球の絵はグリーンでグッド? Chikyū no e wa gurīn de guddo".encode("utf-8").decode("latin1"))

# so that's why we do some hacking w/these 2 classes
class Char(str):
    def __str__(self):
        return self.__repr__()

    def __repr__(self):
        return self.encode("latin1").decode("ascii", "backslashreplace")

class UTF8MakeTreePrettier(lark.Visitor):
    def char(self, tree):
        tree.data = tree.children[0].type
        tree.children[0] = Char(tree.children[0].value)

UTF8MakeTreePrettier().visit(tree)
print(tree.pretty())

for enc, j in [("sjis", "地球の絵はグリーンでグッド?  Chikyuu no e wa guriin de guddo"), 
          ("sjis", "売春婦"), 
          ("euc-jp", "乂鵬鵠")]:
    try:
        j.encode(enc).decode("utf-8")
    except UnicodeDecodeError as e:
        print("LARK should fail: ", e.reason)
        print(e)
        print("=")
    try:
        junk = UTF8_Parser.parse(j.encode(enc).decode("latin1"))
        print(junk.pretty())
    except lark.UnexpectedCharacters as e:
        print("LARK failed: {}".format(str(e)))
        print("~")

Output is...not great without the Char/UTF8MakeTreePrettier hackery:

start
  char  ð£
  char   
  char  å°
  char  ç
  char  ã®
  char  çµµ
  char  ã¯
  char  ã°
  char  ãª
  char  ã¼
  char  ã³
  char  ã§
  char  ã°
  char  ã
  char  ã
  char  ?
  char   
  char  C
  char  h
  char  i
  char  k
  char  y
  char  Å«
  char   
  char  n
  char  o
  char   
  char  e
  char   
  char  w
  char  a
  char   
  char  g
  char  u
  char  r
  char  Ä«
  char  n
  char   
  char  d
  char  e
  char   
  char  g
  char  u
  char  d
  char  d
  char  o

With hackery:

start
  CHAR4 \xf0\x9f\x94\xa3
  CHAR1  
  CHAR3 \xe5\x9c\xb0
  CHAR3 \xe7\x90\x83
  CHAR3 \xe3\x81\xae
  CHAR3 \xe7\xb5\xb5
  CHAR3 \xe3\x81\xaf
  CHAR3 \xe3\x82\xb0
  CHAR3 \xe3\x83\xaa
  CHAR3 \xe3\x83\xbc
  CHAR3 \xe3\x83\xb3
  CHAR3 \xe3\x81\xa7
  CHAR3 \xe3\x82\xb0
  CHAR3 \xe3\x83\x83
  CHAR3 \xe3\x83\x89
  CHAR1 ?
  CHAR1  
  CHAR1 C
  CHAR1 h
  CHAR1 i
  CHAR1 k
  CHAR1 y
  CHAR2 \xc5\xab
  CHAR1  
  CHAR1 n
  CHAR1 o
  CHAR1  
  CHAR1 e
  CHAR1  
  CHAR1 w
  CHAR1 a
  CHAR1  
  CHAR1 g
  CHAR1 u
  CHAR1 r
  CHAR2 \xc4\xab
  CHAR1 n
  CHAR1  
  CHAR1 d
  CHAR1 e
  CHAR1  
  CHAR1 g
  CHAR1 u
  CHAR1 d
  CHAR1 d
  CHAR1 o

LARK should fail:  invalid start byte
'utf-8' codec can't decode byte 0x92 in position 0: invalid start byte
=
LARK failed: No terminal defined for '' at line 1 col 1

nÌGÍO[ÅObh?  Chikyuu no 
^

Expecting: {'CHAR3', 'CHAR4', 'CHAR1', 'CHAR2'}

~
LARK should fail:  invalid start byte
'utf-8' codec can't decode byte 0x94 in position 0: invalid start byte
=
LARK failed: No terminal defined for '' at line 1 col 1

tw
^

Expecting: {'CHAR3', 'CHAR4', 'CHAR1', 'CHAR2'}

~
LARK should fail:  invalid start byte
'utf-8' codec can't decode byte 0xb9 in position 4: invalid start byte
=
LARK failed: No terminal defined for '¹' at line 1 col 5

Ð©Ë²¹ô
    ^

Expecting: {'CHAR3', 'CHAR4', 'CHAR1', 'CHAR2'}

Previous tokens: Token(CHAR2, 'Ë²')

~

In summary, although it works, using Latin-1 as a passthrough is a code smell and I hope some day you make it so Lark can work on byte strings.
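
For anyone who needs the workaround in the meantime, the passthrough can at least be kept in one place. A minimal sketch, assuming nothing about Lark beyond the fact that it accepts `str`; `bytes_to_text` and `text_to_bytes` are hypothetical helper names, not part of Lark:

```python
def bytes_to_text(data: bytes) -> str:
    """Map each byte 0x00-0xFF to the code point of the same value."""
    return data.decode("latin1")


def text_to_bytes(text: str) -> bytes:
    """Inverse of bytes_to_text; raises UnicodeEncodeError above U+00FF."""
    return text.encode("latin1")


# The mapping is lossless for every possible byte value, which is the only
# reason the Latin-1 trick works at all.
all_bytes = bytes(range(256))
assert text_to_bytes(bytes_to_text(all_bytes)) == all_bytes

# A token value that came back from the parser as str can be turned back
# into the bytes that were actually in the input.
token_value = bytes_to_text("ū".encode("utf-8"))
print(text_to_bytes(token_value))
```

This keeps the encode/decode pair symmetric, but every caller still has to remember to apply it, which is the smell.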

erezsh commented 4 years ago

So, just to be clear, the issue isn't with latin1, but that Lark doesn't work with bytes, right?

> But I don't have time to implement one.

But you want me to fix things for you for free? That's funny.

I'm not opposed to adding support for bytes, if anyone sends a PR that's compatible with Python 2. But since the vast majority of Lark users don't care about it, it's not a high priority for me to fix right now.

ctrlcctrlv commented 4 years ago

Not having time to implement a BMP decoder I'll never use is not necessarily the same thing as not having time to PR bytes, which I would probably use for other things as well :-)

@erezsh It's up to you what your personal roadmap looks like...an issue is just that, an issue...sometimes I come up with issues maintainers never thought of, and they like the idea and do it even if it takes them some time. I know how things work in the open source world...sometimes maintainers don't care about what I find, and my issue remains open indefinitely until either I fix it, they do, or a third party does. Sometimes none of that happens, and that's okay too. You really don't have to behave this way...sarcastic answers, etc. Why have a tracker if you don't want new issues? Just sticky this sentence:

> But you want me to fix things for you for free? That's funny.

And be done with it. 🤷‍♂️ I'm not ordering you to do anything. You really don't need the sarcasm. This is more than sufficient:

> I'm not opposed to adding support for bytes, if anyone sends a PR that's compatible with Python 2. But since the vast majority of Lark users don't care about it, it's not a high priority for me to fix right now.

erezsh commented 4 years ago

@ctrlcctrlv Tbh, you came off as rude, so I was replying in turn. Saying you "don't have time" reminded me of the other issue where you refused to edit your example because it's "good enough" while still asking for my help, like I have time to debug your code.

But let's put it aside. If you want to submit a PR, that's fine. I'll be happy to help you in the process.

You didn't reply to this:

> So, just to be clear, the issue isn't with latin1, but that Lark doesn't work with bytes, right?

If I'm going to keep this issue open, I want to make sure it has an informative title.

ctrlcctrlv commented 4 years ago

Well, no, @erezsh, I think you misunderstood me. I didn't refuse to edit the example; I was trying to say in a nice way that I couldn't think of how to cut anything out but still show the problem.

I guess, re-reading and meditating on your message, you meant that I should cut out the example grammar and build the tree myself, basically, in code, like:

tree = lark.Tree("root", [
    lark.Token("leaf", "L"),
    lark.Tree("branch", [lark.Token("leaf", "M"), lark.Token("leaf", "N")]),
    lark.Token("leaf", "O"),
])

Then make a lark.Transformer for it.

"Quite small" was supposed to mean, "I'm not sure how to make it smaller", not "I won't do it". Sorry if it came off the other way to you. I thought maybe an explanation would be okay in lieu of minifying since I couldn't think of a substantial cut. In future, if I'm refusing to ever do something, I'll make that very clear, putting in there somewhere "No."...if I don't put that, I'm still open to it.

Back onto this topic:

> So, just to be clear, the issue isn't with latin1, but that Lark doesn't work with bytes, right?

No, it's not with Latin-1 per se, correct. Latin-1 is the symptom, the workaround, to the main issue, that bytes aren't supported.
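
As a rough illustration of what byte-level matching looks like, Python's own `re` module already accepts `bytes` patterns as long as the input is `bytes` too. The sketch below mirrors the four CHAR terminals from utf8.lark using only the standard library; it is not how Lark is implemented, and `TOKEN_RE` and `lex_utf8` are made-up names:

```python
import re

# Byte-pattern equivalents of CHAR1..CHAR4 from utf8.lark (same ranges,
# including its simplifications; this is not a strict UTF-8 validator).
TOKEN_RE = re.compile(
    rb"(?P<CHAR1>[\x00-\x7f])"
    rb"|(?P<CHAR2>[\xc0-\xdf][\x80-\xbf])"
    rb"|(?P<CHAR3>[\xe0-\xef][\x80-\xbf]{2})"
    rb"|(?P<CHAR4>[\xf0-\xf7][\x80-\xbf]{3})"
)


def lex_utf8(data: bytes):
    """Yield (terminal_name, matched_bytes) pairs; values stay bytes."""
    pos = 0
    while pos < len(data):
        m = TOKEN_RE.match(data, pos)
        if m is None:
            raise ValueError(
                f"no terminal matches byte {data[pos]:#04x} at offset {pos}"
            )
        yield m.lastgroup, m.group()
        pos = m.end()


tokens = list(lex_utf8("gurīn".encode("utf-8")))
print(tokens)

# Mixing a bytes pattern with str input is rejected outright; a str-only
# lexer handed bytes hits the mirror image of this TypeError.
try:
    TOKEN_RE.match("gurīn")
except TypeError as exc:
    print("TypeError:", exc)
```

With values kept as `bytes`, no Latin-1 round trip and no `Char` wrapper would be needed to print them sensibly.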

lark-parser / lark

Support 'bytes' for input and Tokens, in addition to 'str' #626