lark-parser / lark_cython

Cython plugin for Lark, reimplementing the LALR parser & lexer for better performance
MIT License
44 stars 5 forks source link

Fails to parse round-trip string objects from ruamel.yaml #36

Open Erotemic opened 1 month ago

Erotemic commented 1 month ago

Describe the bug

When parsing text from a YAML file in ruamel.YAML it it returns a SingleQuotedScalarString object, which does inherit from the str type. Sending this string to the pure-python lark parser seems to work fine, but when sending it to the cython variant it throws a TypeError.

To Reproduce

The following is a MWE that reproduces the issue:

"""
Requirements:
    pip install ruamel.yaml lark-cython lark
"""
import io
import ruamel.yaml

NEW_RUAMEL = 1

class _YamlRepresenter:

    @staticmethod
    def str_presenter(dumper, data):
        # https://stackoverflow.com/questions/8640959/how-can-i-control-what-scalar-form-pyyaml-uses-for-my-data
        if len(data.splitlines()) > 1 or '\n' in data:
            text_list = [line.rstrip() for line in data.splitlines()]
            fixed_data = '\n'.join(text_list)
            return dumper.represent_scalar('tag:yaml.org,2002:str', fixed_data, style='|')
        return dumper.represent_scalar('tag:yaml.org,2002:str', data)

def _custom_new_ruaml_yaml_obj():
    """
    References:
        https://stackoverflow.com/questions/59635900/ruamel-yaml-custom-commentedmapping-for-custom-tags
        https://stackoverflow.com/questions/528281/how-can-i-include-a-yaml-file-inside-another
        https://stackoverflow.com/questions/76870413/using-a-custom-loader-with-ruamel-yaml-0-15-0
    """

    # make a new instance, although you could get the YAML
    # instance from the constructor argument
    class CustomConstructor(ruamel.yaml.constructor.RoundTripConstructor):
        ...

    class CustomRepresenter(ruamel.yaml.representer.RoundTripRepresenter):
        ...

    CustomRepresenter.add_representer(str, _YamlRepresenter.str_presenter)
    yaml_obj = ruamel.yaml.YAML()
    yaml_obj.Constructor = CustomConstructor
    yaml_obj.Representer = CustomRepresenter
    yaml_obj.preserve_quotes = True
    yaml_obj.width = float('inf')
    return yaml_obj

def codeblock(text):
    """
    Create a block of text that preserves all newlines and relative indentation
    """
    import textwrap
    return textwrap.dedent(text).strip('\n')

# For common constructs see:
# https://github.com/lark-parser/lark/blob/master/lark/grammars/common.lark
RESOLUTION_GRAMMAR_PARTS = codeblock(
    '''
    // Resolution parts of the grammar.
    magnitude: NUMBER

    unit: WORD

    numeric_unit: (magnitude WS* unit)
    implicit_unit: unit

    resolved_unit: numeric_unit | implicit_unit

    %import common.NUMBER
    %import common.WS
    %import common.WORD
    ''')

RESOLVED_UNIT_GRAMMAR = codeblock(
    r'''
    // RESOLVED WINDOW GRAMMAR. Eg. 2GSD
    ?start: resolved_unit
    ''') + '\n' + RESOLUTION_GRAMMAR_PARTS

def main():
    yaml_obj = _custom_new_ruaml_yaml_obj()
    file = io.StringIO("{key: '1mGSD'}")
    data = yaml_obj.load(file)
    text = data['key']

    # https://github.com/lark-parser/lark/blob/master/docs/_static/lark_cheatsheet.pdf
    import lark
    try:
        import lark_cython
        parser = lark.Lark(RESOLVED_UNIT_GRAMMAR, start='start', parser='lalr', _plugins=lark_cython.plugins)
    except ImportError:
        parser = lark.Lark(RESOLVED_UNIT_GRAMMAR, start='start', parser='lalr')

    print(f'{type(text)=}')
    print(f'{text.__class__.__mro__=}')

    parser.parse(text)

if __name__ == '__main__':
    """
    CommandLine:
        python ~/code/lark_cython/tests/test_yaml.py
    """
    main()

The type information it prints before it fails is:

type(text)=<class 'ruamel.yaml.scalarstring.SingleQuotedScalarString'>
text.__class__.__mro__=(<class 'ruamel.yaml.scalarstring.SingleQuotedScalarString'>, <class 'ruamel.yaml.scalarstring.ScalarString'>, <class 'str'>, <class 'object'>)

I've tested with versions:

3.11.9 (main, May 14 2024, 08:04:54) [GCC 12.2.0]
ruamel.yaml.__version__ = 0.18.6
lark.__version__ = 1.1.9
lark_cython.__version__ = 0.0.15

And

3.11.9 (main, May 13 2024, 14:03:39) [GCC 11.4.0]
ruamel.yaml.__version__ = 0.17.22
lark.__version__ = 1.1.7
lark_cython.__version__ = 0.0.15

My thought is that cython would handle a class that inherits from a str, but perhaps it doesn't I'm not sure if this can be fixed on the lark-cython side, but I figured it was worth reporting.

My current workarond is to do something like this:

        try:
            tree = parser.parse(text)
        except TypeError:
            if isinstance(text, str) and type(text) is not str:
                # We could be in a case where cython is failing to handle
                # overloaded string types. Try casting to a regular str.
                tree = parser.parse(str(text))
            else:
                raise
erezsh commented 1 month ago

Can you include the exception?

Also, why not always use parser.parse(str(text)) ?

Erotemic commented 1 month ago

It's a TypeError from the pyx file:

File ~/.pyenv/versions/3.11.9/envs/pyenv3.11.9/lib/python3.11/site-packages/lark/parser_frontends.py:100, in ParsingFrontend.parse(self, text, start, on_error)
     98 chosen_start = self._verify_start(start)
     99 kw = {} if on_error is None else {'on_error': on_error}
--> 100 stream = self._make_lexer_thread(text)
    101 return self.parser.parse(stream, chosen_start, **kw)

File ~/.pyenv/versions/3.11.9/envs/pyenv3.11.9/lib/python3.11/site-packages/lark/parser_frontends.py:95, in ParsingFrontend._make_lexer_thread(self, text)
     93 def _make_lexer_thread(self, text: str):
     94     cls = (self.options and self.options._plugins.get('LexerThread')) or LexerThread
---> 95     return text if self.skip_lexer else cls.from_text(self.lexer, text)

File lark_cython/lark_cython.pyx:384, in lark_cython.lark_cython.LexerThread.from_text()

TypeError: Argument 'text' has incorrect type (expected str, got SingleQuotedScalarString)

Even though the type I'm passing through is an instance of a class that inherits from str, it is not a base str, and I'm thinking Cython doesn't like that. It could just be that this is a Cython issue, I didn't see anything in the pyx file that looked like it was the implementation checking for an exact str.

The reason not to blindly cast the input to str is because it should raise a TypeError if I give it a bad type. Chances are a bad object would result in a parse error, but there is also a chance it wouldn't, so doing that cast is undesirable.

Even my workaround is undesirable because a class that inherits from str could still define __str__ and mess things up:

class MyStr(str):
    def __str__(self):
        return 'foo'

text = MyStr('hello')
assert text == 'hello'
# Undesirable, but pathological
assert str(text) == 'foo'

Although that second case is fairly pathological.

erezsh commented 1 month ago

Yes, the text is defined as str (for performance reasons), and I suppose Cython doesn't accept objects that inherit from str, since it wants it the actual type to be a simple wchar[] (or whatever).

Erotemic commented 1 month ago

That's my conclusion as well, but I haven't been able to find docs that explicitly state that is the case.

I'm not sure if this needs to be fixed in lark-cython or if it is better left to consumers. The workaround I'm using could be inserted either in lark.parser_frontends or by adding an additional pure-python layer in lark-cython. It's a fairly niche gotcha, and maybe just having this issue exist is good enough so the workaround is googleable.