Setting non-string values on Tokens

geographika commented 1 year ago

Token values in lark_cython are typed as str.

In my Transformer I'm changing the token values to their correct types, for example int:

    def int(self, t):
        v = t[0]
        v.value = int(v.value)
        # return full Token rather than an int as Token properties are required later
        return v

This throws the following error in lark_cython:

    lark_cython\lark_cython.pyx:20: in lark_cython.lark_cython.Token.value.__set__
    TypeError: Expected unicode, got int

I return the full token in the transformer, rather than simply an int value, as later logic (for error handling etc.) takes advantage of the token properties:

    line, column = key_token.line, key_token.column

Do token values have to be strings in lark_cython? If so, is there any approach which would allow token values to be converted to support int, float etc.?

erezsh commented 1 year ago

Hello!

Yes, Token.value is defined as a string, for performance reasons. It's defined here: https://github.com/lark-parser/lark_cython/blob/master/lark_cython/lark_cython.pyx#L20

But nothing's stopping you from converting them to lark.Tokens, like so:

    return lark.Token.new_borrow_pos(t.type, float(t.value), t)

geographika commented 1 year ago

Thanks @erezsh, and thanks for this project! This approach works fine for all the transformer functions that convert to types other than str. However, does it negate any of the performance boost from using lark_cython? The full test suite went from 1 min 20 s to 1 min 50 s (a very rough benchmark). I'll continue to play around and look at the failing tests.

erezsh commented 1 year ago

This shouldn't have any direct effect on lark_cython's performance. But it's possible that you are creating a lot of Token instances, and that's taking a lot of time. (I do remember mappyfiles having a lot of ints in them.)