lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

Custom Lexer allowing for parsing non-standard spreadsheets from xlsx, csv etc. #1298

Open rooterkyberian opened 1 year ago

rooterkyberian commented 1 year ago

Suggestion

I'm dealing with various spreadsheets in different formats: text-based csv and tsv, but also binary xls and xlsx.

They can look like this:

"Name:", "Just some spreadsheet"
"Additional metadata:", "some more information about this sheet"

"Table"
"no", "value1", "value2"
1, "a", "b"
2, "a10000", "b10000"
...
10000, "a10000", "b10000"

Lark can handle csv fine (albeit with a performance hit compared with the csv module), but xlsx is a no-go.

Now the idea is to do the initial tokenization with the csv or xlrd module, which I guess would mean writing a custom Lexer. Right now Lexers are not really advertised as something user-replaceable. That is, I think I can do it, but it seems like the interface could break at any time; for example, I don't think there is support for a custom LexerState.
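To make the tokenization idea concrete, here is a minimal sketch that uses only the standard library. It assumes every cell arrives as a string, as it does from the csv module. `Tok`, `tokenize_rows`, and the terminal names (`TEXT`, `NUMBER`, `ROW_END`, `BLANK_ROW`) are hypothetical and are not part of lark. In a custom Lexer, each `Tok` would presumably become one lark token yielded from the lexer's `lex` method.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple, Sequence


class Tok(NamedTuple):
    """Hypothetical stand-in for a lexer token: a terminal name plus its text."""
    type: str
    value: str


def tokenize_rows(rows: Iterable[Sequence[str]]) -> Iterator[Tok]:
    """Turn already-split spreadsheet rows into a flat token stream."""
    for row in rows:
        if not any(cell.strip() for cell in row):
            # An empty row separates the metadata block from the table.
            yield Tok("BLANK_ROW", "")
            continue
        for cell in row:
            # csv returns every cell as a string, so classify by content.
            kind = "NUMBER" if cell.strip().isdigit() else "TEXT"
            yield Tok(kind, cell)
        # Mark the row boundary so a grammar can tell rows apart.
        yield Tok("ROW_END", "")


SAMPLE = '''"Name:", "Just some spreadsheet"

"no", "value1"
1, "a"
'''

# skipinitialspace lets csv accept the space after each comma in the sample.
rows = csv.reader(io.StringIO(SAMPLE), skipinitialspace=True)
tokens = list(tokenize_rows(rows))
```

Because `tokenize_rows` only needs an iterable of rows, the same function could be fed rows from a different reader for the binary formats, provided the cells are converted to strings first.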

So my question is: is implementing a custom Lexer to solve this issue "supported", or should I expect it to break with any upgrade, with the lark development team having no plans to "stabilize" the lexer API for such a use case?

Describe alternatives you've considered

My alternative to this issue is building my own finite-state machine (basically a parser) on top of the parsing done with the csv/xlrd libraries.
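For comparison, the finite-state-machine alternative could look roughly like the sketch below. It assumes the layout from the sample above: two-cell metadata rows, a blank row, a one-cell table title, a header row, then data rows. The function name `parse_sheet` and the state names are hypothetical.

```python
import csv
import io
from typing import Iterable, Sequence


def parse_sheet(rows: Iterable[Sequence[str]]) -> dict:
    """Walk the rows once, switching state at each structural boundary."""
    metadata = {}
    title = None
    header = None
    records = []
    state = "METADATA"
    for row in rows:
        if not any(cell.strip() for cell in row):
            # The first blank row after some metadata ends the metadata block.
            if state == "METADATA" and metadata:
                state = "TITLE"
            continue
        if state == "METADATA":
            if len(row) != 2:
                raise ValueError(f"expected a key/value row, got {row!r}")
            metadata[row[0].rstrip(":")] = row[1]
        elif state == "TITLE":
            title = row[0]
            state = "HEADER"
        elif state == "HEADER":
            header = list(row)
            state = "DATA"
        else:
            # DATA: pair each cell with its column name.
            records.append(dict(zip(header, row)))
    return {"metadata": metadata, "title": title, "header": header, "records": records}


SAMPLE = '''"Name:", "Just some spreadsheet"
"Additional metadata:", "some more information about this sheet"

"Table"
"no", "value1", "value2"
1, "a", "b"
2, "c", "d"
'''

sheet = parse_sheet(csv.reader(io.StringIO(SAMPLE), skipinitialspace=True))
```

The state machine is easy to write for one fixed layout, but every variation in the layout needs another hand-coded branch, which is the part a grammar would otherwise describe declaratively.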


MegaIng commented 1 year ago

XML-based formats are not context-free. The ability to parse them using lark, which primarily supports CFGs, is and will always be quite limited. It's possible to account for some context-sensitive conditions, like Python-style indentation. To a limited degree, this also applies to XML. What you can try is to use a Postlexer. That interface is guaranteed to be stable.

The interface for the lexer is also quite stable. It hasn't seen much development because there haven't been many use cases. If your problem can't be solved with a Postlexer but could be solved with an improved Lexer interface (actually solved, not just maybe possible), I am sure we could consider updating it. But I doubt you will manage to coerce it into a working solution.