haskell / alex

A lexical analyser generator for Haskell
https://hackage.haskell.org/package/alex
BSD 3-Clause "New" or "Revised" License

feature request: support incremental/streaming lexing #67

Open cartazio opened 9 years ago

cartazio commented 9 years ago

In a number of application domains I need to handle streaming input incrementally, and having a streaming lexer/tokenization layer helps immensely with writing the layers on top.

If adding such capabilities to Alex is viable, I'd be very interested in trying to help add them (rather than having to reinvent a lot of the tooling that Alex provides).

Would this be a feature you'd be open to having added, @simonmar?

cartazio commented 9 years ago

Even better would be if Alex already tacitly supports this and I'm simply not understanding it yet :)

simonmar commented 9 years ago

I'd happily accept a patch, provided it doesn't compromise the speed of non-streaming lexers.

dcoutts commented 8 years ago

@cartazio In many cases this can already be made to work, though it requires knowing something about the maximum token length. For example, we have implemented a streaming JSON lexer using Alex. It relies on the fact that there is a largest possible token length (around 6 bytes, IIRC, for JSON), so when the lexer returns an error near the end of a chunk we can tell whether it simply ran out of input or hit a real failure: if it fails within 6 bytes of the end, we need to supply more input and try again, but if more input than that was still available, it's a real lex error.
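
To make the trick concrete, here is a minimal sketch of that chunk-boundary logic. The `ScanResult` type and the `scan` argument are hypothetical stand-ins for a generated Alex scanner, not Alex's actual API:

```haskell
module StreamingSketch where

import qualified Data.ByteString as BS

-- Hypothetical stand-in for a generated scanner over one buffer: the
-- tokens recognised plus the unconsumed suffix (non-empty exactly when
-- the DFA rejected at its first byte).
data ScanResult tok = ScanResult [tok] BS.ByteString

-- Largest token the grammar can produce (~6 bytes for the JSON lexer
-- described above).
maxTokenLen :: Int
maxTokenLen = 6

-- A failure within maxTokenLen bytes of the end of the buffered input
-- may just be a token split across a chunk boundary: pull another
-- chunk, prepend the leftover bytes, and rescan.  A failure anywhere
-- else is a genuine lexical error.
lexChunks :: (BS.ByteString -> ScanResult tok)  -- chunk scanner
          -> IO BS.ByteString                   -- next chunk; "" = EOF
          -> IO (Either String [tok])
lexChunks scan refill = go BS.empty []
  where
    go pending acc = do
      chunk <- refill
      let atEOF = BS.null chunk
          ScanResult toks rest = scan (pending <> chunk)
          acc' = acc ++ toks
      if BS.null rest
        then if atEOF
               then pure (Right acc')
               -- (a fully strict maximal-munch version would also hold
               -- back the last maxTokenLen bytes here before rescanning)
               else go BS.empty acc'
        else if BS.length rest <= maxTokenLen && not atEOF
               then go rest acc'  -- probably just out of input: rescan
               else pure (Left ("lexical error, "
                                ++ show (BS.length rest)
                                ++ " bytes before end of input"))
```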

simonmar commented 8 years ago

Interesting. I have many questions :) Where is your Alex lexer for JSON? Do you have a parser too? Is it faster than aeson?

cartazio commented 8 years ago

I have a properly streaming one I wrote at work a year ago that has way better memory behavior and incremental ingestion.

simonmar commented 8 years ago

I am very happy for you.

cartazio commented 8 years ago

I can see about cleaning it up and getting that onto Hackage if you want :)

iteratee commented 1 year ago

I got something working that is pull-based, and I'd be happy to try and get it cleaned up and merged.

You supply a monadic action that can be used to get additional data, and a maximum token length.

The lexer treats an empty result from the action as EOF. If there is a lex error, it checks for additional data and rescans if the remaining data is shorter than the user-supplied maximum token length. It also attempts to get more data at EOF.

There is probably room for improvement in differentiating errors that occur because of EOF from other errors, but this is a rough first cut.

It currently works only for ByteStrings, with code borrowed from the monad template. It could accommodate user state fairly readily, but I didn't need that, so it isn't written.
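
As a rough sketch of the shape this interface might take (all names here are illustrative, not the actual work-in-progress code), a pull-based driver could look like this:

```haskell
module PullLexerSketch where

import qualified Data.ByteString as BS

-- Hypothetical outcome of one scanning step over the current buffer.
data Step tok
  = Done                      -- buffer consumed cleanly
  | Token tok BS.ByteString   -- one token plus the remaining input
  | LexError BS.ByteString    -- DFA rejected; unconsumed suffix

-- Pull-based driver: 'next' is the user-supplied monadic action that
-- fetches more input (an empty result means EOF) and 'maxLen' is the
-- user-supplied maximum token length.  A lex error within maxLen bytes
-- of the end of the buffer triggers a refill and rescan instead of a
-- failure, as described above.
pullTokens :: Monad m
           => (BS.ByteString -> Step tok)  -- single-step scanner
           -> m BS.ByteString              -- refill action
           -> Int                          -- maximum token length
           -> m (Either String [tok])
pullTokens step next maxLen = loop BS.empty False
  where
    loop buf eof = case step buf of
      Done
        | eof       -> pure (Right [])
        | otherwise -> refillThen buf loop   -- buffer exhausted: pull more
      Token t rest
        -- near the end of the buffer the match might still grow with
        -- more input, so refill and rescan to preserve maximal munch
        | BS.length rest < maxLen && not eof -> refillThen buf loop
        | otherwise -> fmap (t :) <$> loop rest eof
      LexError rest
        | BS.length rest < maxLen && not eof -> refillThen buf loop
        | otherwise -> pure (Left "lexical error")
    refillThen buf k = do
      chunk <- next
      if BS.null chunk
        then k buf True               -- no more input: mark EOF
        else k (buf <> chunk) False   -- extend the buffer and rescan
```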

cartazio commented 1 year ago

Ooo, this sounds amazing!

cartazio commented 1 year ago

This repo has the parser I mentioned: https://github.com/cartazio/streaming-machine-json

andreasabel commented 1 year ago

@iteratee If this is fully backwards-compatible and does not affect performance of what we have now, a PR would be welcome!

@simonmar wrote:

I'd happily accept a patch, provided it doesn't compromise the speed of non-streaming lexers.