lloyd / yajl

A fast streaming JSON parsing library in C.
http://lloyd.github.com/yajl
ISC License

Make lexer use state rather than re-scanning previous text after break in input #202

Open nickd4 opened 6 years ago

nickd4 commented 6 years ago

Firstly, thanks for a really great JSON library. I've tried several others and this one really speaks to me, because I was looking for something clean, minimal and well thought out, unlike the others I tried.

I was concerned about possible quadratic behaviour: if I parse a file containing really gigantic strings (e.g. 1 gigabyte) and pass the input in fixed-size blocks (e.g. 16 megabytes), then every time I pass a new block, the lexer re-scans all the previous blocks received for that string.
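To put rough numbers on that concern, here is a minimal sketch of the cost model, not a measurement of yajl itself. The function names are hypothetical, and it assumes binary units (1 GiB string, 16 MiB blocks) and that each new block triggers a re-scan of every byte of the string received so far.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cost model: bytes examined when the lexer restarts from the
 * beginning of the string each time a new block arrives. */
uint64_t bytes_scanned_with_rescan(uint64_t string_len, uint64_t block_len)
{
    uint64_t scanned = 0;
    uint64_t received = 0;

    assert(block_len > 0);
    while (received < string_len) {
        uint64_t remaining = string_len - received;
        received += remaining < block_len ? remaining : block_len;
        scanned += received;    /* everything received so far is re-scanned */
    }
    return scanned;
}

/* With saved lexer state, each byte is examined exactly once. */
uint64_t bytes_scanned_with_state(uint64_t string_len)
{
    return string_len;
}
```

Under these assumptions the string arrives in 64 blocks, so the re-scanning lexer examines 1 + 2 + ... + 64 = 2080 blocks' worth of bytes (32.5 GiB) compared with 1 GiB for a lexer that resumes from saved state.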

The rest of the system should be able to handle this as far as I can see. (The use case is something like a browser cache, which would keep a map whose keys are filenames and whose string values are the file contents.)

So I decided to make an experimental change where the lexer uses state variables to pick up where it left off, instead of re-scanning the previous input. This worked out quite well. I've used a variant of "Duff's device" to achieve this without huge modifications to the existing code. In fact the logic flow is pretty much identical, except that I merged the handling of "true", "false" and "null", just because I could. I could clean up the UTF-8 string validation code slightly (see comments), but that would be an extra change.
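For readers unfamiliar with the technique, the following is a minimal sketch of a resumable lexer built on `case` labels placed inside the loop body, in the style of Duff's device. It is not the code from the proposed change and not yajl's API: the `lexer` struct, `tok_result` and `lex_string` are hypothetical names, and it handles only a quoted string with backslash escapes, with no UTF-8 validation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration only. */
typedef enum { TOK_NEED_MORE, TOK_STRING, TOK_ERROR } tok_result;

typedef struct {
    int state;          /* resume point; 0 means "at the start of a token" */
    size_t string_len;  /* raw bytes of string body consumed so far */
} lexer;

void lexer_init(lexer *lx)
{
    assert(lx != NULL);
    lx->state = 0;
    lx->string_len = 0;
}

/* Consumes bytes from buf[*pos .. len). If the input runs out mid-token,
 * records where it stopped and returns TOK_NEED_MORE; the next call, with
 * the next block, jumps straight back to that point via the switch. */
tok_result lex_string(lexer *lx, const unsigned char *buf, size_t len,
                      size_t *pos)
{
    unsigned char c;

    switch (lx->state) {
    case 0:
        if (*pos >= len)
            return TOK_NEED_MORE;           /* nothing consumed yet */
        if (buf[(*pos)++] != '"')
            return TOK_ERROR;
        lx->string_len = 0;
        for (;;) {
    case 1:                                 /* resume inside the string body */
            if (*pos >= len) { lx->state = 1; return TOK_NEED_MORE; }
            c = buf[(*pos)++];
            if (c == '"')
                break;                      /* closing quote: leave the loop */
            if (c == '\\') {
    case 2:                                 /* resume just after a backslash */
                if (*pos >= len) { lx->state = 2; return TOK_NEED_MORE; }
                (*pos)++;                   /* skip the escaped byte */
                lx->string_len++;
            }
            lx->string_len++;
        }
        lx->state = 0;                      /* ready for the next token */
        return TOK_STRING;
    }
    return TOK_ERROR;                       /* unknown saved state */
}
```

The control flow reads like an ordinary single-pass lexer; the only additions are the `case` labels and the saved `state`, which is why this style can be retrofitted without restructuring the existing logic. A caller keeps one `lexer` per stream and calls `lex_string` again with each new block, resetting `*pos` to 0 for that block.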

I think this would be an enhancement to the current lexer. What do others think about the idea? For my own case I am happy to use a private fork, but I thought it would be good to contribute it upstream if possible.

nickd4 commented 5 years ago

Note that I wasn't experienced with pull requests when I filed this, and I seem to have linked to my experimental repository, which contains many unrelated changes. To clarify, the proposed change affects just the one or two functions that implement the lexer behaviour discussed in the original post.

I'll isolate those changes and link the pull request later. It does not seem to matter at the moment, since lloyd is inactive, as many people have observed.

I am planning to take over maintainership of this project, by informally making a version available that contains (in general) the pull requests filed in this repo. However, it is a large project since there are many pull requests and I'm not sure that I can validate all of them, e.g. those relating to MSVC or embedded applications of the parser. So I plan to start the yajl maintainership project in a few months when I have a bit more time. If anyone is interested, please mail me: nick "AT" ndcode "DOT" org.