simdjson - Githubissues

ekg commented 5 years ago

Would it be possible to leverage simdjson to improve parsing speed?

Apologies if this is a duplicate.

nicowilliams commented 5 years ago

It'd be nice to do this, yes, but there are issues.

First, jq's JSON parser is incremental, and so can consume JSON texts in chunks, whereas simdjson seems to need the entire text (and does two passes on the text). That doesn't mean that we couldn't use simdjson's SIMD techniques when we have the full text, but generally those won't be large texts. I'm not sure how easily we could adapt SIMD techniques to work incrementally, though at first glance it seems feasible.
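To make "incremental" concrete, here is a minimal sketch of a scanner that carries its state across chunk boundaries, so it never needs the whole text in memory. It is neither jq's parser nor simdjson's code: the class name `BoundaryScanner` and its fields are made up for illustration, it only counts completed top-level objects and arrays, and it works one character at a time rather than with SIMD.

```python
class BoundaryScanner:
    """Hypothetical sketch: tracks nesting and string state across chunks."""

    def __init__(self):
        self.depth = 0          # current nesting depth of {} and []
        self.in_string = False  # True while inside a JSON string
        self.escaped = False    # True if the previous char was a backslash
        self.complete = 0       # top-level objects/arrays closed so far

    def feed(self, chunk: str) -> None:
        for ch in chunk:
            if self.in_string:
                if self.escaped:
                    self.escaped = False      # this char is escaped, skip it
                elif ch == "\\":
                    self.escaped = True
                elif ch == '"':
                    self.in_string = False
            elif ch == '"':
                self.in_string = True
            elif ch in "{[":
                self.depth += 1
            elif ch in "}]":
                self.depth -= 1
                if self.depth == 0:
                    self.complete += 1


scanner = BoundaryScanner()
# The first chunk ends in the middle of a string, right after a backslash.
for chunk in ['{"a": "x}\\', '"y", "b": [1, 2]}', ' [3]']:
    scanner.feed(chunk)
print(scanner.complete)
```

The point is that three small fields are enough to resume at any byte offset. A SIMD version would have to carry the same kind of state (inside a string or not, pending escape) from one block to the next.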

Second, jq supports multiple JSON texts in sequence. This is less tricky.

pkoppstein commented 5 years ago

> ... but there are issues.

Third, jq's parser is currently quite lenient -- perhaps too much so in some cases (e.g. allowing 00 for 0), but well within its rights in others (e.g. allowing duplicate keys within a JSON object).
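For comparison, here is how one other parser treats those two inputs. This uses Python's standard `json` module purely as an illustration and says nothing about what jq or simdjson do internally. The helper name `accepts` is made up.

```python
import json

# Duplicate keys are accepted; the default dict result keeps the last value.
doc = json.loads('{"a": 1, "a": 2}')

# object_pairs_hook receives every pair, so no duplicate is lost.
pairs = json.loads('{"a": 1, "a": 2}', object_pairs_hook=list)


def accepts(text: str) -> bool:
    """Hypothetical helper: True if Python's json module parses the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(doc, pairs, accepts("0"), accepts("00"))
```

Here `00` is rejected while duplicate keys are allowed, which is the split between "too lenient" and "within its rights" described above.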

Fourth, there is a plan to allow jq to preserve arbitrarily big integers (in the sense that jq . will not modify them), and it would be a shame to lose that functionality.
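A small sketch of why this matters, assuming a parser that stores every number as an IEEE 754 double. The sketch simulates that with Python's `parse_int=float` option, and the sample number is arbitrary.

```python
import json

big = "100000000000000000000000000001"

# Arbitrary-precision path: the integer survives a round trip unchanged.
as_int = json.loads(big)
round_trip_int = json.dumps(as_int)

# Double path (assumption: numbers stored as IEEE 754 doubles).
as_double = json.loads(big, parse_int=float)
round_trip_double = json.dumps(as_double)

print(round_trip_int == big, round_trip_double == big)
```

With the double representation, the identity filter cannot reproduce the input text, so a faster parser that only handles fixed-width numbers would rule out the planned behaviour.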

(If jq is to support stricter parsing, then for the sake of utility as well as backward compatibility, it should be possible to specify which mode is wanted.)

lemire commented 4 years ago

> well within its rights in others (e.g. allowing duplicate keys within a JSON object).

simdjson allows duplicate keys; the RFC does not require uniqueness: https://tools.ietf.org/html/rfc7159

> there is a plan to allow jq to preserve arbitrarily big integers

Though simdjson is limited to 64-bit numbers at this time, there are plans to extend the support... https://github.com/lemire/simdjson/issues/167

lemire commented 4 years ago

> First, jq's JSON parser is incremental, and so can consume JSON texts in chunks, whereas simdjson seems to need the entire text (and does two passes on the text). That doesn't mean that we couldn't use simdjson's SIMD techniques when we have the full text, but generally those won't be large texts. I'm not sure how easily we could adapt SIMD techniques to work incrementally, though at first glance it seems feasible.

This issue is being worked on in simdjson...

https://github.com/lemire/simdjson/issues/128

> Second, jq supports multiple JSON texts in sequence. This is less tricky.

This issue is also being worked on in simdjson...

https://github.com/lemire/simdjson/issues/188

cc @piotte13

lemire commented 4 years ago

simdjson now supports JSON documents in sequence. The JSON can be either line-separated or just in sequence with arbitrary white space between them. The input can be nearly infinite...
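To illustrate what "documents in sequence" means, here is a minimal sketch that pulls consecutive JSON values out of one buffer, separated by arbitrary whitespace. It uses Python's standard `json.JSONDecoder.raw_decode`, not simdjson's API, and the function name `iter_documents` is made up for the example.

```python
import json


def iter_documents(text: str):
    """Hypothetical helper: yield each JSON value found in sequence in text."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while True:
        # Skip whitespace between documents (newlines included).
        while pos < end and text[pos] in " \t\r\n":
            pos += 1
        if pos == end:
            return
        # raw_decode returns the value and the index just past it.
        value, pos = decoder.raw_decode(text, pos)
        yield value


for doc in iter_documents('{"a": 1}\n[2, 3]   "x" 4'):
    print(doc)
```

This version needs the whole buffer up front. Handling unbounded input would additionally require refilling the buffer between documents.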

The performance is quite good... (gigabytes per second)

https://github.com/lemire/simdjson/blob/master/doc/JsonStream.md

cc @piotte13

hitorilabs commented 1 year ago

Is anyone working on this already? (or finished some form of it)

liquidaty commented 1 year ago

A few thoughts I would put forth for consideration:

- can we come up with one or two particular, clear "painful" use case(s) we can target, including sample filters as well as sample inputs?
  - presumably, the vast majority of jq use cases are using a filter that is not simply '.', so this isn't the ideal filter use case to optimize for
  - optimizing for strict JSON might be very different from optimizing for non-strict JSON (or other formats, e.g. YAML / XML), and input size might also have an impact
  - ultimately, the initial goal would be to devise the specific benchmark tests to target for optimization
- we may find that the bottleneck in the most important use cases is not the JSON input parsing, but rather the filter calculations, or maybe the way that the parser hands off data to the filter calculator. That's not to say that optimizing the JSON input parsing is not valuable in its own right no matter what, but rather to recognize that we all have limited time and resources to contribute, and informing how we spend that can make a drastic difference in how much impact that contribution makes, and in deciding how this issue request should be prioritized relative to other performance-related ones such as #1857
- there are several mentions of multiple contemplated parser modes in terms of strictness and input format (e.g. XML/YAML) (https://github.com/jqlang/jq/issues/2643, https://github.com/jqlang/jq/issues/1892#issuecomment-485574572, https://github.com/jqlang/jq/pull/2548#issuecomment-1624185738, https://github.com/jqlang/jq/issues/2530, https://github.com/jqlang/jq/issues/467). Given this, the most expeditious approach might be to first evaluate these as a whole, identify commonalities, and then refactor the current jq parser to separate those commonalities, before then broadening the supported input formats
  - not sure if this would turn out to be the case, but as an illustrative example, perhaps that collective analysis would then suggest refactoring the parser to an event-driven structure, where the event handlers could be reused by all of the different input parsers. Knowing that and performing that refactoring ahead of time might improve performance and almost certainly will make the subsequent work of supporting other input formats much easier, and lower maintenance learning curves and technical debt in the process
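One way to approach the "where is the bottleneck" question is to time the parse phase and the filter phase separately. The sketch below does that with Python's standard library only. It is not a jq benchmark: `time_phases` is a made-up helper, and `transform` stands in for whatever filter is being measured.

```python
import json
import time


def time_phases(text, transform):
    """Hypothetical helper: return (result, parse_seconds, filter_seconds)."""
    t0 = time.perf_counter()
    value = json.loads(text)       # phase 1: parsing the input
    t1 = time.perf_counter()
    result = transform(value)      # phase 2: the "filter" work
    t2 = time.perf_counter()
    return result, t1 - t0, t2 - t1


sample = json.dumps([{"id": i, "tags": ["a", "b"]} for i in range(10000)])
result, parse_s, filter_s = time_phases(sample, lambda rows: sum(r["id"] for r in rows))
print(f"parse: {parse_s:.6f}s  filter: {filter_s:.6f}s")
```

If the filter phase dominates for the inputs and filters people actually care about, a faster parser would move the total very little.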

nicowilliams commented 1 year ago

> > First, jq's JSON parser is incremental, and so can consume JSON texts in chunks, whereas simdjson seems to need the entire text (and does two passes on the text). That doesn't mean that we couldn't use simdjson's SIMD techniques when we have the full text, but generally those won't be large texts. I'm not sure how easily we could adapt SIMD techniques to work incrementally, though at first glance it seems feasible.
>
> This issue is being worked on in simdjson...
>
> simdjson/simdjson#128

That's still open. Is it likely we can get that?

> > Second, jq supports multiple JSON texts in sequence. This is less tricky.
>
> This issue is also being worked on in simdjson...
>
> simdjson/simdjson#188

Sweet!

Maybe we could use simdjson for jv_parse() and jv_parse_sized(), which is to say "for fromjson". Now, we'll still have to allocate a bunch of objects, so that's not terribly fun, and I wonder what we could do about that. Now in the streaming JSON parser we only ever need to allocate an array for the paths to scalars and for the scalars, and we can reuse the path array, so we should be able to get more bang for this effort there.
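As a rough picture of the streaming form mentioned here, the sketch below emits one (path, scalar) event per leaf and reuses a single path list for all of them. It is a simplification and not jq's implementation: it walks an already-parsed Python value instead of parsing text, it omits the closing events that `jq --stream` also emits, and the name `stream_leaves` is made up.

```python
def stream_leaves(value, path=None):
    """Hypothetical sketch: yield (path, leaf) pairs, reusing one path list."""
    if path is None:
        path = []
    if isinstance(value, dict) and value:
        for key, child in value.items():
            path.append(key)
            yield from stream_leaves(child, path)
            path.pop()
    elif isinstance(value, list) and value:
        for index, child in enumerate(value):
            path.append(index)
            yield from stream_leaves(child, path)
            path.pop()
    else:
        # Scalars and empty containers are treated as leaves.
        yield path, value


# The path list is shared, so a consumer that keeps it must copy it.
events = [(list(p), v) for p, v in stream_leaves({"a": [1, {"b": None}], "c": "x"})]
print(events)
```

Because only the path list and the scalars are ever materialised, there is no tree of intermediate objects to allocate, which is the saving described above.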

nicowilliams commented 1 year ago

> A few thoughts I would put forth for consideration:

Yes, this will be a lot of profiling and playing with options to find something that rocks perf-wise and isn't too hard to use.

jqlang / jq

simdjson #1892