jqlang / jq

Command-line JSON processor
https://jqlang.github.io/jq/

simdjson #1892

Open ekg opened 5 years ago

ekg commented 5 years ago

Would it be possible to leverage simdjson to improve parsing speed?

https://github.com/lemire/simdjson

Apologies if this is a duplicate.

nicowilliams commented 5 years ago

It'd be nice to do this, yes, but there are issues.

First, jq's JSON parser is incremental, and so can consume JSON texts in chunks, whereas simdjson seems to need the entire text (and does two passes on the text). That doesn't mean that we couldn't use simdjson's SIMD techniques when we have the full text, but generally those won't be large texts. I'm not sure how easily we could adapt SIMD techniques to work incrementally, though at first glance it seems feasible.

Second, jq supports multiple JSON texts in sequence. This is less tricky.
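To make the two points above concrete, here is a minimal sketch of a reader that is fed text in chunks and returns each top-level JSON value as soon as it is complete. It is not jq's parser: `ChunkedJsonReader` and its `feed` method are hypothetical names, and instead of resuming mid-token it simply re-parses the buffered text with Python's `json.JSONDecoder.raw_decode`, so it only imitates the interface of an incremental parser.

```python
import json


class ChunkedJsonReader:
    """Hypothetical helper: accepts text chunks, returns completed JSON values."""

    _NUMBER_CHARS = "0123456789.eE+-"

    def __init__(self):
        self._decoder = json.JSONDecoder()
        self._buffer = ""

    def feed(self, chunk, final=False):
        """Append a chunk and return every top-level value completed so far."""
        self._buffer += chunk
        values = []
        while True:
            text = self._buffer.lstrip()
            self._buffer = text
            if not text:
                break
            try:
                value, end = self._decoder.raw_decode(text)
            except json.JSONDecodeError:
                if final:
                    raise  # no more input is coming, so the text is invalid
                break  # probably truncated: wait for the next chunk
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if is_number and not final:
                # A number is only finished once a non-number character follows it.
                if end == len(text) or text[end] in self._NUMBER_CHARS:
                    break
            values.append(value)
            self._buffer = text[end:]
        return values
```

Calling `feed(chunk)` for each chunk and then `feed("", final=True)` at end of input flushes a trailing number. Because invalid text is indistinguishable from truncated text until the final call, this sketch would buffer bad input indefinitely, which a real incremental parser avoids by tracking state.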

pkoppstein commented 5 years ago

> ... but there are issues.

Third, jq's parser is currently quite lenient -- perhaps too much so in some cases (e.g. allowing 00 for 0), but well within its rights in others (e.g. allowing duplicate keys within a JSON object).

Fourth, there is a plan to allow jq to preserve arbitrarily big integers (in the sense that jq . will not modify them), and it would be a shame to lose that functionality.
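As an illustration of why this matters (a sketch using Python's `json` module, not jq or simdjson), the same integer literal survives a round trip when it is kept as an arbitrary-precision integer but is altered when it is routed through an IEEE 754 double:

```python
import json

text = "100000000000000000001"  # 21 digits, larger than 2**64

# Python's json module parses integer literals into arbitrary-precision ints,
# so serializing the value again reproduces the original digits.
exact = json.loads(text)
round_tripped = json.dumps(exact)

# Forcing the same literal through a double rounds it to the nearest
# representable value, and the last digit is lost.
as_double = json.loads(text, parse_int=float)
```

Here `round_tripped` equals the input text, while `int(as_double)` is `100000000000000000000`.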

(If jq is to support stricter parsing, then for the sake of utility as well as backward compatibility, it should be possible to specify which mode is wanted.)
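A minimal sketch of what a selectable mode could look like, written against Python's `json` module rather than jq. The `parse` wrapper and its `strict` flag are hypothetical and do not correspond to any existing jq option. In lenient mode duplicate keys are accepted and the last value wins; in strict mode they raise an error. Python's parser rejects `00` in both modes, so that particular leniency cannot be shown here.

```python
import json


def reject_duplicate_keys(pairs):
    """object_pairs_hook that raises instead of silently keeping the last value."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result


def parse(text, strict=False):
    """Hypothetical wrapper: `strict` is an illustrative flag, not a jq option."""
    hook = reject_duplicate_keys if strict else None
    return json.loads(text, object_pairs_hook=hook)
```

For example, `parse('{"a": 1, "a": 2}')` returns `{"a": 2}`, whereas the same call with `strict=True` raises `ValueError`.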

lemire commented 4 years ago

> well within its rights in others (e.g. allowing duplicate keys within a JSON object).

simdjson allows duplicate keys; the RFC does not require uniqueness: https://tools.ietf.org/html/rfc7159

> there is a plan to allow jq to preserve arbitrarily big integers

Though simdjson is limited to 64-bit numbers at this time, there are plans to extend the support... https://github.com/lemire/simdjson/issues/167

lemire commented 4 years ago

> First, jq's JSON parser is incremental, and so can consume JSON texts in chunks, whereas simdjson seems to need the entire text (and does two passes on the text). That doesn't mean that we couldn't use simdjson's SIMD techniques when we have the full text, but generally those won't be large texts. I'm not sure how easily we could adapt SIMD techniques to work incrementally, though at first glance it seems feasible.

This issue is being worked on in simdjson...

https://github.com/lemire/simdjson/issues/128

> Second, jq supports multiple JSON texts in sequence. This is less tricky.

This issue is also being worked on in simdjson...

https://github.com/lemire/simdjson/issues/188

cc @piotte13

lemire commented 4 years ago

simdjson now supports JSON documents in sequence. The JSON can be either line-separated or just in sequence with arbitrary white space between them. The input can be nearly infinite...

The performance is quite good... (gigabytes per second)

https://github.com/lemire/simdjson/blob/master/doc/JsonStream.md

cc @piotte13

hitorilabs commented 1 year ago

Is anyone working on this already? (or finished some form of it)

liquidaty commented 1 year ago

A few thoughts I would put forth for consideration:

nicowilliams commented 1 year ago

> > First, jq's JSON parser is incremental, and so can consume JSON texts in chunks, whereas simdjson seems to need the entire text (and does two passes on the text). That doesn't mean that we couldn't use simdjson's SIMD techniques when we have the full text, but generally those won't be large texts. I'm not sure how easily we could adapt SIMD techniques to work incrementally, though at first glance it seems feasible.

> This issue is being worked on in simdjson...

> simdjson/simdjson#128

That's still open. Is it likely we can get that?

> > Second, jq supports multiple JSON texts in sequence. This is less tricky.

> This issue is also being worked on in simdjson...

> simdjson/simdjson#188

Sweet!

Maybe we could use simdjson for jv_parse() and jv_parse_sized(), which is to say, for "fromjson". We'd still have to allocate a bunch of objects, which isn't terribly fun, and I wonder what we could do about that. In the streaming JSON parser, by contrast, we only ever need to allocate an array for the paths to scalars, plus the scalars themselves, and we can reuse the path array, so we should be able to get more bang for this effort there.
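The path-reuse idea can be sketched as follows. This is only an illustration under stated assumptions: `stream_leaves` is a hypothetical function, it walks an already-parsed Python value rather than a token stream as a real streaming parser would, and it omits the closing events that `jq --stream` also emits. What it does show is that a single path list can be mutated in place, so the only per-event data is the scalar itself.

```python
def stream_leaves(value, path=None):
    """Yield (path, leaf) pairs, mutating one shared path list in place."""
    if path is None:
        path = []
    if isinstance(value, dict) and value:
        items = value.items()
    elif isinstance(value, list) and value:
        items = enumerate(value)
    else:
        # Scalars and empty containers are leaves.
        yield path, value
        return
    for key, child in items:
        path.append(key)  # extend the shared path for this child
        yield from stream_leaves(child, path)
        path.pop()  # restore it before moving to the next sibling


doc = {"a": [1, {"b": None}], "c": "x", "d": []}
# The path object is reused between events, so copy it if it must be kept.
events = [(list(path), leaf) for path, leaf in stream_leaves(doc)]
```

The trade-off is that a consumer must copy the path whenever it needs to keep it past the current event, since the same list object is yielded every time.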

nicowilliams commented 1 year ago

> A few thoughts I would put forth for consideration:

Yes, this will be a lot of profiling and playing with options to find something that rocks perf-wise and isn't too hard to use.