rsonquery / rsonpath

Blazing fast JSONPath query engine written in Rust.
https://rsonquery.github.io/rsonpath/
MIT License
50 stars 8 forks

Properly handle UTF-8 labels #117

Open V0ldek opened 1 year ago

V0ldek commented 1 year ago

Is your feature request related to a problem? Please describe.

The engine currently violates the JSON spec by not normalizing Unicode escapes. We do this for performance, since ordinal comparison is easy to SIMDify, but it is not correct.

For a simple example, the Unicode code point for the letter "a" is U+0061. These JSONs are equivalent under RFC 8259:

{"a":42}
{"\u0061":42}

Therefore the query $["a"] should in both cases match the value 42.

Quite sensibly, and indeed officially under the current JSONPath RFC Draft, the queries $["a"] and $["\u0061"] must also be equivalent. All four combinations of the two documents above and the two queries must yield the same result -- the value 42.
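To make the equivalence concrete, here is a minimal sketch of comparing two labels after decoding their escapes. It is not the engine's code: `unescape_json`, `read_hex4` and `labels_equal` are hypothetical names made up for illustration, and the sketch assumes each input is the body of a JSON string (the text between the quotes) already available as valid UTF-8.

```rust
/// Decode the escapes in the body of a JSON string (the text between the quotes).
/// Returns `None` if an escape is malformed. Hypothetical helper, not rsonpath API.
fn unescape_json(raw: &str) -> Option<String> {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next()? {
            '"' => out.push('"'),
            '\\' => out.push('\\'),
            '/' => out.push('/'),
            'b' => out.push('\u{0008}'),
            'f' => out.push('\u{000C}'),
            'n' => out.push('\n'),
            'r' => out.push('\r'),
            't' => out.push('\t'),
            'u' => {
                let hi = read_hex4(&mut chars)?;
                let code = if (0xD800..0xDC00).contains(&hi) {
                    // A high surrogate must be followed by an escaped low surrogate.
                    if chars.next()? != '\\' || chars.next()? != 'u' {
                        return None;
                    }
                    let lo = read_hex4(&mut chars)?;
                    if !(0xDC00..0xE000).contains(&lo) {
                        return None;
                    }
                    0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
                } else {
                    hi
                };
                // Rejects a lone low surrogate, which is not a valid `char`.
                out.push(char::from_u32(code)?);
            }
            _ => return None,
        }
    }
    Some(out)
}

/// Read exactly four hexadecimal digits and return their value.
fn read_hex4(chars: &mut std::str::Chars<'_>) -> Option<u32> {
    let mut value = 0;
    for _ in 0..4 {
        value = value * 16 + chars.next()?.to_digit(16)?;
    }
    Some(value)
}

/// Two labels are equal if their decoded forms are equal.
/// Malformed labels never match anything in this sketch.
fn labels_equal(a: &str, b: &str) -> bool {
    match (unescape_json(a), unescape_json(b)) {
        (Some(x), Some(y)) => x == y,
        _ => false,
    }
}
```

With this, `labels_equal("a", "\\u0061")` holds in both argument orders, which covers all four document/query combinations above. The sketch allocates a `String` per comparison, so it only illustrates the semantics and says nothing about how to do it fast.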

Describe the solution you'd like

The tradeoff here is important. We expect the performance difference to be staggering, especially since the head-skip optimisation is incompatible with this by design. We need a flag that toggles this behaviour. I propose we make it opt-in: we expect the vast majority of labels to be ASCII, and a user who wants to match Unicode can enable the flag.
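A rough sketch of what such a toggle could look like at the comparison site. Everything here is an assumption for illustration: `LabelMatching` and `label_matches` are hypothetical names, no actual flag name is implied, and the decoder is passed in as a closure so the sketch does not commit to any particular escape-handling routine.

```rust
/// Hypothetical switch between the current behaviour and spec-compliant matching.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum LabelMatching {
    /// Compare raw bytes only (fast, but ignores escapes).
    Bytewise,
    /// Compare labels after decoding escapes (opt-in).
    Normalized,
}

/// Compare a query label with a document label under the chosen mode.
/// `decode` turns a raw JSON string body into its decoded form, or `None` if malformed.
fn label_matches<F>(mode: LabelMatching, query: &str, doc: &str, decode: F) -> bool
where
    F: Fn(&str) -> Option<String>,
{
    match mode {
        LabelMatching::Bytewise => query.as_bytes() == doc.as_bytes(),
        LabelMatching::Normalized => {
            // Byte-identical labels match without any decoding.
            if query.as_bytes() == doc.as_bytes() {
                return true;
            }
            // Without a backslash on either side there is nothing to decode,
            // so differing bytes mean differing labels.
            if !query.contains('\\') && !doc.contains('\\') {
                return false;
            }
            // Slow path: decode both sides and compare the results.
            match (decode(query), decode(doc)) {
                (Some(q), Some(d)) => q == d,
                _ => false,
            }
        }
    }
}
```

In `Normalized` mode the all-ASCII, escape-free case still resolves with a byte comparison and only labels containing a backslash pay for decoding. This says nothing about head-skip, which would still need separate treatment.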

github-actions[bot] commented 1 year ago

Tagging @V0ldek for notifications

V0ldek commented 9 months ago

I'm rolling all escape handling under this umbrella. It seems there is no easy way to handle strings with escapes like \" or \n without introducing general Unicode support. I will start by introducing proper parsing of Unicode in query string values in #116, and then proper Unicode comparison should be easy to do in principle. As to how to do this fast...