simdjson / simdjson-java

A Java version of simdjson, a high-performance JSON parser utilizing SIMD instructions
Apache License 2.0

A trick for efficient multi-field parsing, with container values compressed to strings #59

Open heykirby opened 1 month ago

heykirby commented 1 month ago

Hello piotrrzysko. In many business scenarios, extracting multiple values from a JSON document needs only an array of paths as parameters, and each value is read once, as a string. The usage resembles Hive's json_tuple UDF, for example: parseValue(json, 'path1', 'path2', 'path3', ...) returns (value1, value2, value3, ...). We can therefore fetch the values directly from the JSON using the bitindex that simdjson builds. The advantage of this approach is that it avoids creating a Java object for every JSON node, which avoids the associated garbage collection overhead, and it can prune the parts of the document that no requested path touches, which improves performance further.

A simple example. The JSON value is: {"field1":{"field2":"value2","field3":3},"field4":["value4","value5"]} The paths we want are: [$.field1.field2, $.field4.0, $.field4]. ($.field4 compresses the list to a string; $.field4.0 gets the first element of the list.) The expected return value is [value2, value4, '["value4","value5"]'].

Solution implementation. First, we convert the path array to a tree. A blue node means we want the value at that path; if that node is a container type (for example $.field4), we compress it to a string.

image
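As a rough illustration of the path tree, here is a minimal sketch in plain Java. The class `PathNode`, its `resultIndex` field, and the `build` method are hypothetical names chosen for this example and are not part of simdjson-java. A `resultIndex` of 0 or greater plays the role of a "blue" node and records which slot of the result array the value belongs to.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: PathNode is not a simdjson-java class.
final class PathNode {
    // Child nodes keyed by field name, or by array position written as a string ("0", "1", ...).
    final Map<String, PathNode> children = new LinkedHashMap<>();
    // Slot in the result array if this node's value is requested ("blue"), otherwise -1.
    int resultIndex = -1;

    // Builds a single tree from paths such as "$.field1.field2" or "$.field4.0".
    static PathNode build(String... paths) {
        PathNode root = new PathNode();
        for (int i = 0; i < paths.length; i++) {
            PathNode node = root;
            String[] parts = paths[i].trim().split("\\.");
            // parts[0] is the "$" root marker, so start at 1.
            for (int p = 1; p < parts.length; p++) {
                node = node.children.computeIfAbsent(parts[p], k -> new PathNode());
            }
            node.resultIndex = i;
        }
        return root;
    }

    public static void main(String[] args) {
        PathNode root = build("$.field1.field2", "$.field4.0", "$.field4");
        // field4 is requested itself and also has a requested child "0".
        System.out.println(root.children.get("field4").resultIndex);
        System.out.println(root.children.get("field4").children.get("0").resultIndex);
    }
}
```

Note that $.field4 and $.field4.0 share the field4 node, so one tree serves all requested paths.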

Second, loop through the bitindex and fill values into the path tree. In the example above, the bitindex is [0, 1, 9, 10, 11, 19, 20, 28, 29, 37, 38, 39, 40, 41, 49, 50, 51, 59, 60, 68, 69]. In the picture below, I marked each position in the bitindex with '#'. The bitindex marks the start and end of each map and list ({ } [ ]), the start of each map key and each value, the ':' between a key and its value, and the ',' between elements.

image
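To make the positions concrete, the sketch below recomputes the same list with a plain scalar scan. This is only a stand-in for illustration: simdjson derives these positions with SIMD instructions, and the class `StructuralIndexSketch` and its `index` method are hypothetical names, not library API. The scan assumes well-formed JSON.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scalar stand-in for the SIMD indexing stage; not the simdjson-java implementation.
final class StructuralIndexSketch {
    static int[] index(String json) {
        List<Integer> out = new ArrayList<>();
        boolean inString = false; // currently between an opening and a closing quote
        boolean inScalar = false; // currently inside a number, true, false, or null
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++;              // skip the escaped character
                } else if (c == '"') {
                    inString = false; // the closing quote is not recorded
                }
                continue;
            }
            if (c == '"') {
                out.add(i);           // the opening quote marks the start of a key or string value
                inString = true;
                inScalar = false;
            } else if ("{}[]:,".indexOf(c) >= 0) {
                out.add(i);           // structural character
                inScalar = false;
            } else if (Character.isWhitespace(c)) {
                inScalar = false;
            } else if (!inScalar) {
                out.add(i);           // first character of a number or literal
                inScalar = true;
            }
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        String json = "{\"field1\":{\"field2\":\"value2\",\"field3\":3},\"field4\":[\"value4\",\"value5\"]}";
        System.out.println(java.util.Arrays.toString(index(json)));
    }
}
```

For the example document this yields the 21 positions listed above; position 38 is the start of the number 3, and closing quotes (8, 18, 27, ...) are absent.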

For the example above, we loop through the bitindex and fill in the value of each node of the JSON path tree step by step. The following is a simple flow chart:

[7 images: flow chart]
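The walk over the bitindex can be sketched as a small recursive routine. This is one possible reading of the idea under stated assumptions, not the code from the proposal: `MultiPathSketch`, `Node`, `buildTree`, and `extract` are hypothetical names, the bitindex is passed in as a plain int array, the input is assumed to be well-formed JSON, string escapes are returned undecoded, and a JSON null (like a missing path) is returned as a Java null.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from simdjson-java.
final class MultiPathSketch {
    static final class Node {
        final Map<String, Node> children = new HashMap<>();
        int resultIndex = -1; // >= 0 marks a requested ("blue") node
    }

    // Builds the path tree; parts[0] of each path is the "$" root marker.
    static Node buildTree(String... paths) {
        Node root = new Node();
        for (int i = 0; i < paths.length; i++) {
            Node n = root;
            String[] parts = paths[i].trim().split("\\.");
            for (int p = 1; p < parts.length; p++) {
                n = n.children.computeIfAbsent(parts[p], k -> new Node());
            }
            n.resultIndex = i;
        }
        return root;
    }

    private final String json;
    private final int[] idx;    // bitindex: offsets of structural characters and value starts
    private final String[] out; // one slot per requested path; stays null if the path is absent
    private int pos;            // cursor into idx

    private MultiPathSketch(String json, int[] idx, int resultCount) {
        this.json = json;
        this.idx = idx;
        this.out = new String[resultCount];
    }

    static String[] extract(String json, int[] idx, Node tree, int resultCount) {
        MultiPathSketch s = new MultiPathSketch(json, idx, resultCount);
        s.value(tree);
        return s.out;
    }

    // Consumes one JSON value starting at idx[pos]; records it if the node is requested.
    private void value(Node node) {
        int start = idx[pos];
        char c = json.charAt(start);
        boolean container = c == '{' || c == '[';
        if (node == null || node.children.isEmpty() || !container) {
            skip(); // pruning: nothing is requested below this value
        } else if (c == '{') {
            object(node);
        } else {
            array(node);
        }
        if (node != null && node.resultIndex >= 0) {
            out[node.resultIndex] = text(start, container);
        }
    }

    // Advances past one value without inspecting keys, by counting bracket depth.
    private void skip() {
        int depth = 0;
        do {
            char c = json.charAt(idx[pos++]);
            if (c == '{' || c == '[') depth++;
            else if (c == '}' || c == ']') depth--;
        } while (depth > 0);
    }

    private void object(Node node) {
        pos++; // consume '{'
        while (json.charAt(idx[pos]) != '}') {
            if (json.charAt(idx[pos]) == ',') { pos++; continue; }
            // The key's closing quote is the last '"' before the ':' at idx[pos + 1].
            int keyEnd = json.lastIndexOf('"', idx[pos + 1] - 1);
            String key = json.substring(idx[pos] + 1, keyEnd);
            pos += 2; // consume the key and the ':'
            value(node.children.get(key));
        }
        pos++; // consume '}'
    }

    private void array(Node node) {
        pos++; // consume '['
        int i = 0;
        while (json.charAt(idx[pos]) != ']') {
            if (json.charAt(idx[pos]) == ',') { pos++; continue; }
            value(node.children.get(Integer.toString(i++)));
        }
        pos++; // consume ']'
    }

    // Returns the value just consumed: containers verbatim, strings without their quotes.
    private String text(int start, boolean container) {
        if (container) return json.substring(start, idx[pos - 1] + 1);
        int end = pos < idx.length ? idx[pos] : json.length();
        String raw = json.substring(start, end).trim();
        if (raw.equals("null")) return null;
        return raw.charAt(0) == '"' ? raw.substring(1, raw.length() - 1) : raw;
    }
}
```

Calling `extract` with the example document, its bitindex, and a tree built from $.field1.field2, $.field4.0, and $.field4 fills the three slots with value2, value4, and ["value4","value5"]. No per-node objects are created for the document, and field3 and the second list element are skipped without being materialized.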

Because the JSON path tree can be reused while parsing multiple JSON documents, there is no need to build a JSON node tree for each document; a single tree for the required paths is enough, which improves parsing performance. The approach also supports compressing container-type JSON data to a string, parses multiple values at the same time, and handles the case where the JSON value on a path is null.

heykirby commented 1 month ago

Benchmark, simdjson2 vs. Jackson: performance is more than 6 times higher. The improvement is particularly obvious when only a few JSON fields are parsed. reference

simdjson: 95.936 ops jackson: 15.833

arouel commented 1 month ago

> Benchmark, simdjson2 vs. Jackson: performance is more than 6 times higher. The improvement is particularly obvious when only a few JSON fields are parsed. reference
>
> simdjson: 95.936 ops jackson: 15.833

I think the benchmark is flawed due to the current setup, see https://github.com/simdjson/simdjson-java/pull/60#discussion_r1797734246 for details.