Open ZhaiMo15 opened 1 month ago
Would you mind sharing the hardware and Java version you used to run these benchmarks? The output from lscpu and java -version should be sufficient. Also, it'd be great if you could share the benchmarks on a branch (or as a GitHub repo). This would help me run them myself and profile the code.
Also, what do you mean by:
What's more, I changed the depth of test, the default is 3 and I changed it to 2, as below: ?
Both attached snippets look the same.
Consider the JSON as a tree; the depth I meant is the depth of a node in that tree. For example,
// depth 1
{
a: 1,
b: 2
}
// depth 2
{
a: {
b: 1,
c: 2
}
}
The V2 benchmark reads "statuses.user.screen_name" (depth 3), while V3 reads "statuses.id" (depth 2).
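As a small illustration of the depth convention described above, the depth of a dotted path can be counted as its number of segments. This is only a sketch; JsonPathDepth and depth are hypothetical names and are not part of simdjson-java.

```java
// Hypothetical helper: counts the segments of a dotted JSON path,
// which matches the "depth" convention used in this thread.
final class JsonPathDepth {
    private JsonPathDepth() {
    }

    static int depth(String dottedPath) {
        if (dottedPath.isEmpty()) {
            return 0;
        }
        // Each dot-separated segment is one level of nesting.
        return dottedPath.split("\\.").length;
    }
}
```

With this convention, JsonPathDepth.depth("statuses.user.screen_name") gives 3 and JsonPathDepth.depth("statuses.id") gives 2.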
Both attached snippets look the same.
I'm sorry, I pasted the wrong snippets.
I re-ran the benchmarks in a more stable environment, and the results are similar. My hardware:
And the Java version:
The benchmarks I ran are in: https://github.com/ZhaiMo15/simdjson-java/tree/performanceTest
Thanks for the update. I ran your benchmarks on my desktop using two versions of Java (18 and 21) and got the following results:
Benchmark (fileName) Mode Cnt Score Error Units
TwitterBenchmarkV2.JsonValueSimdjson /twitter.json thrpt 5 1272.131 ± 77.097 ops/s
TwitterBenchmarkV2.recordJackson /twitter.json thrpt 5 577.435 ± 14.169 ops/s
TwitterBenchmarkV2.recordSimdjson /twitter.json thrpt 5 1963.116 ± 44.895 ops/s
TwitterBenchmarkV3.JsonValueSimdjson /twitter.json thrpt 5 1198.611 ± 63.058 ops/s
TwitterBenchmarkV3.recordJackson /twitter.json thrpt 5 748.944 ± 6.980 ops/s
TwitterBenchmarkV3.recordSimdjson /twitter.json thrpt 5 1976.876 ± 241.907 ops/s
TwitterBenchmarkV2.JsonValueSimdjson /twitter50.json thrpt 5 2435.258 ± 194.432 ops/s
TwitterBenchmarkV2.recordJackson /twitter50.json thrpt 5 1125.062 ± 2.477 ops/s
TwitterBenchmarkV2.recordSimdjson /twitter50.json thrpt 5 3787.566 ± 51.370 ops/s
TwitterBenchmarkV3.JsonValueSimdjson /twitter50.json thrpt 5 2429.744 ± 153.990 ops/s
TwitterBenchmarkV3.recordJackson /twitter50.json thrpt 5 1427.326 ± 6.044 ops/s
TwitterBenchmarkV3.recordSimdjson /twitter50.json thrpt 5 3883.897 ± 34.568 ops/s
TwitterBenchmarkV2.JsonValueSimdjson /twitter1.json thrpt 5 175359.845 ± 1522.490 ops/s
TwitterBenchmarkV2.recordJackson /twitter1.json thrpt 5 69225.339 ± 628.405 ops/s
TwitterBenchmarkV2.recordSimdjson /twitter1.json thrpt 5 83146.423 ± 654.256 ops/s
TwitterBenchmarkV3.JsonValueSimdjson /twitter1.json thrpt 5 181205.520 ± 269.705 ops/s
TwitterBenchmarkV3.recordJackson /twitter1.json thrpt 5 96834.366 ± 268.782 ops/s
TwitterBenchmarkV3.recordSimdjson /twitter1.json thrpt 5 102403.918 ± 625.203 ops/s
Benchmark (fileName) Mode Cnt Score Error Units
TwitterBenchmarkV2.JsonValueSimdjson /twitter.json thrpt 5 1152.353 ± 258.483 ops/s
TwitterBenchmarkV2.recordJackson /twitter.json thrpt 5 565.322 ± 17.139 ops/s
TwitterBenchmarkV2.recordSimdjson /twitter.json thrpt 5 1759.905 ± 41.637 ops/s
TwitterBenchmarkV3.JsonValueSimdjson /twitter.json thrpt 5 1122.889 ± 316.755 ops/s
TwitterBenchmarkV3.recordJackson /twitter.json thrpt 5 716.739 ± 6.083 ops/s
TwitterBenchmarkV3.recordSimdjson /twitter.json thrpt 5 1824.830 ± 19.503 ops/s
TwitterBenchmarkV2.JsonValueSimdjson /twitter50.json thrpt 5 2338.579 ± 58.298 ops/s
TwitterBenchmarkV2.recordJackson /twitter50.json thrpt 5 1094.865 ± 2.898 ops/s
TwitterBenchmarkV2.recordSimdjson /twitter50.json thrpt 5 3333.782 ± 55.180 ops/s
TwitterBenchmarkV3.JsonValueSimdjson /twitter50.json thrpt 5 2243.374 ± 9.085 ops/s
TwitterBenchmarkV3.recordJackson /twitter50.json thrpt 5 1419.183 ± 17.172 ops/s
TwitterBenchmarkV3.recordSimdjson /twitter50.json thrpt 5 3475.266 ± 132.370 ops/s
TwitterBenchmarkV2.JsonValueSimdjson /twitter1.json thrpt 5 164348.941 ± 9737.617 ops/s
TwitterBenchmarkV2.recordJackson /twitter1.json thrpt 5 68143.603 ± 257.766 ops/s
TwitterBenchmarkV2.recordSimdjson /twitter1.json thrpt 5 81290.062 ± 1121.079 ops/s
TwitterBenchmarkV3.JsonValueSimdjson /twitter1.json thrpt 5 170856.785 ± 1570.948 ops/s
TwitterBenchmarkV3.recordJackson /twitter1.json thrpt 5 94823.108 ± 256.773 ops/s
TwitterBenchmarkV3.recordSimdjson /twitter1.json thrpt 5 105020.635 ± 1673.175 ops/s
Java 21 generally performs better, which is not surprising. However, the problem you've described is still valid. Let me go through your questions to make sure we are on the same page:
The performance of Simdjson is not always faster than jackson? The shorter the JSON, the worse of Simdjson? If my JSON is short, I'd better not use simdjson?
In this question, you are referring to the poor performance of the schema-based parser in comparison to Jackson for shorter JSONs. Overall, in all the above cases, some version of simdjson beats Jackson.
DOM Parser vs Schema-Based Parser, the performance also depends on size of JSON? My first thought is Schema-Based is faster.
This question is strictly related to the previous one, as the schema-based parser again performs unexpectedly poorly.
If my interpretation of the benchmark results and your concerns is correct, then we can narrow down the problem to the performance of the schema-based parser. I've profiled it while running TwitterBenchmarkV2 for twitter1.json. This is what I got:
The flamegraph clearly shows that Java reflection is the culprit of the poor performance. Simdjson clears its internal cache of resolved classes every time the parse method is called. After commenting out the line in which the cache is cleared, I got the following results:
Benchmark (fileName) Mode Cnt Score Error Units
TwitterBenchmarkV2.JsonValueSimdjson /twitter1.json thrpt 5 175317.529 ± 2304.856 ops/s
TwitterBenchmarkV2.JsonValueSimdjson:·async /twitter1.json thrpt NaN ---
TwitterBenchmarkV2.recordJackson /twitter1.json thrpt 5 67562.644 ± 1220.035 ops/s
TwitterBenchmarkV2.recordJackson:·async /twitter1.json thrpt NaN ---
TwitterBenchmarkV2.recordSimdjson /twitter1.json thrpt 5 265694.392 ± 11438.461 ops/s
TwitterBenchmarkV2.recordSimdjson:·async /twitter1.json thrpt NaN ---
I suggest that you comment out this line and rerun the benchmarks in your environment. This is not the ultimate solution, of course. I just want to make sure that we are on the same page and that you don't see any other unexpected disparities between the parsers in terms of performance.
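To illustrate why clearing the cache on every parse is costly, here is a minimal sketch. It is not simdjson-java's actual ClassResolver: ResolverSketch, Tweet, and lookupsFor are hypothetical names, and getDeclaredFields() merely stands in for whatever reflective work the real resolver performs.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a class-metadata cache; NOT simdjson-java's real ClassResolver.
final class ResolverSketch {
    // Hypothetical target type for schema-based parsing.
    static final class Tweet {
        long id;
        String text;
    }

    private final Map<Class<?>, Field[]> cache = new HashMap<>();
    private int reflectiveLookups = 0;

    // Returns cached metadata, falling back to reflection on a cache miss.
    Field[] resolve(Class<?> type) {
        Field[] cached = cache.get(type);
        if (cached == null) {
            reflectiveLookups++; // count every time reflection is actually used
            cached = type.getDeclaredFields();
            cache.put(type, cached);
        }
        return cached;
    }

    // Mirrors the effect of clearing the cache at the start of a parse.
    void reset() {
        cache.clear();
    }

    // Simulates `parses` parse calls for one schema and reports how many
    // of them had to go through reflection.
    static int lookupsFor(int parses, boolean resetEachParse) {
        ResolverSketch resolver = new ResolverSketch();
        for (int i = 0; i < parses; i++) {
            if (resetEachParse) {
                resolver.reset();
            }
            resolver.resolve(Tweet.class);
        }
        return resolver.reflectiveLookups;
    }
}
```

In this sketch, lookupsFor(100, true) performs the reflective lookup on all 100 calls, while lookupsFor(100, false) performs it only once. For a tiny document, that repeated lookup can dominate the actual parsing work.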
I suggest that you comment out this line and rerun the benchmarks in your environment.
We are now on the same page! I reran the benchmarks with classResolver.reset() commented out; the results are similar to yours, and so is the flamegraph.
This is not the ultimate solution, of course.
BTW, would it be a problem to comment out classResolver.reset() for good? Maybe the cache could be cleared in a different place?
If we comment it out without changing anything else, there can be a problem because the cache will grow without bound. In some cases, this is acceptable because the cache can contain at most as many entries as there are classes in the application.
I'll need to think about it. Perhaps the cache needs to have a more sophisticated eviction policy (LRU?).
Additionally, consider a real situation instead of a benchmark: I believe the classResolver is different when parsing different JSON. So comment out the classResolver.reset() can increase the score of benchmark, but if each JSON would be parsed once, simdjson is still not good enough when JSON is short?
Perhaps the cache needs to have a more sophisticated eviction policy (LRU?).
Perhaps the cache needs a fixed, configurable size, regardless of the eviction policy?
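For reference, a fixed-size cache with LRU eviction can be sketched on top of the JDK's LinkedHashMap. This is only one possible shape for the idea discussed here, not what simdjson-java does; BoundedLruCache and maxEntries are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical fixed-size LRU cache built on java.util.LinkedHashMap.
// Not thread-safe: callers would need external synchronization.
final class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    BoundedLruCache(int maxEntries) {
        // accessOrder = true makes iteration order least-recently-accessed first.
        super(16, 0.75f, true);
        if (maxEntries <= 0) {
            throw new IllegalArgumentException("maxEntries must be positive");
        }
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts
    // the least recently accessed entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

A cache created with new BoundedLruCache<Class<?>, Object>(2) keeps at most two entries: after inserting a third class, the entry that was accessed least recently is dropped. The size limit bounds memory use, while entries for frequently used schemas survive between parse calls.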
So comment out the classResolver.reset() can increase the score of benchmark, but if each JSON would be parsed once, simdjson is still not good enough when JSON is short?
This is the penalty for using reflection, so you would need to pay it regardless of which parser you use if the parser relies on it. I have an idea on how to replace reflection with an alternative approach, but it requires more research.
Also, I wonder how realistic this problem is. How many different schemas can you have in your system?
I have an idea on how to replace reflection with an alternative approach, but it requires more research.
Great! Looking forward to it.
I wonder how realistic this problem is.
TBH, I've no idea. IDK the JSON size distribution in real world. In my case, I do have some small JSON.
TBH, I've no idea. IDK the JSON size distribution in real world. In my case, I do have some small JSON.
This is understandable, but I wasn't asking about the size of the JSON. Your question was:
if each JSON would be parsed once, simdjson is still not good enough when JSON is short?
So, I assumed that you have a situation where there are many different types of JSON schemas, and every time you parse a JSON, you use a different schema. In such a scenario, the cache is useless because the parser cannot reuse the classes that are already in the cache. However, I cannot think of a scenario where you have, say, a million different schemas and use a different one each time. Is this your case?
I get your point. The number of different schemas should not be large. However,
<T> T walkDocument(byte[] padded, int len, Class<T> expectedType) {
jsonIterator.init(padded, len);
classResolver.reset();
I think in the current code, the cache is cleared even if expectedType is always the same?
if each JSON would be parsed once, simdjson is still not good enough when JSON is short?
By JSON I mean JSON strings (i.e. byte[] buffer). Say I have lots of JSON strings (for example, 100 different ones) but few schemas (for example, only 1). When I parse the 100 JSON strings in sequence, the cache would still be cleared 100 times even though the schema is the same? And this would incur the performance penalty above.
Yes, exactly. This is why I mentioned that replacing this simple eviction policy with a more sophisticated one could be a good improvement. The new policy would keep already resolved schemas between parse method calls.
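One JDK facility that keeps per-class data alive across calls is java.lang.ClassValue, which computes a value lazily for each class and then returns the cached value on later lookups. The sketch below is a different technique from the eviction policies discussed above and is not something the thread proposes; SchemaCache, Point, and fieldsOf are hypothetical names, and getDeclaredFields() again stands in for the real schema resolution.

```java
import java.lang.reflect.Field;

// Hypothetical per-class metadata cache backed by java.lang.ClassValue.
final class SchemaCache {
    // Hypothetical target type used only for illustration.
    static final class Point {
        int x;
        int y;
    }

    private static final ClassValue<Field[]> FIELDS = new ClassValue<Field[]>() {
        // Invoked lazily on the first lookup for each class.
        @Override
        protected Field[] computeValue(Class<?> type) {
            return type.getDeclaredFields();
        }
    };

    private SchemaCache() {
    }

    // Returns the cached array; callers must treat it as read-only
    // because the same instance is shared between lookups.
    static Field[] fieldsOf(Class<?> type) {
        return FIELDS.get(type);
    }
}
```

Repeated calls to SchemaCache.fieldsOf(SchemaCache.Point.class) return the same array instance, so the reflective lookup is not repeated for every parse. The trade-off is that there is no explicit size limit: an entry lives as long as its class does.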
Just for interest's sake,
The flamegraph clearly shows that Java reflection is the culprit of the poor performance.
Jackson also uses reflection, right? Why doesn't it show the same poor performance?
But, is there any benchmark in which Jackson beats simdjson? I thought that for smaller inputs they are on a par.
Before commenting out, yes (https://github.com/ZhaiMo15/simdjson-java/blob/performanceTest/src/jmh/java/org/performance/TwitterBenchmarkV4.java); otherwise no. So Jackson also uses reflection and a cache, but its cache is better than the current simdjson one (as we discussed above)?
I thought that for smaller inputs they are on a par.
Do you mean that for smaller inputs, parsing accounts for only a small share of the time compared to reflection, so even though simdjson speeds up parsing, the total performance changes only slightly?
The flamegraph of Jackson:
IDK which part is the reflection.
I've been testing the performance of Simdjson recently. The basic test is similar to default test, using twitter.json, as below:
The difference is that I shrank the size of statuses: the default is 101, and I tested 101, 51, and 1. The results are below: size 101:
size 51:
size 1:
What's more, I changed the depth of test, the default is 3 and I changed it to 2, as below:
The results are: size 101:
size 51:
size 1:
Here are my questions: