Closed: plokhotnyuk closed this issue 6 years ago.
Here's the same issue on Jackson:
https://github.com/FasterXML/jackson-databind/issues/37
According to that, it was fixed by something that JDK7/JDK8 did to its HashMap implementation, but it doesn't say what. We're not using a JDK HashMap; we're using Scala's immutable HashMap, which I think is a hash trie keyed on the hash code.
So potentially we could switch to a JDK HashMap and this would fix it, but I suspect that would adversely impact performance when generating a JSON AST, since the JDK HashMap doesn't support efficient persistent add operations: for every field added we would have to copy the entire map.
Other than that, we'll have to research what the JDK did to fix this, and either port that to Scala's map or implement our own trie map that provides the fix. If it can be done simply by providing a custom hash code function, then we might also be able to achieve it by wrapping the keys, though this would generate more garbage.
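A minimal sketch of the key-wrapping idea (the wrapper name and the choice of `MurmurHash3` with a fixed seed are illustrative assumptions, not a concrete proposal):

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical sketch: wrap String keys so the trie map hashes them with a
// salted function instead of String.hashCode, defeating precomputed collisions.
// A fixed seed is used here for illustration; a hardened version would use a
// per-process random seed.
final case class WrappedKey(underlying: String) {
  override def hashCode: Int = MurmurHash3.stringHash(underlying, seed = 0x1234abcd)
}

// Usage: a map keyed by WrappedKey no longer piles colliding String.hashCode
// values into one trie node; equality still delegates to the underlying string.
val m = Map(WrappedKey("Aa") -> 1, WrappedKey("BB") -> 2)
```

The cost is one wrapper allocation per key, which is the extra garbage mentioned above.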
Ok, here's the supposed JDK fix:
http://mail.openjdk.java.net/pipermail/core-libs-dev/2012-May/010238.html
The thing is, I had a look at the fixes proposed, and I can't find any evidence that those fixes exist in the JDK today.
Here is how it works in the latest JDK: https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/f5386f0d6182d877e847ed15d197a98bf8002523/src/java.base/share/classes/java/util/HashMap.java#L145-L148
Ah, found it, Doug Lea proposed an alternative and much more elegant solution:
http://mail.openjdk.java.net/pipermail/core-libs-dev/2012-June/010425.html
This is in the JDK today and documented in the `HashMap` javadocs:

> Note that using many keys with the same `hashCode()` is a sure way to slow down performance of any hash table. To ameliorate impact, when keys are `Comparable`, this class may use comparison order among keys to help break ties.
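To make the collision problem concrete (this example is mine, not from the thread): the JVM's `String.hashCode` has well-known collisions that compose under concatenation, so an attacker can manufacture arbitrarily many keys with one hash code.

```scala
// "Aa" and "BB" collide: 'A' * 31 + 'a' == 'B' * 31 + 'B' == 2112.
assert("Aa".hashCode == "BB".hashCode)

// Because String.hashCode is polynomial, concatenations of equal-length
// colliding segments also collide, giving 2^n keys with one hash code.
val colliding: Seq[String] = for {
  a <- Seq("Aa", "BB")
  b <- Seq("Aa", "BB")
} yield a + b // "AaAa", "AaBB", "BBAa", "BBBB"
assert(colliding.map(_.hashCode).distinct.size == 1)
```

All four keys land in the same bucket; without `Comparable` tie-breaking, lookups in that bucket degrade to a linear scan.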
Also from the `HashMap` source code:
```java
/*
 * Implementation notes.
 *
 * This map usually acts as a binned (bucketed) hash table, but
 * when bins get too large, they are transformed into bins of
 * TreeNodes, each structured similarly to those in
 * java.util.TreeMap. Most methods try to use normal bins, but
 * relay to TreeNode methods when applicable (simply by checking
 * instanceof a node). Bins of TreeNodes may be traversed and
 * used like any others, but additionally support faster lookup
 * when overpopulated. However, since the vast majority of bins in
 * normal use are not overpopulated, checking for existence of
 * tree bins may be delayed in the course of table methods.
 *
 * Tree bins (i.e., bins whose elements are all TreeNodes) are
 * ordered primarily by hashCode, but in the case of ties, if two
 * elements are of the same "class C implements Comparable<C>",
 * type then their compareTo method is used for ordering. (We
 * conservatively check generic types via reflection to validate
 * this -- see method comparableClassFor). The added complexity
 * of tree bins is worthwhile in providing worst-case O(log n)
 * operations when keys either have distinct hashes or are
 * orderable. Thus, performance degrades gracefully under
 * accidental or malicious usages in which hashCode() methods
 * return values that are poorly distributed, as well as those in
 * which many keys share a hashCode, so long as they are also
 * Comparable. (If neither of these apply, we may waste about a
 * factor of two in time and space compared to taking no
 * precautions. But the only known cases stem from poor user
 * programming practices that are already so slow that this makes
 * little difference.)
 *
 * Because TreeNodes are about twice the size of regular nodes, we
 * use them only when bins contain enough nodes to warrant use
 * (see TREEIFY_THRESHOLD). And when they become too small (due to
 * removal or resizing) they are converted back to plain bins.
 */
```
I think we can solve this in a somewhat similar manner in play-json's `JsObject`. We have an advantage here in that we are always working with `String` keys, which are always `Comparable`, so there is no need for any reflection hacks. We can also make some assumptions about the numbers that will be encountered in the real world: most JSON structures won't have more than 100 fields. So my suggestion is that we use the default Scala `scala.collection.immutable.Map` (a hash trie) for objects with 100 fields or fewer, and revert to a `scala.collection.immutable.TreeMap` for objects with more than 100 fields. This won't achieve the expected behaviour above (sub-linear decrease of throughput); rather it will achieve a sub-n-log-n decrease, but that's a lot more graceful than sub-quadratic, and it's also the same performance degradation that is seen in the JDK HashMap implementation. It does mean that in the rare cases where JSON objects do have 100+ fields, they will also see a performance degradation (perhaps we can tweak that number), but it's only logarithmic operations instead of constant-time operations, which is not too bad.
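A sketch of that switch (the object name, method, and threshold constant are mine, not play-json API):

```scala
import scala.collection.immutable.TreeMap

// Hypothetical sketch of the suggested JsObject backing store: a hash trie
// for small objects, falling back to a TreeMap (ordered by the always-
// Comparable String keys) once the field count passes a threshold.
object FieldMap {
  val TreeMapThreshold = 100

  def apply(fields: Seq[(String, Any)]): Map[String, Any] =
    if (fields.size <= TreeMapThreshold) fields.toMap // hash-trie lookups
    else TreeMap(fields: _*)                          // O(log n) compare-based lookups
}
```

With the `TreeMap` fallback, keys crafted to share a hash code cost O(log n) per lookup instead of O(n), matching the graceful degradation of the JDK's treeified bins.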
@plokhotnyuk sorry I was in the middle of researching and writing my comment when you posted yours :)
Issue raised to get this fixed in Scala:
**Play JSON Version (2.5.x / etc)**

2.7.0-M1

**API (Scala / Java / Neither / Both)**

Scala 2.12.7

**Operating System (Ubuntu 15.10 / MacOS 10.10 / Windows 10)**

Ubuntu 16.04

**JDK (Oracle 1.8.0_72, OpenJDK 1.8.x, Azul Zing)**

Oracle JDK 11

**Library Dependencies**

none

**Expected Behavior**

Sub-linear decrease of throughput as the number of JSON object fields (with keys that have the same hash code) increases.

**Actual Behavior**

Sub-quadratic decrease of throughput as the number of JSON object fields (with keys that have the same hash code) increases.
On contemporary CPUs, parsing such a JSON object (with a sequence of 100000 fields like below, ~1.4Mb) can take more than 100 seconds:
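One way such colliding field names can be generated (a sketch based on the "Aa"/"BB" `String.hashCode` collision; the function name is mine, and this is not necessarily the benchmark's actual key generator):

```scala
// Build n distinct keys that all share one String.hashCode by concatenating
// the colliding segments "Aa" and "BB": hash(s + t) depends only on the
// segment hashes and lengths, which are equal for the two segments.
def collidingKeys(n: Int): Seq[String] = {
  // enough two-character segments so that 2^segments >= n
  val segments = math.max(1, math.ceil(math.log(n) / math.log(2)).toInt)
  (1 to segments).foldLeft(Seq("")) { (acc, _) =>
    acc.flatMap(prefix => Seq(prefix + "Aa", prefix + "BB"))
  }.take(n)
}

// A JSON object with 100000 fields whose keys all collide:
val json = collidingKeys(100000).map(k => s""""$k":1""").mkString("{", ",", "}")
```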
Below are the results of the benchmark (where `size` is the number of such fields) for different JSON parsers, including Play-JSON:

**Reproducible Test Case**

To run the benchmarks on your JDK:

1. Install the latest version of `sbt` and/or ensure that it is already installed properly.
2. Clone the `jsoniter-scala` repo.
3. Enter the cloned directory and check out the specific commit.
4. Run the benchmarks using a path parameter to your JDK.