kucci / guava-libraries

Automatically exported from code.google.com/p/guava-libraries
Apache License 2.0

LinkedHashMultimap performance degrades badly for pathological input {a=[a], b=[b], ...} #458

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. Compile and run the following code:

//////////////////////////////////////////////////////////////////////////
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;

import com.google.common.collect.LinkedHashMultimap;
import com.google.common.collect.LinkedListMultimap;
import com.google.common.collect.Lists;
import com.google.common.collect.Maps;
import com.google.common.collect.Multimap;
import com.google.common.collect.Sets;

public class MultimapBenchmark {

   public static void main(String[] args) throws Exception {

      // build data
      List<String> data = Lists.newArrayList();
      for (int i = 0; i < 20000; i++) {
         data.add( Integer.toString( i ) );
      }

      // pour the data into a linked list multimap
      {
         long start = System.currentTimeMillis();
         Multimap<String, String> mmap = LinkedListMultimap.create();
         for ( String key : data ) {
            mmap.put( key, key );
         }
         long end = System.currentTimeMillis();
         System.out.println("LinkedListMultiMap took "+(end-start)+" ms");
      }

      {
         // pour the data into a linked hash multimap
         long start = System.currentTimeMillis();
         Multimap<String, String> mmap = LinkedHashMultimap.create();
         for ( String key : data ) {
            mmap.put( key, key );
         }
         long end = System.currentTimeMillis();
         System.out.println("LinkedHashMultiMap took "+(end-start)+" ms");
      }

      {
         // pour the data into "hand made" multimap
         long start = System.currentTimeMillis();
         HashMap<String, HashSet<String>> mmap = Maps.newLinkedHashMap();
         for ( String key : data ) {
            HashSet<String> set = Sets.newHashSet();
            set.add(key);
            mmap.put( key, set );
         }
         long end = System.currentTimeMillis();
         System.out.println("Manual multimap took "+(end-start)+" ms");
      }

   }
}
/////////////////////////////////////////////////////////////////////////

The output from this code when I run it is:

LinkedListMultiMap took 62 ms
LinkedHashMultiMap took 18128 ms
Manual multimap took 15 ms

What is the expected output? What do you see instead?

It seems to me that the LinkedHashMultimap performance is so far out of line 
with the other two similar techniques that there must be a bug in it 
somewhere.  If not, perhaps there should be a warning in the javadoc that 
adding elements to a LinkedHashMultimap is particularly slow.

What version of the product are you using? On what operating system?

This happens in guava-r07, on Windows 7.

Original issue reported on code.google.com by cban...@gmail.com on 27 Oct 2010 at 5:46

GoogleCodeExporter commented 9 years ago
Please give a close read to

http://code.google.com/p/caliper/wiki/JavaMicrobenchmarks
and
http://code.google.com/p/caliper/wiki/JavaMicrobenchmarkReviewCriteria

Also take a close look at the contractual promises of LinkedHashMultimap and 
LinkedListMultimap; if you want to compare apples to apples, I'd use 
HashMultimap.

Then give each implementation the chance to perform at its best by using the 
sizing parameters each makes available to you.

This should be a decent start.  For now I don't see anything actionable here.

Original comment by kevinb@google.com on 27 Oct 2010 at 7:58

GoogleCodeExporter commented 9 years ago
I think you are being too hasty in declaring this a non-issue.

Side effects from HotSpot, the JIT compiler, etc. are negligible in the 
benchmark that I have provided.  The LinkedHashMultimap issue that I have 
described caused a real-life problem in our software product, which we have 
recently fixed by switching over to a LinkedListMultimap.  (In our software, 
that *single* change reduced the running time of a certain process from ~8 
minutes to ~3 seconds.)

The code that I've submitted here is just a simplified version of our original 
problem, pared down to a simpler form to make it easier for you to read.

It is, of course, up to you to decide whether this is an "actionable" 
situation.  As our own bug is now solved, my only stake here stems from trying 
to support the guava project--which is why I feel I should reiterate those 
numbers:

To insert 20000 already-constructed objects into a LinkedHashMultimap takes 
orders of magnitude (292 times, to be exact) longer than inserting those exact 
same elements into a LinkedListMultimap (or, for that matter, into pretty much 
any other kind of non-sorted collection).

In absolute terms, on my brand-new quad-core development machine, it takes over 
*18 seconds* to add just 20000 elements!

-------------------

As for your suggestions (use HashMultimap, and provide sizing parameters), I'm 
afraid these things do not make much difference.

////////////////////////////////////////////

      {
         // pour the data into a hash multimap
         long start = System.currentTimeMillis();
         Multimap<String, String> mmap = HashMultimap.create();
         for ( String key : data ) {
            mmap.put( key, key );
         }
         long end = System.currentTimeMillis();
         System.out.println("HashMultiMap took " + (end-start) + " ms");
      }

      {
         // pour the data into a linked hash multimap
         long start = System.currentTimeMillis();
         Multimap<String, String> mmap =
            LinkedHashMultimap.create( data.size(), 1 );
         for ( String key : data ) {
            mmap.put( key, key );
         }
         long end = System.currentTimeMillis();
         System.out.println("LinkedHashMultiMap took " + (end-start) + " ms");
      }

////////////////////////////////////////////////////////////

Running the code with those changes gave these (very similar) results:

HashMultiMap took 47 ms
LinkedHashMultiMap took 17722 ms
Manual Set took 15 ms

Original comment by cban...@gmail.com on 27 Oct 2010 at 8:48

GoogleCodeExporter commented 9 years ago
I'm getting similar results.  I'm aware that there are potential 
microbenchmarking issues, but this is very consistent, and the difference 
spans orders of magnitude, so I agree that the discrepancy is likely not due 
to microbenchmarking issues.

ArrayListMultimap took 14 ms
HashMultimap took 26 ms
LinkedHashMultimap took 8086 ms
LinkedListMultimap took 36 ms
TreeMultimap took 38 ms

A discovery I've made is that this behavior only appears when the key is equal 
to the value.  If I change things up so it says mmap.put(key, "a"), this 
behavior no longer occurs.  In this case, we get:

ArrayListMultimap took 13 ms
HashMultimap took 12 ms
LinkedHashMultimap took 34 ms
LinkedListMultimap took 24 ms
TreeMultimap took 39 ms

Actually, it seems like it's enough that the value has the same (or similar) 
hash code.  Using the following example:

    private static class MyObject {
        private final int hashCode;

        public MyObject(int hashCode) { this.hashCode = hashCode; }

        @Override public boolean equals(Object obj) { return this == obj; }

        @Override public int hashCode() { return hashCode; }
    }

If I change the code to be 
  mmap.put( key, new MyObject(key.hashCode()) );
I get similar results to the first case:

ArrayListMultimap took 13 ms
HashMultimap took 37 ms
LinkedHashMultimap took 8280 ms
LinkedListMultimap took 39 ms

Whereas if I change the MyObject hashCode function to be
    public int hashCode() { return hashCode * 2; }
or
    public int hashCode() { return 0; }
I get around 70 ms for both cases.

If I change it to either of
    public int hashCode() { return hashCode + 1; }
    public int hashCode() { return hashCode - 1; }
it comes to around 3000 ms.  So I'm guessing there are still some hash 
collisions or something similar happening in this case, but not to the same extent.

Original comment by matthew.j.justin on 27 Oct 2010 at 10:44

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Looking at the r07 source, it looks like each entry is stored in an 
AbstractMapEntry, whose hashcode is the XOR of the key's and value's hash codes.

So, if both hash codes are the same, all the entry hashcodes are zero, and the 
"transient Collection<Map.Entry<K, V>> linkedEntries;" in LinkedHashMultimap 
blows up.

Original comment by steven.m.reed on 27 Oct 2010 at 10:53
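
To make the collapse concrete, here is a minimal sketch (standalone demo code, 
not from the thread, using Guava's Maps.immutableEntry, whose hashCode follows 
the Map.Entry spec):

import java.util.Map;
import com.google.common.collect.Maps;

public class XorCollapseDemo {
   public static void main(String[] args) {
      // Map.Entry.hashCode() is specified as keyHash ^ valueHash, so any
      // entry whose key and value share a hash code collapses to 0.
      Map.Entry<String, String> e1 = Maps.immutableEntry("a", "a");
      Map.Entry<String, String> e2 = Maps.immutableEntry("b", "b");
      System.out.println(e1.hashCode());   // 0
      System.out.println(e2.hashCode());   // 0
      // Every such entry lands in the same hash bucket, so the entry set
      // inside LinkedHashMultimap degrades from O(1) to O(n) per insert.
   }
}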

GoogleCodeExporter commented 9 years ago
Great catch Steven, that's it. 

Original comment by jim.andreou on 27 Oct 2010 at 11:41

GoogleCodeExporter commented 9 years ago
So AbstractMapEntry should probably be computing the equivalent of 
Arrays.hashCode(new Object[] { getKey(), getValue() }). 

Original comment by jim.andreou on 27 Oct 2010 at 11:55
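
For comparison, a quick sketch of how an order-sensitive array hash keeps such 
entries apart (illustration only, not the actual Guava code):

import java.util.Arrays;

public class OrderSensitiveHashDemo {
   public static void main(String[] args) {
      // Arrays.hashCode computes 31 * (31 * 1 + h(first)) + h(second),
      // so entries whose key and value hashes match no longer all
      // collapse to the same value.
      int h1 = Arrays.hashCode(new Object[] { "a", "a" });
      int h2 = Arrays.hashCode(new Object[] { "b", "b" });
      System.out.println(h1 == h2);   // false
   }
}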

GoogleCodeExporter commented 9 years ago
You guys work fast. :)

Original comment by cban...@gmail.com on 27 Oct 2010 at 11:55

GoogleCodeExporter commented 9 years ago
Except it can't compute the equivalent of Arrays.hashCode(), since it 
implements Map.Entry, which specifies how the hashCode must be computed.

http://download.oracle.com/javase/6/docs/api/java/util/Map.Entry.html#hashCode()

Original comment by matthew.j.justin on 28 Oct 2010 at 12:30

GoogleCodeExporter commented 9 years ago
Map.Entry.hashCode() shouldn't prescribe a specific implementation like that in 
its public contract.  

- it encourages users of this method to rely on that implementation (or work 
around it), which breaks encapsulation.

- if the specified implementation is flawed (as it is in this case), then 
implementers of the method are forced to choose between breaking the prescribed 
implementation or living with the flaw.

Personally, I'd go with breaking the implementation.  Object.hashCode() 
describes the needed contract adequately, and the additional comment in 
Map.Entry.hashCode() should just be treated as a suggested implementation.

Instead of:

// current implementation
return (e.getKey() == null ? 0 : e.getKey().hashCode()) ^
       (e.getValue() == null ? 0 : e.getValue().hashCode());

use something like:

final Object key = e.getKey();
final Object val = e.getValue();
if (key != null && val != null && key.equals(val)) {
    // edge case that causes issue 458
    return key.hashCode();
} else {
    // original implementation
    return (key == null ? 0 : key.hashCode()) ^
           (val == null ? 0 : val.hashCode());
}

Original comment by cban...@gmail.com on 28 Oct 2010 at 5:24

GoogleCodeExporter commented 9 years ago
1. I do appreciate that you are bringing this to our attention.

2. I apologize for sounding dismissive of the entire issue. But note that I 
didn't close it or anything, I just said I didn't feel it was actionable yet.

3. The report was framed as being about the performance of the class in general 
while being obviously based on a very pathological data set. This tends to put 
the maintainer of that class on the defensive. The most important reason I 
cited those links was the part about using representative, real-world data. In 
the course of doing that, it would have been discovered that the performance is 
only this bad for this exact (odd) situation (a->a, b->b, etc.), and then the 
problem could have been described more appropriately.

4. The report also seemed to compare LinkedHashMultimap to a Map<K, Set<V>> 
without acknowledging the fact that LHM does have to do more work, after all, 
to maintain the order of entries(). (Which explains the memory usage 
differences reported later.)

5. Yes, I understand that the sheer scale of the difference between results 
suggests that *something* bad is probably going on, despite the fact that the 
benchmark is fundamentally invalid in several ways.  I get that.  However, I 
still just plain don't like taking any action based on the results of an 
invalid benchmark; it's as simple as that. The hardest part of getting it right 
is that you still have to grab caliper from subversion and type 'ant', which I 
admit is annoying.

6. Of course, a well-made benchmark will still show us this particular problem, 
but it'll also show us how good a job we do of fixing it (if we can fix it), 
which an invalid benchmark really can't.

I hope my hasty response at least makes a little more sense now, if still not 
exactly a very good response.

Original comment by kevinb@google.com on 28 Oct 2010 at 7:17

GoogleCodeExporter commented 9 years ago
About Map.Entry's hashcode specification:

Map.Entry should specify its hashcode algorithm, given that there is a 
specific contract for all collection framework interfaces ensuring that the 
equivalence (both equals and hashcode) of all implementations of a given 
collection interface is based on contents alone and not on the implementation. 
Note that Map specifies its hashcode algorithm in terms of the hashcodes of its 
entries... as such, Map.Entry implementations must also all have consistent 
hashcodes.

However, whether it should or not and whether that algorithm is the best choice 
or not are irrelevant, because it does specify that algorithm. And one thing 
that sets Guava apart from other libraries attempting to serve similar 
functionality (e.g. Commons Collections) is that it consistently adheres to 
the contracts of the interfaces it implements.

Original comment by cgdec...@gmail.com on 28 Oct 2010 at 8:20

GoogleCodeExporter commented 9 years ago
I can see your point that Map.Entry.hashCode() probably does need a more 
specific contract than Object.hashCode() (i.e. "equality must be defined only 
in terms of equality of the key and value", etc.)   But unless the Collections 
Framework requires that all different Map.Entry implementations produce the 
exact *same* hashcode values, there's still no need to specify a concrete 
implementation in the method comment.

Of course, as far as patching guava is concerned, I have no say at all.  
The fix I mentioned above is just a suggestion. :)

Original comment by cban...@gmail.com on 28 Oct 2010 at 8:43

GoogleCodeExporter commented 9 years ago
It does require that all different Map.Entry implementations produce the exact 
same hashcode values (for the same key and value)! That's why it specifies the 
algorithm.

In the collections framework, the equality of two collections should be based 
on the elements contained in each, regardless of their implementations. So any 
two Maps that contain the same key/value pairs must be equal, regardless of the 
Map implementation. Additionally, any two Objects that are equal must have the 
same hashCode(). Given that, all implementations of Map must use the same 
algorithm for equals() and the same algorithm for hashCode(). Those algorithms 
are both based on the entries of the map. Therefore, all implementations of 
Map.Entry must use the same hashCode() algorithm or else two Maps with the same 
key/value pairs may be equal() but have different hashCode()s.

Far from encouraging users to rely on the implementation they're using, I 
believe this actually helps users ignore the implementation they're using. If 
the hashcodes of two Maps with the same elements were not guaranteed to be 
equal, changing the implementation you're using somewhere could break your code.

Original comment by cgdec...@gmail.com on 28 Oct 2010 at 9:45
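
A small sketch of the constraint being described, using only JDK maps:

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapHashContractDemo {
   public static void main(String[] args) {
      Map<String, Integer> hashMap = new HashMap<String, Integer>();
      Map<String, Integer> treeMap = new TreeMap<String, Integer>();
      hashMap.put("x", 1);
      treeMap.put("x", 1);
      // Same contents means equal, regardless of implementation...
      System.out.println(hashMap.equals(treeMap));                  // true
      // ...and equal objects must have equal hash codes, which only works
      // if every Map.Entry implementation uses the same hashCode() formula.
      System.out.println(hashMap.hashCode() == treeMap.hashCode()); // true
   }
}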

GoogleCodeExporter commented 9 years ago
>In the collections framework, the equality of two collections should be based 
on the elements contained in each, regardless of their implementations.

Ahh, right...and since hashCode() must always agree with equals(), then the 
hashCode() method for collections is similarly constrained.

Interesting, thanks for the explanation.  That makes sense.

Original comment by cban...@gmail.com on 28 Oct 2010 at 10:03

GoogleCodeExporter commented 9 years ago
Do you think the following two key/value pairs are equal: a=>a and b=>b? And 
so their hashCodes should be equal?

If you really look closely at the Map specification, it says: 

"Implementations are free to implement optimizations whereby the equals 
invocation is avoided, for example, by first comparing the hash codes of the 
two keys. (The Object.hashCode() specification guarantees that two objects with 
unequal hash codes cannot be equal.) More generally, implementations of the 
various Collections Framework interfaces are free to take advantage of the 
specified behavior of underlying Object methods wherever the implementor deems 
it appropriate.".

And the Object.hashCode say :

"It is not required that if two objects are unequal according to the 
equals(java.lang.Object) method, then calling the hashCode method on each of 
the two objects must produce distinct integer results. However, the programmer 
should be aware that producing distinct integer results for unequal objects may 
improve the performance of hashtables."

Put simply:

if A == B => #A == #B (if A equals B, then A's hashCode must equal B's hashCode)
if A != B => #A == #B or #A != #B (if A does not equal B, we don't care about 
the hashCode equality)

and if #A == #B => the equality of A and B remains unknown

So to me, there is a bug in the Map interface documentation, and we should 
stick with the Object.equals and Object.hashCode contracts first!

Original comment by amer...@gmail.com on 28 Oct 2010 at 10:46

GoogleCodeExporter commented 9 years ago
Finally, a=>a and b=>b are not equal (are they?), so we don't care about their 
hashCodes.

Original comment by amer...@gmail.com on 28 Oct 2010 at 10:57

GoogleCodeExporter commented 9 years ago
The only viable "solution" I see is having broken (with a """better""" 
hashcode) entries internally, and copying them to correct entries on the way 
out. But this would decrease performance for everyone (to say nothing of the 
internal implementation effort and ugliness), just to serve a very special 
kind of entry. So if anything is to be done, it should at most be appending a 
small warning to the javadocs, in fine print, if at all (do we document every 
kind of data that would create degeneracies?).

Original comment by jim.andreou on 28 Oct 2010 at 11:05
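
A rough sketch of what that idea would entail (hypothetical code; the names 
InternalEntry and toMapEntry are invented for illustration):

import java.util.Arrays;
import java.util.Map;

import com.google.common.base.Objects;
import com.google.common.collect.Maps;

// Hypothetical: keep entries internally under a collision-resistant,
// order-sensitive hash, and copy them to spec-compliant entries on the
// way out of the collection.
final class InternalEntry<K, V> {
   final K key;
   final V value;

   InternalEntry(K key, V value) {
      this.key = key;
      this.value = value;
   }

   @Override public int hashCode() {
      // Order-sensitive, so key=value entries no longer all hash to 0.
      return Arrays.hashCode(new Object[] { key, value });
   }

   @Override public boolean equals(Object o) {
      if (!(o instanceof InternalEntry)) {
         return false;
      }
      InternalEntry<?, ?> other = (InternalEntry<?, ?>) o;
      return Objects.equal(key, other.key) && Objects.equal(value, other.value);
   }

   // Copied out on the way to the caller; Maps.immutableEntry obeys the
   // Map.Entry hashCode contract.
   Map.Entry<K, V> toMapEntry() {
      return Maps.immutableEntry(key, value);
   }
}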

GoogleCodeExporter commented 9 years ago
And to be funny, it is Josh Bloch who wrote the Map interface and who also 
wrote, in the very good Effective Java:

"Many classes in the Java platform libraries, such as String, Integer, and 
Date, specify the exact value returned by their hashCode method as a function 
of the instance value. This is generally not a good idea, as it severely 
limits your ability to improve the hash function in future releases. If you 
leave the details of a hash function unspecified and a flaw is found in it, 
you can fix the hash function in the next release without fear of breaking 
compatibility with clients who depend on the exact values returned by the 
hash function."

Map.Entry.hashCode calculation should not have been specified ;-)

Original comment by amer...@gmail.com on 28 Oct 2010 at 11:09

GoogleCodeExporter commented 9 years ago
> Map.Entry.hashCode calculation should not have been specified ;-)

Actually, cgdecker's earlier comments explain why an exception has to be made 
for Collection objects.

As far as fixing this degenerate case goes, the only thing I can suggest now 
is perhaps finding a different way to store the data internally that sidesteps 
the need to use hashing at all.

Original comment by cban...@gmail.com on 28 Oct 2010 at 11:15

GoogleCodeExporter commented 9 years ago
Here's an idea:

We can't change the specification of Map.Entry.hashCode(), but implementations 
are free to use that hashcode however they want. Many impls "smear" the 
hashcode to generate better hashing.

If a Map.Entry's hashCode is 0, the implementation can try just taking the 
hashcode of the key alone.

So, something like comment #10, only external to the Map.Entry, and using the 
hashcodes.
// This is used instead of the normal hashCode, internal to a particular
// implementation.
private static int hash(Map.Entry<?, ?> entry)
{
    Object key = entry.getKey();
    Object val = entry.getValue();
    int keyHash = (key == null) ? 0 : key.hashCode();
    int valHash = (val == null) ? 0 : val.hashCode();
    // When the two hashes are equal (the issue 458 case), fall back to the
    // key's hash alone instead of XORing down to 0.
    return (keyHash == valHash) ? keyHash : (keyHash ^ valHash);
}

I shouldn't have to prove that this will return the same int value for any two 
Map.Entry instances that are equals().

Original comment by ray.j.gr...@gmail.com on 29 Oct 2010 at 8:41
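
For the pathological input above, that helper falls back to the key's own, 
well-distributed hash. A quick hypothetical check (reusing the hash() method 
from the previous comment, and Guava's Maps.immutableEntry):

Map.Entry<String, String> e1 = Maps.immutableEntry("a", "a");
Map.Entry<String, String> e2 = Maps.immutableEntry("b", "b");
// The spec-mandated hashCode() is 0 for both, but the internal helper
// returns the key's hash instead, so the entries spread across buckets.
System.out.println(hash(e1));   // 97  ("a".hashCode())
System.out.println(hash(e2));   // 98  ("b".hashCode())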

GoogleCodeExporter commented 9 years ago
I"m not inclined to modify LinkedHashedMulitmap to improve performance in this 
degenerate case.

However, there's a workaround that provides most of the functionality of a 
LinkedHashedMulitmap. Call Multimaps.newSetMultimap(LinkedHashMap, 
Supplier<LinkedHashSet>). The keys and values for a given key will follow the 
insertion ordering, though the entries() ordering will differ. Insertions will 
be roughly twice as fast as a LinkedHashMultimap and the key=value performance 
will be fine.

Original comment by jl...@google.com on 5 Nov 2010 at 7:25
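
In code, that workaround looks roughly like this (a sketch against the Guava 
API of that era):

import java.util.Collection;
import java.util.Map;
import java.util.Set;

import com.google.common.base.Supplier;
import com.google.common.collect.Maps;
import com.google.common.collect.Multimaps;
import com.google.common.collect.SetMultimap;
import com.google.common.collect.Sets;

public class WorkaroundDemo {
   public static void main(String[] args) {
      // The LinkedHashMap preserves key insertion order and each
      // LinkedHashSet preserves value insertion order per key; only the
      // global entries() ordering differs from LinkedHashMultimap.
      Map<String, Collection<String>> backing = Maps.newLinkedHashMap();
      SetMultimap<String, String> mmap = Multimaps.newSetMultimap(backing,
            new Supplier<Set<String>>() {
               @Override public Set<String> get() {
                  return Sets.newLinkedHashSet();
               }
            });

      // The pathological key=value input no longer degrades.
      for (int i = 0; i < 20000; i++) {
         String s = Integer.toString(i);
         mmap.put(s, s);
      }
      System.out.println(mmap.size());   // 20000
   }
}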

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 18 Jul 2011 at 3:39

GoogleCodeExporter commented 9 years ago
FYI, I think this issue has actually been resolved by 
http://code.google.com/p/guava-libraries/source/detail?r=c888c6bbb226998980f3fe84587f1f0606578c88 
which avoids keeping a HashSet of the entries themselves.

Original comment by wasserman.louis on 11 Jun 2012 at 8:40

GoogleCodeExporter commented 9 years ago
This issue has been migrated to GitHub.

It can be found at https://github.com/google/guava/issues/<id>

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:15

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 3 Nov 2014 at 9:09