Closed: GoogleCodeExporter closed this issue 9 years ago
Please give a close read to
http://code.google.com/p/caliper/wiki/JavaMicrobenchmarks
and
http://code.google.com/p/caliper/wiki/JavaMicrobenchmarkReviewCriteria
Also take a close look at the contractual promises of LinkedHashMultimap and
LinkedListMultimap; if you want to compare apples to apples, I'd use
HashMultimap.
Then give each implementation the chance to perform at its best by using the
sizing parameters each makes available to you.
This should be a decent start. For now I don't see anything actionable here.
Original comment by kevinb@google.com
on 27 Oct 2010 at 7:58
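The sizing parameters mentioned above are the two-argument create factories on the multimap classes (expected keys, expected values per key). A minimal sketch, assuming Guava is on the classpath; the numbers are illustrative:

```java
import com.google.common.collect.HashMultimap;

public class SizedMultimapDemo {
    public static void main(String[] args) {
        // Pre-size for roughly 20000 keys with 1 value each, so the
        // multimap does not need to grow while data is poured in.
        HashMultimap<String, String> mmap = HashMultimap.create(20000, 1);
        mmap.put("a", "a");
        System.out.println(mmap.size());
    }
}
```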
I think you are being too hasty in declaring this a non-issue.
Side effects from HotSpot, the JIT compiler, etc. are negligible in the benchmark that
I have provided. The LinkedHashMultimap issue that I have described caused a
real-life problem in our software product, which we recently fixed by
switching over to a LinkedListMultimap. (In our software, that *single*
change reduced the running time of a certain process from ~8 minutes to ~3
seconds.)
The code that I've submitted here is just a simplified version of our original
problem, pared down to a simpler form to make it easier for you to read.
It is, of course, up to you to decide whether this is an "actionable" situation.
As our own bug is now solved, my only stake here stems from trying to support
the Guava project--which is why I feel I should reiterate those numbers:
To insert 20000 already-constructed objects into a LinkedHashMultimap takes
orders of magnitude (specifically, 292 times) longer than inserting those exact
same elements into a LinkedListMultimap (or, for that matter, into pretty much
any other kind of non-sorted collection).
In absolute terms, on my brand-new quad-core development machine, it takes over
*18 seconds* to add just 20000 elements!
-------------------
As for your suggestions (use HashMultimap, and provide sizing parameters), I'm
afraid these things do not make much difference.
////////////////////////////////////////////
{
    // pour the data into a hash multimap
    long start = System.currentTimeMillis();
    Multimap<String, String> mmap = HashMultimap.create();
    for ( String key : data ) {
        mmap.put( key, key );
    }
    long end = System.currentTimeMillis();
    System.out.println("HashMultiMap took " + (end-start) + " ms");
}
{
    // pour the data into a linked hash multimap
    long start = System.currentTimeMillis();
    Multimap<String, String> mmap =
        LinkedHashMultimap.create( data.size(), 1 );
    for ( String key : data ) {
        mmap.put( key, key );
    }
    long end = System.currentTimeMillis();
    System.out.println("LinkedHashMultiMap took " + (end-start) + " ms");
}
////////////////////////////////////////////////////////////
Running the code with those changes gave these (very similar) results:
HashMultiMap took 47 ms
LinkedHashMultiMap took 17722 ms
Manual Set took 15 ms
Original comment by cban...@gmail.com
on 27 Oct 2010 at 8:48
I'm getting similar results. I'm aware that there are potential
microbenchmarking issues, but this is very consistent, and an orders-of-magnitude
difference, so I agree that the discrepancy is likely not due to
microbenchmarking issues.
ArrayListMultimap took 14 ms
HashMultimap took 26 ms
LinkedHashMultimap took 8086 ms
LinkedListMultimap took 36 ms
TreeMultimap took 38 ms
A discovery I've made is that this behavior only appears when the key is equal
to the value. If I change things up so it says mmap.put(key, "a"), this
behavior no longer occurs. In this case, we get:
ArrayListMultimap took 13 ms
HashMultimap took 12 ms
LinkedHashMultimap took 34 ms
LinkedListMultimap took 24 ms
TreeMultimap took 39 ms
Actually, it seems like it's enough that the value has the same (or similar)
hash code. Using the following example:
private class MyObject {
    private int hashCode;
    public MyObject(int hashCode) { this.hashCode = hashCode; }
    public boolean equals(Object obj) { return this == obj; }
    public int hashCode() { return hashCode; }
}
If I change the code to be
mmap.put( key, new MyObject(key.hashCode()) );
I get similar results to the first case:
ArrayListMultimap took 13 ms
HashMultimap took 37 ms
LinkedHashMultimap took 8280 ms
LinkedListMultimap took 39 ms
Whereas if I change the MyObject hashCode function to be
public int hashCode() { return hashCode * 2; }
or
public int hashCode() { return 0; }
I get around 70 ms for both cases.
If I change it to either of
public int hashCode() { return hashCode + 1; }
public int hashCode() { return hashCode - 1; }
it comes to around 3000 ms. So I'm guessing there are still some hash collisions
happening in this case, but not to the same extent.
Original comment by matthew.j.justin
on 27 Oct 2010 at 10:44
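The hashCode + 1 timing above fits the collision theory: when the value's hash is keyHash + 1, the XOR keyHash ^ valueHash can only take a handful of distinct values. A quick JDK-only check (class name is illustrative):

```java
import java.util.HashSet;
import java.util.Set;

public class XorCollisionDemo {
    public static void main(String[] args) {
        // h ^ (h + 1) always has the form 2^k - 1, so entries whose value
        // hash is keyHash + 1 share very few distinct entry hash codes.
        Set<Integer> distinct = new HashSet<>();
        for (int h = 0; h < 100_000; h++) {
            distinct.add(h ^ (h + 1));
        }
        System.out.println(distinct.size()); // only 17 distinct values for 100000 keys
    }
}
```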
Looking at the r07 source, it looks like each entry is stored in an
AbstractMapEntry, whose hashCode is an XOR of the key's and value's hash codes.
So, if both hash codes are the same, all the entry hash codes are zero, and the
"transient Collection<Map.Entry<K, V>> linkedEntries;" in LinkedHashMultimap
blows up.
Original comment by steven.m.reed
on 27 Oct 2010 at 10:53
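The zero-hash collapse is easy to reproduce with the JDK's own spec-compliant entry class, since AbstractMap.SimpleEntry uses the Map.Entry hashCode formula:

```java
import java.util.AbstractMap;
import java.util.Map;

public class EntryHashDemo {
    public static void main(String[] args) {
        // Map.Entry.hashCode() is defined as keyHash ^ valueHash, so any
        // entry whose key and value hash equally collapses to 0.
        Map.Entry<String, String> aa = new AbstractMap.SimpleEntry<>("a", "a");
        Map.Entry<String, String> bb = new AbstractMap.SimpleEntry<>("b", "b");
        System.out.println(aa.hashCode()); // 0
        System.out.println(bb.hashCode()); // 0 -- every such entry lands in one bucket
    }
}
```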
Great catch Steven, that's it.
Original comment by jim.andreou
on 27 Oct 2010 at 11:41
So AbstractMapEntry should probably be computing the equivalent of
Arrays.hashCode(new Object[] { getKey(), getValue() }).
Original comment by jim.andreou
on 27 Oct 2010 at 11:55
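For comparison, an order-dependent combination like Arrays.hashCode would keep (a, a) and (b, b) entries apart; a small check:

```java
import java.util.Arrays;

public class ArraysHashDemo {
    public static void main(String[] args) {
        // Arrays.hashCode folds elements as 31 * h + elementHash, so
        // identical key/value pairs no longer cancel to zero.
        int aa = Arrays.hashCode(new Object[] {"a", "a"});
        int bb = Arrays.hashCode(new Object[] {"b", "b"});
        System.out.println(aa != 0 && bb != 0 && aa != bb); // true
    }
}
```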
You guys work fast. :)
Original comment by cban...@gmail.com
on 27 Oct 2010 at 11:55
Except it can't compute the equivalent of Arrays.hashCode(), since it
implements Map.Entry, which states how the hashCode must be computed.
http://download.oracle.com/javase/6/docs/api/java/util/Map.Entry.html#hashCode()
Original comment by matthew.j.justin
on 28 Oct 2010 at 12:30
Map.Entry.hashCode() shouldn't prescribe a specific implementation like that in
its public contract.
- it encourages users of this method to rely on that implementation (or work
around it), which breaks encapsulation.
- if the specified implementation is flawed (as it is in this case), then
implementers of the method are forced to choose between breaking the prescribed
implementation or living with the flaw.
Personally, I'd go with breaking the implementation. Object.hashCode()
describes the needed contract adequately, and the additional comment in
Map.Entry.hashCode() should just be treated as a suggested implementation.
Instead of:
    // current implementation
    return (e.getKey()==null ? 0 : e.getKey().hashCode()) ^
           (e.getValue()==null ? 0 : e.getValue().hashCode());
use something like:
    final Object key = e.getKey();
    final Object val = e.getValue();
    if ( key != null && val != null && key.equals(val) ) {
        // edge case that causes issue 458: key equals value, so the XOR would be 0
        return key.hashCode();
    } else {
        // original implementation
        return (key==null ? 0 : key.hashCode()) ^
               (val==null ? 0 : val.hashCode());
    }
Original comment by cban...@gmail.com
on 28 Oct 2010 at 5:24
1. I do appreciate that you are bringing this to our attention.
2. I apologize for sounding dismissive of the entire issue. But note that I
didn't close it or anything, I just said I didn't feel it was actionable yet.
3. The report was framed as being about the performance of the class in general
while being obviously based on a very pathological data set. This tends to put
the maintainer of that class on the defensive. The most important reason I
cited those links was the part about using representative, real-world data. In
the course of doing that, it would have been discovered that the performance is
only this bad for this exact (odd) situation (a->a, b->b, etc.), and then the
problem could have been described more appropriately.
4. The report also seemed to compare LinkedHashMultimap to Map<K,Set> without
acknowledging the fact that LHM does have to do more work, after all, to
maintain the order of entries(). (Which explains the memory usage differences
reported later.)
5. Yes, I understand that the sheer scale of the difference between results
suggests that *something* bad is probably going on, despite the fact that the
benchmark is fundamentally invalid in several ways. I get that. However, I
still just plain don't like taking any action based on the results of an
invalid benchmark; it's as simple as that. The hardest part of getting it right
is that you still have to grab caliper from subversion and type 'ant', which I
admit is annoying.
6. Of course, a well-made benchmark will still show us this particular problem,
but it'll also show us how good a job we do of fixing it (if we can fix it),
which an invalid benchmark really can't.
I hope my hasty response at least makes a little more sense now, even if it
still wasn't a very good response.
Original comment by kevinb@google.com
on 28 Oct 2010 at 7:17
About Map.Entry's hashCode specification:
Map.Entry should specify its hashCode algorithm, given the collections
framework's general contract that the equivalence (both equals and hashCode) of
all implementations of a given collection interface is based on contents alone,
not on the implementation.
Note that Map specifies its hashCode algorithm in terms of the hash codes of its
entries... as such, Map.Entry implementations must also all have consistent
hash codes.
However, whether it should or not, and whether that algorithm is the best choice
or not, are irrelevant, because it does specify that algorithm. And one thing
that sets Guava apart from other libraries serving similar functionality
(e.g. Commons Collections) is that it consistently adheres to the contracts of
the interfaces it implements.
Original comment by cgdec...@gmail.com
on 28 Oct 2010 at 8:20
I can see your point that Map.Entry.hashCode() probably does need a more
specific contract than Object.hashCode() (i.e. "equality must be defined only
in terms of equality of the key and value", etc.) But unless the Collections
Framework requires that all different Map.Entry implementations produce the
exact *same* hashcode values, there's still no need to specify a concrete
implementation in the method comment.
Of course, as far as patching Guava is concerned, I have no say at all.
The fix I mentioned above is just a suggestion. :)
Original comment by cban...@gmail.com
on 28 Oct 2010 at 8:43
It does require that all different Map.Entry implementations produce the exact
same hashcode values (for the same key and value)! That's why it specifies the
algorithm.
In the collections framework, the equality of two collections should be based
on the elements contained in each, regardless of their implementations. So any
two Maps that contain the same key/value pairs must be equal, regardless of the
Map implementation. Additionally, any two Objects that are equal must have the
same hashCode(). Given that, all implementations of Map must use the same
algorithm for equals() and the same algorithm for hashCode(). Those algorithms
are both based on the entries of the map. Therefore, all implementations of
Map.Entry must use the same hashCode() algorithm or else two Maps with the same
key/value pairs may be equal() but have different hashCode()s.
Far from encouraging users to rely on the implementation they're using, I
believe this actually helps users ignore the implementation they're using. If
the hashcodes of two Maps with the same elements were not guaranteed to be
equal, changing the implementation you're using somewhere could break your code.
Original comment by cgdec...@gmail.com
on 28 Oct 2010 at 9:45
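The constraint described above is observable in the JDK itself: two different Map implementations with the same contents are equal and report the same hashCode:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapHashDemo {
    public static void main(String[] args) {
        Map<String, Integer> hash = new HashMap<>();
        Map<String, Integer> tree = new TreeMap<>();
        hash.put("x", 1);
        hash.put("y", 2);
        tree.put("x", 1);
        tree.put("y", 2);
        // Different implementations, same entries: equal, with equal hash codes.
        System.out.println(hash.equals(tree));                  // true
        System.out.println(hash.hashCode() == tree.hashCode()); // true
    }
}
```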
> In the collections framework, the equality of two collections should be based
> on the elements contained in each, regardless of their implementations.
Ahh, right...and since hashCode() must always agree with equals(), then the
hashCode() method for collections is similarly constrained.
Interesting, thanks for the explanation. That makes sense.
Original comment by cban...@gmail.com
on 28 Oct 2010 at 10:03
Do you think the following two key/value pairs are equal: a=>a and b=>b? And so
their hashCodes should be equal?
If you look closely at the Map specification, it says:
"Implementations are free to implement optimizations whereby the equals
invocation is avoided, for example, by first comparing the hash codes of the
two keys. (The Object.hashCode() specification guarantees that two objects with
unequal hash codes cannot be equal.) More generally, implementations of the
various Collections Framework interfaces are free to take advantage of the
specified behavior of underlying Object methods wherever the implementor deems
it appropriate.".
And Object.hashCode says:
"It is not required that if two objects are unequal according to the
equals(java.lang.Object) method, then calling the hashCode method on each of
the two objects must produce distinct integer results. However, the programmer
should be aware that producing distinct integer results for unequal objects may
improve the performance of hashtables."
Put simply:
if A == B, then #A == #B (if A equals B, then A's hashCode must equal B's hashCode)
if A != B, then #A == #B or #A != #B (if A does not equal B, we don't care about
hashCode equality)
and if #A == #B, the equality of A and B remains unknown
So for me, there is a bug in the Map interface documentation, and we must
stick with the Object.equals and Object.hashCode contracts first!
Original comment by amer...@gmail.com
on 28 Oct 2010 at 10:46
Finally, a=>a and b=>b are not equal (are they?), so we don't care about their
hashCodes.
Original comment by amer...@gmail.com
on 28 Oct 2010 at 10:57
The only viable "solution" I see is having broken (with a """better"""
hashcode) entries internally, and copying them to correct entries on the way out.
But this would decrease performance for everyone (let alone the internal
implementation effort and ugliness), just to serve a very special kind of
entry. So if anything is to be done, it should at most be appending a small
warning to the javadocs, in fine print, if at all (do we document every kind of
data that would create degeneracies?).
Original comment by jim.andreou
on 28 Oct 2010 at 11:05
And to be funny, it is Josh Bloch who wrote the Map interface and who also
wrote, in the very good Effective Java:
"Many classes in the Java platform libraries, such as String, Integer, and Date,
specify the exact value returned by their hashCode method as a function of the
instance value. This is generally not a good idea, as it severely limits your
ability to improve the hash function in future releases. If you leave the
details of a hash function unspecified and a flaw is found in it, you can fix
the hash function in the next release without fear of breaking compatibility
with clients who depend on the exact values returned by the hash function."
Map.Entry.hashCode calculation should not have been specified ;-)
Original comment by amer...@gmail.com
on 28 Oct 2010 at 11:09
> Map.Entry.hashCode calculation should not have been specified ;-)
Actually, cgdecker's earlier comments explain why an exception has to be made
for Collection objects.
As far as fixing this degenerate case, the only thing I can suggest now would
be perhaps finding a different way to store the data internally that sidesteps
the need to use hashing at all.
Original comment by cban...@gmail.com
on 28 Oct 2010 at 11:15
Here's an idea:
We can't change the specification of Map.Entry.hashCode(), but implementations
are free to use that hashcode how they want. Many impls "smear" the hashcode to
generate better hashing.
If a Map.Entry's hashCode is 0, the implementation can try just taking the
hashcode of the key alone.
So, something like comment #10, only external to the Map.Entry, and using the
hashcodes.
// This is used instead of the normal hashCode, internal to a particular
// implementation
private static int hash(Map.Entry<?, ?> entry) {
    Object key = entry.getKey();
    Object val = entry.getValue();
    int keyHash = (key == null) ? 0 : key.hashCode();
    int valHash = (val == null) ? 0 : val.hashCode();
    return (keyHash == valHash) ? keyHash : (keyHash ^ valHash);
}
I shouldn't have to prove that this will return the same int value for any two
Map.Entry instances that are equals().
Original comment by ray.j.gr...@gmail.com
on 29 Oct 2010 at 8:41
I'm not inclined to modify LinkedHashMultimap to improve performance in this
degenerate case.
However, there's a workaround that provides most of the functionality of a
LinkedHashMultimap. Call Multimaps.newSetMultimap(LinkedHashMap,
Supplier<LinkedHashSet>). The keys, and the values for a given key, will follow
the insertion ordering, though the entries() ordering will differ. Insertions
will be roughly twice as fast as a LinkedHashMultimap, and the key=value
performance will be fine.
Original comment by jl...@google.com
on 5 Nov 2010 at 7:25
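A sketch of the suggested workaround (assuming Guava on the classpath; the loop data is illustrative):

```java
import com.google.common.collect.Multimaps;
import com.google.common.collect.SetMultimap;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;

public class WorkaroundDemo {
    public static void main(String[] args) {
        // Back the multimap with a LinkedHashMap of LinkedHashSets:
        // insertion-ordered keys and per-key values, no entry hashing.
        Map<String, Collection<String>> backing = new LinkedHashMap<>();
        SetMultimap<String, String> mmap =
            Multimaps.newSetMultimap(backing, LinkedHashSet::new);
        for (String key : new String[] {"b", "a", "b"}) {
            mmap.put(key, key); // key == value is harmless here
        }
        System.out.println(mmap.keySet()); // [b, a]
    }
}
```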
Original comment by kevinb@google.com
on 18 Jul 2011 at 3:39
FYI, I think this issue has actually been resolved by
http://code.google.com/p/guava-libraries/source/detail?r=c888c6bbb226998980f3fe84587f1f0606578c88
which avoids keeping a HashSet of the entries themselves.
Original comment by wasserman.louis
on 11 Jun 2012 at 8:40
This issue has been migrated to GitHub.
It can be found at https://github.com/google/guava/issues/<id>
Original comment by cgdecker@google.com
on 1 Nov 2014 at 4:15
Original comment by cgdecker@google.com
on 3 Nov 2014 at 9:09
Original issue reported on code.google.com by
cban...@gmail.com
on 27 Oct 2010 at 5:46