Open GoogleCodeExporter opened 9 years ago
Hi Atul,
Interesting.
The original purpose of the eager calculation of hashCode (calcHashCode()) in
the Query objects, is for StandingQueryIndex, with the expectation that query
fragments would be reused a lot. All queries, once constructed, are immutable
so my rationale was that it made sense to cache it for such reuse.
However this all hinges on whether queries will typically be reused, or not -
and reused with StandingQueryIndex. By reused I mean will the exact query
objects be cached and used again, and not equivalent query objects constructed
again. I originally envisaged that queries in many applications would be cached
and reused against changing datasets. But thinking about it now, I think in
hindsight it might have been the wrong direction.
I've accepted this as a bug, and applied a fix in trunk. The fix is to make
hashCode non-final, nullable, and to calculate it on-demand and not eagerly.
Once calculated, it will be reused to avoid recalculation. This should provide
the best of both worlds.
I'll leave this issue open until the next release of CQEngine, which will
include the fix. In the meantime you could svn checkout and mvn install
locally. Hopefully this will fix the bottleneck.
Thanks!
Niall
Original comment by ni...@npgall.com
on 30 Oct 2013 at 1:32
Hi Niall,
Thanks for the quick fix. Even though Java Memory model ensures assignment of References to be atomic, it is slightly more expensive to ensure atomicity. (Note that assignments to double/long are not atomic). I recommend, moving to "int" instead of Integer to keep hashCode.
You can see hashCode() implementation of String.java to see how they use this guarantee of Java Memory Model, to prevent synchronization.
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/jav
a/lang/String.java#String.hashCode%28%29
Original comment by anita.v...@inmobi.com
on 30 Oct 2013 at 9:29
Thanks Niall, this seem to have fixed our issue in production.
Original comment by anita.v...@inmobi.com
on 30 Oct 2013 at 10:58
Hi Atul,
Great that this fixed the bottleneck.
I made a few tradeoffs in selecting Integer here.
Regarding the java memory model, I don't think the example in
java.lang.String.hashCode() is availing of atomicity or a guarantee provided by
the JMM as a way to avoid synchronization. java.lang.String.hash is not
volatile, which means that writes to it may be atomic, but there is no
visibility guarantee. If thread A calls hashCode() and hash has not been
computed, thread A will compute and store it. But if thread B or other threads
then call hashCode() any time later, they are not guaranteed to see the hash
value stored by thread A. They might all individually see 0 for an arbitrary
period of time, and they may each recompute it. Worst case it would be like
thread-local (actually CPU core-local). There is no coordination between
threads at all.
I think Sun/Oracle were relying on the idempotence of the hash function here to
allow multiple threads to recompute hashcode needlessly, rather than a
guarantee provided by the JMM. I was relying on idempotence in a similar way in
my use of a non-volatile Integer, knowing that it could be recomputed
needlessly on multicore systems, but that the hash functions were idempotent.
I've seen Sun/Oracle use idempotence to avoid volatile reads/writes in a few
places, so I guess there is some merit in it. Number of recomputes must be rare
in practice, and the worst case would be bounded by the number of cores in the
system.
Also in java.lang.String.hashCode(), there is additional weirdness. They seem
to use 0 as both a hashcode, and a special value to indicate that hashcode has
not yet been computed. I'm sure they had good justification, but if a
particular sequence of characters actually computes to hashcode 0, I think then
java.lang.String will recompute the hashcode every time hashCode() is called on
those strings. So some instances of java.lang.String might not benefit from
caching at all. In CQEngine, applications might supply objects where hashcode 0
is more common, so I can't use 0 as a special value.
I agree that a data type smaller than an object reference will be cheaper.
Using Integer was lazy :) Before the next release I will add a boolean
hashCodeComputed flag, and make hashCode an int. This will be cheaper and avoid
object allocations for Integer. It may further improve performance a bit.
Best regards,
Niall
Original comment by ni...@npgall.com
on 30 Oct 2013 at 1:11
hashCode() in string doesn't make sure two threads up-date simultaneously, it
only makes sure, that hash code is correct. and if possible avoid
recomputation. Two threads may compete and end up computing the same value
again. Also it is possible that recomputation happens whenever the value is
zero.
Imagine a 64 bit value in a 32 bit machine. Imagine ab is a long variable.
ab = (1 << 32 + 2)
it will write to location ab, 1 and 2 in two CPU cycles. So if another thread
is trying to write:
ab = (3 << 32 + 4)
You might end up with:
ab = 1 << 32 + 4
or
ab = 3 << 32 + 2
But Java memory model promises this will NOT HAPPEN for refernces, but MAY
HAPPEN for double and long. Even though References are 64-bit values.
This is ensured by additional mechanism which is expensive in a 32 bit machine.
Using a boolean and int will not work, because you will be subject to
code-reordering issues. One will have to either "synchronize/lock", or
introduce artificial memory barriers (or use volatiles)
http://docs.oracle.com/javase/specs/jls/se7/jls7.pdf
Original comment by anita.v...@inmobi.com
on 30 Oct 2013 at 5:12
[deleted comment]
Hi Atul,
Yes I'm familiar with that stuff. I think we are on the same page.
You raise a good point regarding code reordering if I were to introduce a
separate boolean. Both would then need to be volatile.
So which do you prefer, to leave it as-is with a non-volatile reference, or to
have two volatile primitives instead?
EDIT: actually, not sure if both primitives need to be volatile. Maybe only the
boolean, will confirm.
Original comment by ni...@npgall.com
on 30 Oct 2013 at 5:47
[deleted comment]
It won't be any better than non-volatile Integer. The fastest performance will
be to use String like implementation of using just 'int'.
(non-volatile) int hash;
hashCode() {
int h = this.hash;
if (h == 0) {
h = calcHashCode();
}
if (h == 0) {
h = 1; // Additional optimization..
}
hash = h;
return h;
}
Original comment by anita.v...@inmobi.com
on 30 Oct 2013 at 8:00
Updated in trunk. I liked your idea of coalescing to a non-zero code, if zero
is calculated...
@Override
public int hashCode() {
// Lazy calculate and cache hashCode...
int h = this.hashCode;
if (h == 0) { // hashCode not cached
h = calcHashCode();
if (h == 0) { // 0 is normally a valid hashCode, coalesce with another...
h = -976419167; // was randomly chosen
}
this.hashCode = h; // cache hashCode
}
return h;
}
Original comment by ni...@npgall.com
on 6 Nov 2013 at 10:24
Fixed in release 1.2.4. Many thanks!
Original comment by ni...@npgall.com
on 22 Nov 2013 at 10:02
Original issue reported on code.google.com by
anita.v...@inmobi.com
on 28 Oct 2013 at 4:06