OK, I tracked this down. The problem is one that I had thought about before,
but never seen in action until now: I combine the logic for find() and
insert(). As a result, the insert() call takes a lot more time for this
test-case than it ought to, because you've created an access pattern where
lookups are much more expensive than inserts.
I don't expect such an access pattern to occur very often in real-life data, so
I don't think this is a code-red situation. But it is worth cleaning up: the
hashtable code is doing extra work for no good reason. I'll look to get that
fix into the next release.
Original comment by csilv...@gmail.com
on 28 Jun 2011 at 9:08
btw, was this motivated by an actual use-case you were seeing? Or did you
happen to just be trying out various benchmark-type codes?
Original comment by csilv...@gmail.com
on 28 Jun 2011 at 9:09
There is no real use case behind it; this example is from my benchmark of
different hash map implementations. But there are many applications that do an
insert right after an erase (although not erase(i) followed by insert(i)),
e.g. when you keep a list of the most recently used elements (a cache).
Still, this case is really strange. As I understand the dense_hash algorithm
description, it shouldn't cause any problem: mark the element as deleted, do a
find, see that the bucket is deleted, and put the new element into it. All of
these actions should be very fast for int keys.
P.S. Anyway, thanks for the lib.
Original comment by bulovya...@gmail.com
on 28 Jun 2011 at 11:04
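For concreteness, the erase-then-reinsert pattern being discussed looks roughly
like this with dense_hash_map (a sketch only; the header path, the table size N,
and the empty/deleted key values are placeholders, not taken from this thread):

#include <google/dense_hash_map>   // <sparsehash/dense_hash_map> in newer releases

int main() {
  const int N = 1000000;                        // placeholder size
  google::dense_hash_map<int, int> m;
  m.set_empty_key(-1);                          // dense_hash_map requires an empty key
  m.set_deleted_key(-2);                        // ...and a deleted key before erase() is used
  for (int i = 0; i < N; ++i) m[i] = i + 1;     // prime the table
  for (int i = 0; i < N; ++i) { m.erase(i); m[i] = i + 1; }   // erase, then re-insert the same key
  return 0;
}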
Hmm, on second thought this is not easy to fix. When we see that a bucket is
deleted, we only know we can insert the item there if we know the item doesn't
already exist in the hashtable. Because we do open addressing, we need to probe
all the way to the end of the chain to tell this.
Your example is a pathological case for this: it requires all the items to be
hashed right next to each other (which the identity hash does on this input),
and requires erasing and re-inserting the same element. Any deviation from
that -- for instance, a real life top-n cache, where the item being deleted is
different from the one being inserted -- would not have this bad behavior.
There are various hacks that would improve this benchmark, but not performance
in real-life cases. Without any indication that real apps are suffering from
this, I'll probably leave it as-is. I'll keep the bug open, though, but lower
the priority.
Original comment by csilv...@gmail.com
on 29 Jun 2011 at 2:27
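In other words, the constraint described above is that a deleted slot can be
reused only after the probe sequence has reached an empty bucket, proving the
key is not stored further along the chain. A generic open-addressing sketch of
that logic (not the actual sparsehash code; Bucket, State, and NPOS are made-up
names) might look like:

#include <cstddef>
#include <vector>

enum State { EMPTY, DELETED, FULL };
struct Bucket { State state; int key; int value; };
static const std::size_t NPOS = static_cast<std::size_t>(-1);

// Returns the slot where `key` lives or should be inserted.  A deleted
// bucket is remembered as a candidate, but probing continues until an
// empty bucket proves the key is absent.
std::size_t find_insert_slot(const std::vector<Bucket>& table, int key, std::size_t hash) {
  const std::size_t mask = table.size() - 1;    // table size is a power of two
  std::size_t first_deleted = NPOS;
  std::size_t pos = hash & mask;
  for (std::size_t probe = 0; ; pos = (pos + ++probe) & mask) {   // triangular probing
    const Bucket& b = table[pos];
    if (b.state == EMPTY)
      return first_deleted != NPOS ? first_deleted : pos;   // end of chain: key not present
    if (b.state == FULL && b.key == key)
      return pos;                                            // key already present
    if (b.state == DELETED && first_deleted == NPOS)
      first_deleted = pos;                                   // candidate slot; keep probing
  }
}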
For future reference, here is the test I added to time_hash_map.cc to document
the slowness here:
template<class MapType>
static void time_map_reverse_toggle(int iters) {
  MapType set;
  Rusage t;
  int i;
  // Just like toggle(), but we do erase+insert rather than insert+erase.
  // We prime the hashtable outside the benchmarking.
  set.resize(iters);
  for (i = 0; i < iters; i++) {
    set[i] = i+1;
  }
  const size_t start = CurrentMemoryUsage();
  t.Reset();
  for (i = 0; i < iters; i++) {
    set.erase(i);
    set[i] = i+1;
  }
  double ut = t.UserTime();
  const size_t finish = CurrentMemoryUsage();
  report("map_reverse_toggle", ut, iters, start, finish);
}
Original comment by csilv...@gmail.com
on 29 Jun 2011 at 3:27
Just want to mention that the problem isn't tied to the serial access pattern
and the conclusions about hashing. Something like this is also slow:
for (int i = 0; i < size; ++i) set.insert(i);
for (int i = 0; i < size; ++i) { set.erase(i); set.insert((i+1)*(i+1) + size); }
Original comment by bulovya...@gmail.com
on 29 Jun 2011 at 9:29
Hmm, I would expect we'd do better on this than we do on your first case, but
maybe not totally great.
Basically, any time you have a really uneven distribution of hash values,
you'll see bad results. That's what happens here: we have 2^k buckets (for
some k), but all the hash values are in the range 1..size, and there are none
in the range size..2^k -- at least, for your initial insert.
If you used a non-trivial hashing function -- even just a simple bit-mix --
performance here would be much better. In general, dense_hash_* and
sparse_hash_* assume a good hashing function. Often, the identity hash is not
good.
Original comment by csilv...@gmail.com
on 29 Jun 2011 at 8:34
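For instance, a simple bit-mixing hash for integer keys, in the spirit of the
suggestion above, could look like the following (the constants are the 64-bit
finalizer from MurmurHash3; the functor itself is illustrative, not part of
sparsehash):

#include <cstddef>
#include <stdint.h>

struct MixHash {
  std::size_t operator()(uint64_t x) const {
    // MurmurHash3 64-bit finalizer: a cheap mix that spreads clustered
    // keys like 1..size over the whole range of hash values.
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return static_cast<std::size_t>(x);
  }
};

Passed as the hash functor, a mix like this keeps the buckets evenly populated
even when the keys themselves are consecutive integers.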
I just tested the code with MCT (https://launchpad.net/libmct) and it appears
to run at about the speed of the stdlib implementation (but note that
rehash(size) is not very appropriate for closed hashing; you need something
like rehash(2 * size) instead).
I have the following in the library code:
start_probing (std::size_t hash) const
{
  // We require that number of buckets is a power of 2. To compensate, we involve a
  // prime number here; this usually improves speed with particularly bad hash
  // function choices.
  return (hash * size_t_constants <>::HUGE_PRIME) & (this->num_buckets - 1);
}
I haven't really run many tests in different situations, but this trick seems to
break up such pathological cases nicely, while having little effect in "normal"
cases.
Original comment by pogonys...@gmail.com
on 30 Jun 2011 at 10:06
Yeah, there's a philosophical debate here; we had a similar issue come up
recently with respect to hashing pointers (which likewise do poorly with the
identity hash). Multiplying by a prime is a good way of improving a bad hash
function, but it's not free. My feeling is that people who already have a good
hash function shouldn't have to pay the price for that extra multiplication all
the time.
Another way of saying this is: all your tests would do much better if you
replaced the identity hash with a hash function like
return x * HUGE_PRIME
The flip side of this is that users need to know to do that, and need to
recognize that hashtable performance is a bottleneck for them and this is a
good way to fix it. I don't know the best way to do that, but I feel that
trying to do that is better than trying to correct for a bad hash function in
the hashtable code.
Original comment by csilv...@gmail.com
on 30 Jun 2011 at 10:23
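In user code, that suggestion might look something like the following (the
multiplier is an arbitrary large odd constant standing in for HUGE_PRIME, and
the header path may differ by release; the hash functor goes in the third
template parameter of dense_hash_map):

#include <cstddef>
#include <google/dense_hash_map>   // <sparsehash/dense_hash_map> in newer releases

struct PrimeMultHash {
  std::size_t operator()(int x) const {
    // Stand-in for HUGE_PRIME: a large odd multiplier spreads clustered
    // integer keys across the full range of hash values.
    return static_cast<std::size_t>(x) * static_cast<std::size_t>(0x9e3779b97f4a7c15ULL);
  }
};

int main() {
  google::dense_hash_map<int, int, PrimeMultHash> m;
  m.set_empty_key(-1);
  for (int i = 0; i < 1000; ++i) m[i] = i + 1;
  return 0;
}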
Yeah, I also had doubts about this line. However, I considered that there are
two "lines of thought" when implementing a hash table: one dictates 2^n buckets
(which both SparseHash and MCT use), the other a prime number of buckets (e.g.
GLib, as far as I remember). So "to compensate" in the comment was written with
that second option in mind: some implementations already do something similar,
because they compute hash % prime_num_buckets.
Original comment by pogonys...@gmail.com
on 30 Jun 2011 at 10:46
Yeah, reducing the hash modulo a prime is more expensive than masking against a
power of 2. Again, I see all of these as hacks to work around the fact that the
user has not picked a good hash function. On the other hand, picking a good
hash function can be difficult, and maybe not everyone shares my views of what
a 'good' hash function is. :-)
For strings, I think it's pretty straightforward actually: you need lots of
mixing, and something like MurmurHash is going to be good for everyone -- it
has all the good properties a hashtable implementation might need. The problem
is for things like numbers: you're tempted to use the identity hash or
something else really fast, which may be a perfectly fine 'shortcut' for some
hashtable implementations but not for others.
I'm torn as to what to do here. I don't want to speed-penalize folks who are
using a 'good' hash function already. But on the other hand, paying for an
extra multiply is cheap compared to the cost of using a bad (for us) hash
function.
I wish I could identify when a hash function was 'bad' and correct for it --
something like adaptive hash evaluation. But I think that would be too slow:
probably slower than just always multiplying. My intuitions about speed are
probably not that great here.
Overall, I'm not yet convinced this problem is widespread enough in the wild to
justify making any changes. I'm going to close the bug WillNotFix, but feel
free to reopen it if you'd like to try to convince me I'm mistaken. :-)
Original comment by csilv...@gmail.com
on 26 Aug 2011 at 1:26
Original issue reported on code.google.com by
bulovya...@gmail.com
on 28 Jun 2011 at 5:39