gulgi / sparsehash

Automatically exported from code.google.com/p/sparsehash
BSD 3-Clause "New" or "Revised" License

Quadratic scaling behaviour when deleting elements #62

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

sparse_hash_set<whatever> myTable;
// myTable.set_deleted_key(...);  // required by sparsehash before erase()
// insert 200,000 elements into myTable

while ( !myTable.empty() ) {
       sparse_hash_set<whatever>::iterator it = myTable.begin();
       myTable.erase(it);
}

What is the expected output? What do you see instead?

I expect this code to finish in the blink of an eye.  Instead, it takes 
minutes.  Change 200k to 5M, and it won't finish in a day.

If you call resize(0) from time to time, speed improves, but performance is 
still unacceptably low.

This is a minimal example that I came up with.  The underlying problem: if you 
have many deleted elements in a sparse_hash_XXXX container, iterator operations 
are no longer performed in constant time.

What version of the product are you using? On what operating system?

version 1.9.  Linux, gcc 4.4.4

Please provide any additional information below.

A quick look at the code shows that this all has to do with the way deleted 
elements are handled (empty_key).  Many functions contain while() loops 
that must iterate through all deleted elements.

Original issue reported on code.google.com by jan.fost...@gmail.com on 27 Oct 2010 at 9:16

GoogleCodeExporter commented 9 years ago
Yes, you are right, iterating through a sparsetable is not technically constant 
time per increment, but linear in the number of buckets (more specifically, the 
number of sparse-groups).  When there are no deleted items in a sparsetable, 
the number of buckets is proportional to the number of items, so iterating 
through an entire table is amortized constant time.  When there are many 
deleted items, you don't even get amortized constant time.

The suggestion I normally give is to resize the hashtable after doing a lot of 
deletions (or insert something new, which automatically will resize for you) -- 
then iterating is constant time again.

But I admit this depends on a model where someone iterates through an entire 
map, rather than just calling begin() repeatedly.  That is an idiom I haven't 
seen before.  Why do you keep calling begin() instead of keeping track of the 
old it and continuing the search from there?  Is it because you're worried 
erase() invalidates iterators?

That's not true for sparse_hash_map, so you can do a naive erase-loop and have 
it work ok.  That should avoid the problems you're seeing.

Fixing the code to be faster on the idiom you show here would slow down the 
code in the general case.  There may be some tweaks I could make, but really I 
think the better approach would be to write what you're trying to do in a 
different way.

Original comment by csilv...@gmail.com on 27 Oct 2010 at 7:09

GoogleCodeExporter commented 9 years ago
Any more word on this?  I'm looking to make a new release very soon, and don't 
currently plan on taking any action on this issue before then.  Hopefully 
slight changes to the application-level code can work around any problems here.

Original comment by csilv...@gmail.com on 21 Jan 2011 at 12:37

GoogleCodeExporter commented 9 years ago
Indeed, I managed to work around this issue, so it can be considered 
low-priority.

Original comment by jfost...@intec.ugent.be on 10 May 2011 at 12:10

GoogleCodeExporter commented 9 years ago
Thanks, I'll keep the bug report open but lower the priority.

Original comment by csilv...@gmail.com on 10 May 2011 at 7:22

GoogleCodeExporter commented 9 years ago
I've been thinking more about this.  If I wanted to keep iterating over a 
hash_map to stay amortized constant time, the only way I can think of is to 
resize the hashtable at a call to begin().  (Or, more likely, to resize only if 
the number of deleted elements is X% of the number of non-deleted elements.)  
This complicates the simple rule, "We only resize via resize() or insert()," 
but probably that's ok.  (The rule's not so simple, anyway; we also resize() 
when saving to disk, for instance.)

The bigger problem is that begin() is (or can be) a const method, so it's not 
reasonable to resize then.  Even if people used the non-const begin(), they 
would probably naively expect it to be safe to do so from multiple threads 
without locking.  So while appealing, I don't think resizing at begin() is a 
workable idea.

The other option is to store extra state in the sparsetable to make it faster 
to ignore deleted elements when iterating.  That's probably too expensive to do 
in general -- at least, I can't think of a good way -- but perhaps not too 
expensive to do it just for the special case of matching begin().  It would, 
however, increase the size of sparse_hash_map, since we'd need to store at 
least one word of data for this new state.  Since some applications I know of 
use hundreds or thousands of sparse_hash_maps, I hate to increase the class 
size without a demonstrated benefit.

Overall, I think this idiom of repeatedly calling begin() is a pretty rare one; 
at least, I've never seen it before.  So I've decided not to make any code 
changes to speed that case up.

I wish there were some way to tell the programmer to put in a call to resize() 
after doing lots of erase()s, if they don't plan to follow them up with an 
insert().  That would probably speed things up more than any begin() hacks 
would, in any case.

Original comment by csilv...@gmail.com on 26 Aug 2011 at 1:07