vigna / fastutil

fastutil extends the Java™ Collections Framework by providing type-specific maps, sets, lists and queues.
Apache License 2.0

ObjectOpenHashSet performs removeAll() with O(M*N) complexity #320

Open vladap opened 1 month ago

vladap commented 1 month ago

ObjectOpenHashSet inherits removeAll(Collection<?>) from AbstractCollection, which iterates over the ObjectOpenHashSet and calls contains(elem) on the collection of elements to remove. That contains call is not O(1) unless the argument is itself a Set; typically it is O(N), which makes the whole operation O(M*N).
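To make the cost concrete, here is a small illustrative demo (not fastutil code): a set-like class that, like ObjectOpenHashSet, inherits removeAll() from AbstractCollection, plus a list that counts its contains() calls. The class names are hypothetical.

```java
import java.util.AbstractCollection;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;

// AbstractCollection.removeAll() iterates over this collection and probes
// the argument once per element, so an O(N) contains on the argument
// yields O(M*N) total work.
public class RemoveAllDemo {

    // Stand-in for ObjectOpenHashSet: fast membership, but removeAll()
    // comes straight from AbstractCollection.
    static class PlainSet<E> extends AbstractCollection<E> {
        private final HashSet<E> backing = new HashSet<>();
        @Override public boolean add(E e) { return backing.add(e); }
        @Override public Iterator<E> iterator() { return backing.iterator(); }
        @Override public int size() { return backing.size(); }
    }

    // An ArrayList (O(N) contains) that counts contains() invocations.
    static class CountingList<E> extends ArrayList<E> {
        int containsCalls;
        @Override public boolean contains(Object o) {
            containsCalls++;
            return super.contains(o);
        }
    }

    // Returns { resulting set size, number of contains() calls }.
    static int[] run() {
        PlainSet<Integer> set = new PlainSet<>();
        for (int i = 0; i < 1000; i++) set.add(i);
        CountingList<Integer> toRemove = new CountingList<>();
        for (int i = 0; i < 100; i++) toRemove.add(i);
        set.removeAll(toRemove);   // contains() probed once per set element
        return new int[] { set.size(), toRemove.containsCalls };
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println(r[0] + " " + r[1]); // prints "900 1000"
    }
}
```

Removing 100 elements from a 1000-element set triggers 1000 linear scans of the argument list: exactly the O(M*N) behavior described above.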

In general, all ObjectSet implementations are affected whenever their contains implementation has better complexity than O(N).

The non-object (primitive) implementations avoid this by overriding removeAll, which internally uses the overridden rem/remove implementation.
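That strategy can be sketched as follows; this is an illustrative simplification using a plain HashSet, not fastutil's actual code:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;

// Sketch of the primitive-style strategy: iterate over the argument and
// call the set's own expected-O(1) remove, for O(M) total work instead of
// O(M*N). Class and method names are illustrative, not fastutil's.
public class PrimitiveStyleRemoveAll {
    public static <E> boolean removeAll(HashSet<E> set, Collection<? extends E> c) {
        boolean retVal = false;
        for (E e : c)
            retVal |= set.remove(e);  // one hash lookup per element of c
        return retVal;
    }

    public static void main(String[] args) {
        HashSet<Integer> s = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5));
        System.out.println(removeAll(s, Arrays.asList(2, 4, 99)) + " " + s.size());
        // prints "true 3"
    }
}
```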

Another point is that the implementation is missing the optimization of iterating over the smaller collection and calling contains on the larger one, BUT ONLY for the case where contains has better complexity than O(N). The closest interface one can test for is Set, whose contains might still be O(N). Even so, it would improve performance in some cases, and the worst case is no improvement (just a swapped iteration).

The optimization must avoid the same bug as the JDK's AbstractSet.removeAll implementation, which can switch to iterating over a Set even when contains on the other collection is slower. Because this problem has never been fixed in the JDK, other libraries have workarounds that can be taken as inspiration; for example, Guava introduces ImprovedAbstractSet in its Sets class.
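A simplified sketch of the Guava-style guard (Multiset handling omitted; not a verbatim copy of Guava's Sets.removeAllImpl): iterate over this set and probe the argument only when the argument is itself a Set, so its contains is presumably fast, AND it is larger than this set; otherwise iterate the argument and use this set's remove.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public final class SafeRemoveAll {
    public static <E> boolean removeAll(Set<E> set, Collection<?> c) {
        boolean changed = false;
        if (c instanceof Set && c.size() > set.size()) {
            // Both sides have fast contains and c is the larger one:
            // iterate the smaller set, probing the larger collection.
            for (Iterator<E> it = set.iterator(); it.hasNext(); ) {
                if (c.contains(it.next())) { it.remove(); changed = true; }
            }
        } else {
            // Default path: never calls contains() on a possibly-O(N)
            // collection; uses the set's own expected-O(1) remove.
            for (Object o : c) changed |= set.remove(o);
        }
        return changed;
    }

    public static void main(String[] args) {
        Set<Integer> s = new HashSet<>(Arrays.asList(1, 2, 3, 4));
        System.out.println(removeAll(s, Arrays.asList(2, 3)) + " " + s.size());
        // prints "true 2"
    }
}
```

The key difference from AbstractSet.removeAll is that the iteration swap only happens when the other collection is known to be a Set, so the worst case degrades gracefully instead of becoming quadratic.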

Could fastutil be similarly improved? Any suggestions/ideas?

Guava issue 1013: performance problem in LinkedHashMultimap

Guava ImprovedAbstractSet implementation

Guava removeAll(Set<?> set, Collection<?> collection)

Guava removeAll(Set<?> set, Iterator<?> iterator)

vigna commented 1 month ago

I guess this can be done. Still, I think that either you design APIs with guaranteed performance (Boost-style, for example), or, by making the code "smarter and smarter" (but without proper information), you end up in situations that make it extremely difficult to diagnose performance problems. For example, it is true that it is probably better to iterate over the small collection, but if you have, say, two hash sets with significantly different load factors, the opposite might be true.

Type-specific collections use an inverted logic by pure chance. It might be worse than the Java logic in some cases.

Possibly one could first check whether only one of the two collections is a set, in which case contains should be called on that one. And if both are sets, one could iterate over the smaller one.
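That heuristic could be sketched like this (purely illustrative; the names are hypothetical and this is not fastutil API):

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Probe (call contains/remove on) whichever side is a Set; if both are
// sets, iterate over the smaller one.
public final class HeuristicRemoveAll {
    public static boolean removeAll(Collection<?> from, Collection<?> toRemove) {
        boolean fromIsSet = from instanceof Set;
        boolean otherIsSet = toRemove instanceof Set;
        boolean iterateFrom;
        if (fromIsSet != otherIsSet) {
            // Exactly one side is a set: iterate the other side, so the
            // presumably fast contains/remove is the one called repeatedly.
            iterateFrom = otherIsSet;
        } else if (fromIsSet) {
            // Both are sets: iterate the smaller one.
            iterateFrom = from.size() <= toRemove.size();
        } else {
            // Neither is a set: no informed choice; keep the default.
            iterateFrom = true;
        }
        boolean changed = false;
        if (iterateFrom) {
            for (Iterator<?> it = from.iterator(); it.hasNext(); )
                if (toRemove.contains(it.next())) { it.remove(); changed = true; }
        } else {
            for (Object o : toRemove) changed |= from.remove(o);
        }
        return changed;
    }

    public static void main(String[] args) {
        Set<Integer> a = new HashSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> b = new HashSet<>(Arrays.asList(2, 3, 4, 5));
        System.out.println(removeAll(a, b) + " " + a.size()); // prints "true 1"
    }
}
```

As noted above, size alone is an imperfect proxy (load factors differ), so this remains a heuristic rather than a guarantee.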

Still, I've always thought of addAll()/removeAll() as convenience methods for tests and small examples. If I'm manipulating large amounts of data, I'll write a removal loop optimized for the setup at hand.