haskell / containers

Assorted concrete container types
https://hackage.haskell.org/package/containers
314 stars 177 forks source link

Complexity of IntMap-IntMap operations #1011

Open meooow25 opened 3 weeks ago

meooow25 commented 3 weeks ago

Recently, something made me notice the documented complexity of IntMap-IntMap operations, and I realized that I have never really thought about it.

Let's take IntMap union, for instance.

https://github.com/haskell/containers/blob/5d4bc2ed62475062a09d76a037ca7a72ad3cbd45/containers/src/Data/IntMap/Internal.hs#L1090

The claim is $O(n + m)$, which is not hard to verify. We match on some (in a bad case all) of the constructors of the two maps, of which there are a linear number, and whenever we do so we perform a constant amount of work. Hence $O(n + m)$.

But we can try to find a tighter bound.

Consider that $m \le n$, and we think of the smaller map. In the worst case, we might visit all constructors of the map, which makes $m$ a lower bound. The work we do in addition to this, is whenever we reach forks in the bigger map which are not in the smaller map. How many such forks can there be? The worst case is that for every leaf in the smaller map, the path to the root from a leaf hits every possible fork (i.e. forks at every power of 2). So, we need the number of nodes of the subset of the complete binary tree of all ints which forms paths from the root to the leaf keys of the map. For a map of size $m$ and the complete binary tree having height $W$, this is $O(m(W - \log m))$. Since we still have the $n + m$ upper bound as before, the overall complexity for $m \le n$ is

$$O(\min(n, m(W - \log m)))$$

I can note two interesting things about it:

  1. If we consider $m$ as a small constant, the complexity is $O(\min(n,W))$, which checks out to be the complexity we have for insert.
  2. The second term, $m (W - \log m) = m \log(\frac{2^W}{m})$, is not unlike the complexity for Map union, $O(m \log ( \frac{n}{m} ))$.

If this looks correct, I would like to update the documentation to reflect it.

Why does it matter? After all, the current bound is not incorrect.
For accuracy, and also it would help better estimate complexity in derived scenarios. For instance, consider foldl' union empty intmaps on $r$ IntMaps of size $m$ each. With the current bound, we would calculate the complexity to be $O(r^2 m)$. With the new bound, we get $O(rm(W - \log m))$. We went from quadratic in $r$ to linear, which is a very useful distinction.