Open wlue opened 11 years ago
That's the bug line. The HeavyHitters object is considered different if it is the same value but different count.
Actually this is busted:
the problem is, it it assume the old count is known, but the appoximation may change it, and so we need to be more careful. I think what we need to do is just keep a set of keys, and not their counts at all. I don't think you need the sorted set.
Need a failing unit test here and then a fix. I think this might still be broken.
When adding two CountMinSketch objects, is it intentional that the heavy hitters are duplicated when they're summed using the monoid? The
heavyHitters
method on CMS is correct because it filters out. Here's an example where I add two CMS objects, and inspect the heavy hitters internal representation.I would expect that the internal representation of
hhs
would remove the duplicate value. Since I serialize the data, it becomes a major waste of space storing a bunch of old values. For now, I just remove duplicate 'item' HeavyHitter keys when serialized, but I was wondering if this was intentional, since I can see that being the case for faster summing.