timbray / topfew

Finds the field values (or combinations of values) which appear most often in a stream of records.
GNU General Public License v3.0
186 stars 6 forks source link

experiment with alternate counter scheme #16

Closed elindsey closed 5 months ago

elindsey commented 5 months ago

Feel free to close this out with no action - just something I was toying with locally and thought I'd share. 🙂

This is an alternate version of counter that relies on packing a key ID into the counts. The upside is no bookkeeping/juggling of two maps, no merging, no pointer indirection/less sparseness, and no use of the sort Func form. The downside is sharing bits between counter and keyspace, so certain datasets (very very high counts and very very high number of unique keys) would be representable in the current version and not in this version. This scenario could be detected and have a fallback, but current code doesn't do it (didn't apply to my use case, kind of an edge case).

timbray commented 5 months ago

I have a 3.18G test file that I use for benchmarking. I checked out your branch and don't seem to observe any significant performance delta - maybe just slightly slower. Could post a copy of the test file for you to grab if you want.

I haven't actually worked out yet how yours works yet, but the narrative sounds cool. Whatever happens, I will go back and figure it out

In any case, what's your opinion? Is there a case to be made for a serious effort to adopt this proposal? Your description of the trade-offs was interesting but didn't discuss practicalities such as maintainability or memory footprint or suitability for particular kinds of computers or whatever. Did you do this just for fun (I approve) or to address a particular use case?

In any case, if you wanted to write this up somewhere, I'd probably link to it and send a few thousand pairs of eyes to have a closer look.

elindsey commented 5 months ago

Could post a copy of the test file for you to grab if you want.

That'd be great! If you don't mind I'll defer on the other questions until I've had a chance to peek a bit more at that test file and write a benchmark. This was purely for fun, but I may have sent it off half-baked. 🙂

timbray commented 5 months ago

Never mind: Found your email, sent URL of test file.

elindsey commented 5 months ago

Going ahead and closing this out - it only really has benefit for large values of -n, but that's a less common use. At smaller values it may be negative because it ends up with a larger final sort. It would be neat if the two techniques could be combined in some form, but haven't thought of a way outside of coarse heuristics.