joshday / OnlineStats.jl

⚡ Single-pass algorithms for statistics
https://joshday.github.io/OnlineStats.jl/latest/
MIT License
831 stars 62 forks

hyperloglog sketches #253

Closed spinkney closed 2 years ago

spinkney commented 2 years ago

Is it possible to cache sketches of sets on which you later want to run count-distinct computations? For example, I want users to be able to compose any number of groups, each of which has individually been passed through a HyperLogLog fit. The HyperLogLog storage is the same, but each set is stored as its own "sketch". Then I can get the distinct count of each sketch, or combine any set of sketches to get the estimated distinct count of their union.

This is relevant for audience estimation in digital and television advertising. See this Google paper: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/54a28925b11e05b1d8d1cc5c03f171666dc77e8e.pdf

spinkney commented 2 years ago

I believe the merge function will work. I didn't see it in the documentation (let me know if I missed it); I found it by looking at the source code.

joshday commented 2 years ago

merge! is one of the core pieces of OnlineStats. It's covered on the first page of the docs.
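
For readers wondering why cached sketches can be combined at all: HyperLogLog sketches built with the same precision and hash function can be merged by taking the register-wise maximum, and the result is identical to the sketch of the union. Below is a minimal, self-contained Python sketch of that idea. It is not the OnlineStats implementation; the class name `HyperLogLog`, the `p` parameter, the SHA-1 hash, and the `user-…` IDs are all illustrative assumptions. In OnlineStats the corresponding step is the `merge!` mentioned above.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch (hypothetical, not OnlineStats' code)."""

    def __init__(self, p=12):
        self.p = p                      # precision: number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Hash the item to a 64-bit integer (SHA-1 is an arbitrary choice here).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the first 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # The union of two sketches is the register-wise maximum.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)            # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)          # small-range (linear counting) correction
        return raw


# Two cached "audience" sketches with overlapping members (made-up IDs).
group_a = HyperLogLog()
for i in range(0, 6000):
    group_a.add(f"user-{i}")

group_b = HyperLogLog()
for i in range(4000, 10000):
    group_b.add(f"user-{i}")

# Estimated distinct count of the union, without revisiting the raw data.
union = group_a.merge(group_b)
print(round(group_a.count()), round(group_b.count()), round(union.count()))
```

Because merging only needs the registers, each group's sketch can be stored once and any combination of groups can be unioned later. Note that the union estimate is not the sum of the individual estimates: members that appear in several groups are counted once.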