matomo-org / matomo

Empowering People Ethically with the leading open source alternative to Google Analytics that gives you full control over your data. Matomo lets you easily collect data from websites & apps and visualise this data and extract insights. Privacy is built-in. Liberating Web Analytics. Star us on Github? +1. And we love Pull Requests!
https://matomo.org/
GNU General Public License v3.0
19.86k stars 2.65k forks source link

Big Data Cardinality: fast unique visitor counting for large instances using HyperLogLog #6212

Open diosmosis opened 10 years ago

diosmosis commented 10 years ago

HyperLogLog is an algorithm that provides 97% accuracy on cardinality counts (ie, unique visitors), while using a very small memory footprint. We could add support for this algorithm within Piwik to not only provide fast cardinality for large numbers of visits, but also we can use it to provide a unique visitor count for any period. HyperLogLog results can themselves be aggregated, so they can be stored in archive tables as blobs.

More info:

mattab commented 10 years ago

I also heard of Bloom filters doing good job: https://en.wikipedia.org/wiki/Bloom_filter

diosmosis commented 10 years ago

According to the second article, bloom filters will take up more space (though provide more accuracy) and the results cannot be merged so the process cannot be done in parallel across several machines.

mattab commented 9 years ago

btw there is now HyperLogLog data structure in Redis http://antirez.com/news/75

update 2021

simple explanation of how it works https://stackoverflow.com/a/12734343/3759928

also from https://segment.com/blog/scaling-up-reporting-on-high-cardinality-metrics/

MySQL has UDFs (user defined functions) that we could use for this, but we use MySQL on AWS, and from my research, there doesn’t seem to be a way to use UDFs on Aurora, or RDS. PostgreSQL on the other hand, has an extension called postgresql-hll, which is available on PostgresSQL RDS.

diosmosis commented 9 years ago

Apparently possible to implement hyperloglog in pure SQL: https://www.periscope.io/blog/hyperloglog-in-pure-sql.html

diosmosis commented 3 years ago

We recently ran some tests with hyperloglog in matomo to use instead of COUNT(DISTINCT idvisitor) and noticed in every case it was far more efficient.

Details about the test:

Test results:

Conclusions:

All of this depends on whether large users would want exact numbers or a close enough estimate, but it might make Matomo usable for very high traffic websites.

tsteur commented 3 years ago

Great findings @diosmosis 💯

From UI perspective when enabling this the biggest challenge will be making it clear to users when accurate numbers are used and when estimates. So if someone was to enable these estimates, then we'd need to change the metric/column name to "Estimated Unique Visitors" and update the description to mention the estimate and it's error rate etc to prevent users getting confused or reporting errors that aggregating the unique visitors from lower periods isn't the same etc.