Counting uniques faster in BigQuery with HyperLogLog++ - Githubissues

manuzhang / read-it-now

Don't read it later; read it now

4 stars 0 forks source link

Counting uniques faster in BigQuery with HyperLogLog++ #2

Open manuzhang opened 7 years ago

manuzhang commented 7 years ago

Article

https://cloud.google.com/blog/big-data/2017/07/counting-uniques-faster-in-bigquery-with-hyperloglog

Notes

HyperLogLog is an algorithm to approximately count distinct elements.

The HyperLogLog algorithm is able to estimate cardinalities of > 10^9 with a typical error rate of 2%, using 1.5 kB of memory
Google Cloud's data warehouse, BigQuery uses HyperLogLog++ which provides a bunch of improvements like better accuracy for very large cardinalities (> 1 billion).

Examples

How many unique users did GitHub have in 2016?

#standardSQL
SELECT COUNT(DISTINCT actor.login) exact_cnt
FROM `githubarchive.year.2016`
#6,610,026 (4.1s elapsed, 3.39 GB processed, 320,825,029 rows scanned)

#standardSQL
SELECT APPROX_COUNT_DISTINCT(actor.login) approx_cnt
FROM `githubarchive.year.2016`
#6,643,627 (2.6s elapsed, 3.39 GB processed, 320,825,029 rows scanned)

Distinct count on more than 3 billion Reddit comments

#standardSQL
SELECT COUNT(DISTINCT id) exact
FROM `fh-bigquery.reddit_comments.20*`
#3,168,770,564, 27.9s elapsed, 27.2 GB processed

#standardSQL
SELECT HLL_COUNT.MERGE(sketch) approx
FROM (
 SELECT HLL_COUNT.INIT(id) sketch
 FROM `fh-bigquery.reddit_comments.20*`
)
#3,161,955,636, 5.7s elapsed, 27.2 GB processed

BigQuery standard SQL is compliant with the SQL 2011 standard and has extensions that support querying nested and repeated data