manuzhang / read-it-now

Don't read it later; read it now
4 stars 0 forks source link

Counting uniques faster in BigQuery with HyperLogLog++ #2

Open manuzhang opened 7 years ago

manuzhang commented 7 years ago

Article

Notes

Examples

How many unique users did GitHub have in 2016?

#standardSQL
SELECT COUNT(DISTINCT actor.login) exact_cnt
FROM `githubarchive.year.2016`
#6,610,026 (4.1s elapsed, 3.39 GB processed, 320,825,029 rows scanned)
#standardSQL
SELECT APPROX_COUNT_DISTINCT(actor.login) approx_cnt
FROM `githubarchive.year.2016`
#6,643,627 (2.6s elapsed, 3.39 GB processed, 320,825,029 rows scanned)

Distinct count on more than 3 billion Reddit comments

#standardSQL
SELECT COUNT(DISTINCT id) exact
FROM `fh-bigquery.reddit_comments.20*`
#3,168,770,564, 27.9s elapsed, 27.2 GB processed
#standardSQL
SELECT HLL_COUNT.MERGE(sketch) approx
FROM (
 SELECT HLL_COUNT.INIT(id) sketch
 FROM `fh-bigquery.reddit_comments.20*`
)
#3,161,955,636, 5.7s elapsed, 27.2 GB processed

BigQuery standard SQL is compliant with the SQL 2011 standard and has extensions that support querying nested and repeated data