snarfed / bridgy

📣 Connects your web site to social media. Likes, retweets, mentions, cross-posting, and more...
https://brid.gy
Creative Commons Zero v1.0 Universal

port statistics and to_tsv.py to bigquery #715

Closed: snarfed closed this issue 6 years ago

snarfed commented 7 years ago

we currently use to_tsv.py and a google spreadsheet to generate our periodic usage and growth stats, e.g. https://snarfed.org/2016-06-06_bridgy-stats-update-2

...but we've outgrown it. we have ~25GB of data now, so even downloading it is prohibitive, and processing it afterward takes ~4GB of memory and >6h on a 2014 MBP. ugh.

fortunately google has solved this for us. we can load the existing datastore backups into bigquery, then port to_tsv.py to a handful of SQL queries. we may even get better viz and charts for free. yahtzee!
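as a sketch of the shape this takes (assuming the Response entities land in a datastore.Response table with a created timestamp, which is how it works out below), a whole pass of to_tsv.py collapses to something like:

SELECT CAST(created AS date) AS day, COUNT(*) AS responses
FROM datastore.Response
GROUP BY day
ORDER BY day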

snarfed commented 7 years ago

loading into bigquery would also help with a recent request from @tantek: find the most webmentioned posts, both all time and ongoing. https://chat.indieweb.org/2017-08-16#t1502909939532000
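roughly, a query like this could answer it, assuming each Response's sent repeated property holds the webmention target URLs (that's how the final query below uses it). all time as written; ongoing by adding a WHERE created >= ... filter:

SELECT target, COUNT(*) AS webmentions
FROM datastore.Response r, r.sent target
GROUP BY target
ORDER BY webmentions DESC
LIMIT 20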

snarfed commented 7 years ago

i tried to load the datastore backups for the Response entity into bigquery and hit a couple roadblocks.

first, the bigquery docs claim it supports wildcards in cloud storage URIs, e.g. gs://brid-gy.appspot.com/weekly/datastore_backup_full_2017_08_06_Response/15677859164382301757551D37199F8/*, but when i tried, it failed with "not found". details in https://bigquery.cloud.google.com/jobs/brid-gy.

second, when i tried loading a single file (output-0), i hit the size limit. app engine datastore backup generates 30-65MB files, but bigquery's limit is 10MB. got this error: Backup_info file 'gs://brid-gy.appspot.com/weekly/datastore_backup_full_2017_08_06_Response/15677859164382301757551D37199F8/output-0' is too big: 68780032 bytes. Maximum allowed size is: 10485760 bytes (error code: invalid), which links to https://cloud.google.com/bigquery/troubleshooting-errors#invalid.

snarfed commented 7 years ago

aha, i needed to load the .backup_info file: https://cloud.google.com/bigquery/loading-data-cloud-datastore#webui

e.g. gs://brid-gy.appspot.com/weekly/aglzfmJyaWQtZ3lyQQsSHF9BRV9EYXRhc3RvcmVBZG1pbl9PcGVyYXRpb24Yqd-8BQwLEhZfQUVfQmFja3VwX0luZm9ybWF0aW9uGAEM.Response.backup_info

snarfed commented 7 years ago

ok, Responses are loaded, and i opened the dataset up to the public: https://bigquery.cloud.google.com/table/brid-gy:datastore.Response

i may or may not maintain it. we'll see.
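so anyone can query it directly, e.g. (a sketch; field names match the final query below):

SELECT COUNT(*) AS responses, SUM(ARRAY_LENGTH(sent)) AS sent_webmentions
FROM `brid-gy.datastore.Response`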

snarfed commented 6 years ago

loaded the rest of the data (accounts, publishes, etc.) into bigquery from an 11/19 snapshot, and started trying google data studio as a BI tool: https://datastudio.google.com/open/10lOfwjq6844GqRmqtfv-2nZc2F9BZyWq

overall, it's improved a lot since i tried it this past summer for indie map...but it's still not quite ready for prime time.

  1. it chokes on big time series datasets, e.g. the 7+M Response entities: "The server encountered an internal error and was unable to complete your request. Error ID: ce0acec0"
  2. it can't add calculated fields (e.g. COUNT) on bigquery views or custom queries, which i kind of need for merging all of the different source (account) types (sketched below). i'd load them all into the same bigquery table, but bigquery's datastore import doesn't support that.

i've sent these as feedback and filed them in the issue tracker: https://issuetracker.google.com/issues/69639296 https://issuetracker.google.com/issues/69639425
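for the record, the combined view is basically a UNION ALL over the per-kind tables the datastore import creates. a sketch of its shape, not the actual view definition, with the column list abbreviated:

CREATE VIEW datastore.Sources AS
SELECT created, features, __key__ FROM datastore.Blogger UNION ALL
SELECT created, features, __key__ FROM datastore.FacebookPage UNION ALL
SELECT created, features, __key__ FROM datastore.Flickr UNION ALL
SELECT created, features, __key__ FROM datastore.GooglePlusPage UNION ALL
SELECT created, features, __key__ FROM datastore.Instagram UNION ALL
SELECT created, features, __key__ FROM datastore.Medium UNION ALL
SELECT created, features, __key__ FROM datastore.Tumblr UNION ALL
SELECT created, features, __key__ FROM datastore.Twitter UNION ALL
SELECT created, features, __key__ FROM datastore.WordPress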

snarfed commented 6 years ago

next candidates: qlik and mode.

snarfed commented 6 years ago

so: qlik's bigquery connector requires their desktop app, and mode only offers a free trial, not a permanent free tier, and my trial has ended.

i'm now trying out metabase locally. it's working pretty well so far.

snarfed commented 6 years ago

spoke too soon. metabase has been giving me a hell of a time dealing with bigquery views (e.g. a combined Sources view with all the different silo account types), legacy vs standard SQL (especially on views), and graphing multiple things on the same chart.

next idea: use bigquery to populate the current spreadsheet, and continue to use that to generate the graphs.

snarfed commented 6 years ago

worked! the spreadsheet now has updated data and graphs.

will document the new process soon. in the meantime, here's the monster bigquery query that populates the spreadsheet data:

WITH
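-- days: one row per calendar day, from 2013-12-01 through the newest Response's date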
days AS (SELECT * FROM UNNEST(
  GENERATE_DATE_ARRAY('2013-12-01', (SELECT MAX(CAST(created AS date)) FROM datastore.Response))) AS day),
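-- new accounts created per day, broken out by silo kind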
s as (SELECT cast(created as date) AS day, __key__.kind, count(*) as num FROM datastore.Sources GROUP BY day, kind),
bl as (SELECT * FROM s WHERE kind = 'Blogger'),
fb as (SELECT * FROM s WHERE kind = 'FacebookPage'),
fl as (SELECT * FROM s WHERE kind = 'Flickr'),
gp as (SELECT * FROM s WHERE kind = 'GooglePlusPage'),
ig as (SELECT * FROM s WHERE kind = 'Instagram'),
me as (SELECT * FROM s WHERE kind = 'Medium'),
tu as (SELECT * FROM s WHERE kind = 'Tumblr'),
tw as (SELECT * FROM s WHERE kind = 'Twitter'),
wp as (SELECT * FROM s WHERE kind = 'WordPress'),
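-- daily counts of each entity kind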
resp as (SELECT cast(created as date) AS day, count(*) as num FROM datastore.Response GROUP BY day),
bp as (SELECT cast(created as date) AS day, count(*) as num FROM datastore.BlogPost GROUP BY day),
pub as (SELECT cast(created as date) AS day, count(*) as num FROM datastore.Publish GROUP BY day),
bw as (SELECT cast(created as date) AS day, count(*) as num FROM datastore.BlogWebmention GROUP BY day),

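-- union of Response and BlogPost rows with their webmention result arrays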
resps_and_bps as (
  SELECT created, sent, unsent, error, failed, skipped FROM datastore.Response
  UNION ALL
  SELECT created, sent, unsent, error, failed, skipped FROM datastore.BlogPost
),
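-- per day: total rows, webmentions sent, and all links discovered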
wms as (
  SELECT cast(created as date) AS day, count(*) as num,
  SUM(ARRAY_LENGTH(sent)) AS sent,
  SUM(ARRAY_LENGTH(sent) + ARRAY_LENGTH(unsent) + ARRAY_LENGTH(error) + ARRAY_LENGTH(failed) + ARRAY_LENGTH(skipped)) AS links
  FROM resps_and_bps GROUP BY day),
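-- accounts per day by enabled feature (listen / publish / webmention)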
features as (
  SELECT cast(created as date) AS day, feat, count(*) as num
  FROM datastore.Sources src, src.features feat GROUP BY feat, day),
listen as (SELECT * FROM features WHERE feat = 'listen'),
publish as (SELECT * FROM features WHERE feat = 'publish'),
webmention as (SELECT * FROM features WHERE feat = 'webmention'),

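-- first day each target domain received a sent webmention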
domains as (
  SELECT NET.HOST(url) AS domain, MIN(CAST(rbs.created AS date)) AS day
  FROM resps_and_bps rbs, rbs.sent url GROUP BY domain ORDER BY day),
domain_counts as (SELECT day, COUNT(*) AS num FROM domains GROUP BY day)

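-- cumulative running totals per day, using the window defined at the bottom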
SELECT day,
  SUM(bl.num) OVER win AS Blogger,
  SUM(fb.num) OVER win AS FacebookPage,
  SUM(fl.num) OVER win AS Flickr,
  SUM(gp.num) OVER win AS GooglePlusPage,
  SUM(ig.num) OVER win AS Instagram,
  SUM(me.num) OVER win AS Medium,
  SUM(tu.num) OVER win AS Tumblr,
  SUM(tw.num) OVER win AS Twitter,
  SUM(wp.num) OVER win AS WordPress,
  SUM(resp.num) OVER win AS Response,
  SUM(bp.num) OVER win AS BlogPost,
  SUM(pub.num) OVER win AS Publish,
  SUM(bw.num) OVER win AS BlogWebmention,
  SUM(listen.num) OVER win AS f_listen,
  SUM(publish.num) OVER win AS f_publish,
  SUM(webmention.num) OVER win AS f_blog_webmention,
  SUM(wms.links) OVER win AS links,
  -- IFNULL so a day missing one kind doesn't null out the other
  SUM(IFNULL(wms.sent, 0) + IFNULL(bp.num, 0)) OVER win AS webmentions,
  SUM(domain_counts.num) OVER win AS domains

FROM days
LEFT JOIN bl USING(day)
LEFT JOIN fb USING(day)
LEFT JOIN fl USING(day)
LEFT JOIN gp USING(day)
LEFT JOIN ig USING(day)
LEFT JOIN me USING(day)
LEFT JOIN tu USING(day)
LEFT JOIN tw USING(day)
LEFT JOIN wp USING(day)
LEFT JOIN resp USING(day)
LEFT JOIN bp USING(day)
LEFT JOIN pub USING(day)
LEFT JOIN bw USING(day)
LEFT JOIN listen USING(day)
LEFT JOIN publish USING(day)
LEFT JOIN wms USING(day)
LEFT JOIN webmention USING(day)
LEFT JOIN domain_counts USING(day)

WINDOW win AS (ORDER BY day ROWS UNBOUNDED PRECEDING)
ORDER BY day
snarfed commented 6 years ago

done! details in the readme. used it for this new stats roundup.