jupyterhub / mybinder.org-deploy

Deployment config files for mybinder.org
https://mybinder-sre.readthedocs.io/en/latest/index.html

Publish raw data of repositories launched #789

Open yuvipanda opened 5 years ago

yuvipanda commented 5 years ago

With the progress made on #97, we are now close to publishing raw information on repositories launched. This contains the following information:

  1. Timestamp of the launch (possibly truncated to minute resolution)
  2. Provider of the repo launched (GitHub / GitLab / etc.)
  3. Repo name
  4. Commit hash / branch launched
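
For illustration, a single published record might look something like this (field names are hypothetical, not a final schema):

```json
{"timestamp": "2018-11-06T12:34:00Z", "provider": "GitHub", "repo": "owner/repo", "ref": "abc1234"}
```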

This lets us (and others!) build more dashboards and run analyses of repository usage on mybinder.org. Something like https://tools.wmflabs.org/pageviews/ sounds awesome :)

This doesn't include any information about our users - only about the repositories being launched. A possible privacy issue is that we might 'leak' a repo someone is using just for themselves. However, we only support public repos already, so IMO this is not a concern. Our docs already say as much.

This issue should track our work in making this info public. I'd also want to check that this works with what we'd like our privacy policy to be.

yuvipanda commented 5 years ago

/cc @minrk @willingc @betatim @choldgraf @jzf2101 what do you think?

betatim commented 5 years ago

I can't immediately think of anything we could leak about individual users, which is what I'd worry about. A repo being used on mybinder.org doesn't seem like information that needs protecting, unless it tells you something about an individual human.

yuvipanda commented 5 years ago

My plan now is to publish this every day as a JSON Lines file (one JSON object per line), with timestamps truncated to minute resolution (since that's the only bit of info that's tied to a user action in any form).
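
The truncation itself is trivial; a minimal sketch:

```python
from datetime import datetime, timezone

# Drop sub-minute precision before an event is published, so the public
# timestamp can't be correlated too precisely with a user's action.
def truncate_to_minute(ts: datetime) -> datetime:
    return ts.replace(second=0, microsecond=0)

ts = datetime(2018, 11, 6, 12, 34, 56, tzinfo=timezone.utc)
print(truncate_to_minute(ts).isoformat())  # 2018-11-06T12:34:00+00:00
```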

betatim commented 5 years ago

Sounds good to me!

What will the publishing workflow look like, and what do you think of (eventually) transitioning this to a live stream of events (with a limited history)? I was thinking of something a bit like the Twitter firehose. If we could combine daily digests and a live stream into one service, that would be neat.

yuvipanda commented 5 years ago

OK, I've done a bunch of work that lets us build images on demand in this repo and push them to GCR with chartpress. https://github.com/jupyterhub/mybinder.org-deploy/tree/master/images/events-archiver is the beginning of the script that'll do the archiving.

Next steps:

And see how that goes!

This image building infra should also be very useful for other things.

yuvipanda commented 5 years ago

I've got code that does this, but Stackdriver read limits are pretty low (1 read per second across the whole project). Instead, I've set up exports from Stackdriver to Cloud Storage (https://cloud.google.com/logging/docs/export/using_exported_logs#gcs-overview); the script can read from these, post-process them, and export the results as processed public files.
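
The rough shape of that script, as a sketch - the bucket name, prefix, and filter here are all made up, not the real configuration:

```python
import json
from google.cloud import storage

# Read Stackdriver log files exported to a GCS bucket (one LogEntry per
# line), keep only the launch events, and write them out as public JSONL.
client = storage.Client()
bucket = client.bucket("mybinder-exported-logs")  # hypothetical bucket

with open("events-2018-11-06.jsonl", "w") as out:
    for blob in bucket.list_blobs(prefix="2018/11/06/"):
        for line in blob.download_as_text().splitlines():
            entry = json.loads(line)
            payload = entry.get("jsonPayload", {})
            if payload.get("event") == "launch":  # illustrative filter
                out.write(json.dumps(payload) + "\n")
```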

yuvipanda commented 5 years ago

With a large number of PRs ending in https://github.com/jupyterhub/mybinder.org-deploy/pull/817, most of this is done! https://archive.analytics.staging.mybinder.org/ exists for staging, and shortly https://archive.analytics.mybinder.org/ will exist for prod!

Things left to do:

We shouldn't publicize this until these things are done, but they should all be done very soon.

betatim commented 5 years ago

I fetched a file and tried to open it with json.load(open("events-2018-11-06.jsonl")) - because "what is this jsonl thing? let's try and open it" - and it fails :-/

What is the trade-off between using jsonl and plain json with a set of [] around the whole file? Making it easy to open the files is going to be key if we want lots of people to build on them. This makes me think json.load and the pandas equivalent should "just work". If we stick with jsonl we should supply a snippet for how to read the files. Without guidance/googling, I am now thinking I will have to iterate over each line, call json.loads on it, and collect things into a list like that.
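
A minimal sketch of that loop, assuming a local copy of the file:

```python
import json

# JSONL: parse each line as its own JSON object and collect them.
with open("events-2018-11-06.jsonl") as f:
    events = [json.loads(line) for line in f]
```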

yuvipanda commented 5 years ago

Yep, lotta docs to be written. pandas.read_json does work with these files, I think! If you wanna read it in plain Python you do have to loop over every line, yeah.

The big advantage is that you can stream these files; you cannot do that with pure JSON files - you must read the entire thing into memory before you can do anything with them. IMO that's enough of an advantage to make it worth it. This is how JSON structured logging works everywhere, for example, and tools like jq work very well with it.
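
For example, you can process a file of any size one event at a time; a sketch (the URL is illustrative):

```python
import json
import requests

# Stream a JSONL file over HTTP without holding it all in memory.
url = "https://archive.analytics.mybinder.org/events-2018-11-06.jsonl"
with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:  # skip keep-alive blank lines
            event = json.loads(line)
            # ... process one event at a time ...
```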

See http://jsonlines.org/ for more info. Googling for 'json lines' also produces a lot of info.

I'll be writing a lot of documentation today.

yuvipanda commented 5 years ago

pandas.read_json(url, lines=True) does seem to have problems with the nesting, however. I'm gonna de-nest the structure.
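
For what it's worth, the nesting can also be flattened client-side; a sketch (json_normalize is top-level in newer pandas, pandas.io.json.json_normalize in older versions):

```python
import json
import pandas as pd

# Flatten nested dicts into dotted column names (e.g. "repo.provider").
with open("events-2018-11-06.jsonl") as f:
    records = [json.loads(line) for line in f]
df = pd.json_normalize(records)
```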

choldgraf commented 5 years ago

@yuvipanda wanna hack together on a documentation PR sometime this week?

betatim commented 5 years ago

The streaming point seems like a good one, and it means you could concatenate lots of days into one file easily. Maybe we put pandas.read_json(..., lines=True) on the index page as a pointer? That would have made me find/use it.
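
Something like this on the index page would do it (file name illustrative; check the index for the actual listing):

```python
import pandas as pd

# Load one day's events straight from the archive into a DataFrame.
url = "https://archive.analytics.mybinder.org/events-2018-11-06.jsonl"
df = pd.read_json(url, lines=True)
```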

yuvipanda commented 5 years ago

@betatim yep, we should have code samples in at least Python and JS.

@choldgraf sure! I think I can write one up later today, and we can iterate from there.

yuvipanda commented 5 years ago

Documentation now at https://mybinder-sre.readthedocs.io/en/latest/analytics/events-archive.html. This is linked to from https://archive.analytics.mybinder.org/.

Instead of using Piwik, I'm going to get Stackdriver to send the logs from the nginx proxy serving https://archive.analytics.mybinder.org/ to GCS for storage. This lets us get better metrics on how people are fetching and using this data.

EXCITING! Now that the data engineering is all complete, we 'just' need someone to do something cool with this data.

yuvipanda commented 5 years ago

There is now a Stackdriver sink in binder-prod called events-archive-access-logs that archives nginx logs from archive.analytics.mybinder.org to a GCS bucket named mybinder-events-archive-access-logs. We can use this later to do analytics on our analytics.