jupyterlab / frontends-team-compass

A repository for team interaction, syncing, and handling meeting notes across the JupyterLab ecosystem.
https://jupyterlab-team-compass.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

JupyterLab telemetry extension #4

Open ellisonbg opened 5 years ago

ellisonbg commented 5 years ago

Background

For the past ~2 years we have had a number of conversations about building a telemetry system for JupyterLab. By a telemetry system, we mean a system for collecting data about what actions users are performing in JupyterLab, how they are taking those actions (mouse, keyboard shortcut, button), and when. Telemetry data serves a number of purposes for organizations deploying Jupyter: i) operational and security monitoring, ii) intrusion detection, iii) compliance auditing, and iv) a posteriori analysis of the platform's usage and design, i.e., as a UX research tool.

Tenets

There are certainly ethical and legal questions around telemetry systems. To address these, I propose the following tenets of the telemetry system:

Current status

@ian-r-rose has created an exploratory JupyterLab telemetry extension here:

https://github.com/ian-r-rose/jupyterlab-telemetry

This extension shows some of the major pieces required:

It may also make sense to build differential privacy into our telemetry collection code to protect users. This isn't appropriate for all deployments, but in many cases it would be extremely helpful.

Next steps

@Zsailer and @jaipreet-s have cycles in the coming months to work on the telemetry system. I propose that we create a new repo in the JupyterLab org for this work, possibly seeding it with Ian's previous work. We would love to see others contributing to this work as well–especially folks from organizations that 1) need to collect telemetry and 2) are experienced with doing so in a responsible and legal manner. I will also post on the discourse channel to let more people know about this.

Full disclosure - I work for AWS now, so it is useful to sync on how AWS thinks about user data. Customer data privacy and protection is a top priority - more details can be found here:

https://aws.amazon.com/compliance/data-privacy-faq/

yuvipanda commented 5 years ago

This is awesome!

On BinderHub, we have for the last 6 months been collecting telemetry on the repositories that have been launched, in a structured and privacy-conscious way. We publish these automatically at https://archive.analytics.mybinder.org/. The backend code in BinderHub that collects, validates, and emits this is written to be agnostic as to where the events go, and is independent of BinderHub itself. You can see it here: https://github.com/jupyterhub/binderhub/blob/master/binderhub/events.py. Events must conform to a defined and documented schema, which helps end users understand what is being collected, helps analysts understand exactly what data they are working with, and helps system operators redact or anonymize data where needed.
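To make the idea concrete, here is a minimal, hypothetical sketch of schema-conformant event emission in the spirit of the BinderHub approach. The schema registry, field names, and validation logic are all illustrative and much simpler than the real `events.py`; it only shows the shape of "validate against a registered schema, then write one JSON record per line".

```python
import json
import sys
from datetime import datetime, timezone

# Hypothetical schema registry: (schema name, version) -> required fields.
# Real implementations would use full JSON Schema documents.
SCHEMAS = {
    ("binderhub.launch", 1): {"required": {"provider", "spec", "status"}},
}

def emit(name, version, event, sink=sys.stdout):
    """Validate an event against its registered schema, then write it as one JSON line."""
    schema = SCHEMAS.get((name, version))
    if schema is None:
        raise ValueError(f"No schema registered for {name} v{version}")
    missing = schema["required"] - set(event)
    if missing:
        raise ValueError(f"Event missing required fields: {missing}")
    record = {
        "schema": name,
        "version": version,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **event,
    }
    sink.write(json.dumps(record) + "\n")
```

Because every event names its schema and version, a downstream consumer can validate, redact, or drop events per schema without knowing anything about the emitting application.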

https://github.com/jupyterhub/binderhub/pull/679 and the issues linked from there contain more information.

A lot of this is based on the privacy-conscious, transparent, and (IMO) well-designed Wikimedia telemetry system. You can read more about it at https://m.mediawiki.org/wiki/Extension:EventLogging/Guide. That page is highly recommended.

I would love to be involved! I also need this inside JupyterHub for pedagogical and billing purposes, so I'll probably be working on this soon anyway.

Zsailer commented 5 years ago

@yuvipanda Thanks for sharing all these great resources—this is awesome! It would be great to work on this stuff together.

ellisonbg commented 5 years ago

@yuvipanda thank you for all of the resources, will be wonderful to work with you on this stuff.

I think it is important to have a documented and user inspectable schema - great point.

There is also a broader story around telemetry for JupyterHub that this JupyterLab work will obviously play into - @Zsailer has funding to work on the JupyterHub side as well.

yuvipanda commented 5 years ago

https://www.imsglobal.org/activity/caliper is the standard that seems to integrate with LMSes like Canvas for student analytics. Berkeley is currently using it. It would also be interesting to see how we can integrate with it. Ideally, we'd have a translator that can receive logs from us, reshape them for Caliper, and send them off. This gives us a lot of flexibility in integrating with various standards as needed in various industries.

yuvipanda commented 5 years ago

@ellisonbg awesome! IMO, this is easier to start in JupyterHub first, since it's purely serverside. Doing this serverside (notebook) and clientside (JupyterLab) will be slightly more complicated, although it will give a lot more detail. What do you all think?

davclark commented 5 years ago

I've got a thread going on the Jupyter Discourse where this issue is included in a broader discussion about user research: https://discourse.jupyter.org/t/potential-collaboration-on-user-research/866/8

I want to keep that thread updated because even our own UX designer is more comfortable navigating discourse than GitHub - so I assume it'll keep things accessible for others as well.

But closer to the technical effort, is telemetry being discussed in regular Jupyter meetings? I'm just getting started orienting to Jupyter community practice - but please don't hesitate to point me to the things I should have already read.

I work for Gigantum - so I'm a biased voice, but I'll point out also that Gigantum keeps a record of all activity from the kernel perspective, so for things that involve executing cells, we have a good tool for inspecting that - you just need to get the Gigantum-managed Git repository from your user. If you want to know details, @dkleissa wrote a medium post about it. Please keep us in mind as a research tool, and we'd be happy to help make it work better for that purpose.

More broadly, we would like to contribute to the community, so if this ain't it, please be in touch about other approaches that could be useful for the Jupyter project.

ellisonbg commented 5 years ago

@yuvipanda I think that it will be helpful to work in parallel on both the telemetry data sources (JupyterLab, notebook server, etc.) and the collection APIs. I think we will need that to ensure that the schema both captures what needs to be collected from the sources and is friendly to the backend storage services.

@davclark telemetry isn't actively being discussed in any in person meetings. Things like this are challenges because they span multiple orgs/repos (lab, hub, notebook server, kernels, etc.). I like the idea of using discourse to coordinate and discuss. This particular issue was more meant to focus on the JupyterLab part of the picture. Maybe we should use discourse to organize the broader effort?

yuvipanda commented 5 years ago

I've started a PR with an initial implementation to discuss this on JupyterHub: https://github.com/jupyterhub/jupyterhub/pull/2542. Would love for people to take a look and engage there!

chaki commented 5 years ago

This is cool stuff - reminiscent of my (totally old) world of sensor networks... I tend to agree with @yuvipanda re: treating each JupyterLab/Hub instance like a sensor node -- it just keeps broadcasting / sending common events per the public schemas it supports, at regular intervals, whether anyone is listening or not. And let the data consumers (admins, product managers, Jupyter teams, and academic researchers) create sinks to gather the data streams of interest and store them.
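The "sensor node" model described above can be sketched as a tiny publish/subscribe bus: the instance emits unconditionally, and consumers register sinks for only the streams they care about. All class and schema names here are hypothetical.

```python
# Minimal illustration of the sensor-node model: emitters broadcast events
# per schema name; registered sinks receive them, and events with no
# listener are simply dropped.
class EventBus:
    def __init__(self):
        self._sinks = {}  # schema name -> list of sink callables

    def subscribe(self, schema_name, sink):
        self._sinks.setdefault(schema_name, []).append(sink)

    def emit(self, schema_name, event):
        # Broadcast regardless of whether anyone is listening.
        for sink in self._sinks.get(schema_name, []):
            sink(event)

bus = EventBus()
seen = []
bus.subscribe("jupyterlab.command", seen.append)
bus.emit("jupyterlab.command", {"command": "notebook:run-cell"})
bus.emit("jupyterhub.login", {"user": "anon-123"})  # no listener; dropped
```

The point of the decoupling is that the emitting code never changes when an admin, product manager, or researcher adds a new sink.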

Wondering About:

dleen commented 5 years ago

In terms of the single-user notebook server there are multiple benefits from having telemetry:

In terms of log format, it is important that the system be flexible. This is easily accomplished by decoupling the "reporter" - the class responsible for writing events to the log file. I imagine we can accomplish this in the usual manner when launching the server, e.g.:

jupyter notebook --NotebookApp.telemetry_reporter_class=mypackage.reporters.EventLoggingReporter

By being flexible here we allow easy integration with various monitoring backends like New Relic, Azure Log Analytics, AWS Cloudwatch etc.
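Resolving a dotted class path like the one in the command above could look something like the following. This is a minimal sketch under stated assumptions: the real notebook server would resolve the class through its traitlets configuration machinery, and `mypackage.reporters.EventLoggingReporter` is a hypothetical name.

```python
import importlib

# Illustrative resolution of a dotted path such as
# "mypackage.reporters.EventLoggingReporter" into a class object,
# the way a --NotebookApp.telemetry_reporter_class flag might be handled.
def load_reporter(dotted_path):
    module_path, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# reporter_cls = load_reporter("mypackage.reporters.EventLoggingReporter")
# reporter = reporter_cls()
```

Because only a string crosses the configuration boundary, a deployment can point the server at a New Relic, Cloudwatch, or file-based reporter without any code change in the server itself.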

Telemetry can be pretty invasive in the code base. The notebook server API events are emitted to the server log today, e.g. [I 12:34:56.789 NotebookApp] 302 GET / (10.0.0.116) 0.53ms, but parsing these can be cumbersome, and, as mentioned above, there is no accountability for what is being instrumented: [I 12:34:56.789 NotebookApp] Saving file at /medical_records/sensitive_patient_name.ipynb. We may want structured events, which would require something like:

# file: https://github.com/jupyter/notebook/blob/master/notebook/services/contents/handlers.py

class ContentsHandler(APIHandler):

    @metrics.latency       # response time
    @metrics.count         # track number of times this handler is called
    @metrics.availability  # log 500s vs 200s
    @web.authenticated
    @gen.coroutine
    def get(self, path=''):
        ...

Doing this obviously requires a large amount of changes across multiple packages. By default the metrics could be no-op, and it would be up to extension packages to actually provide an implementation, the annotations would just provide a hook. An alternative approach which would require no modifications would use monkey-patching to wrap a whitelisted set of methods. This is similar to how the New Relic python agent automatically instruments tornado: https://github.com/edmorley/newrelic-python-agent/blob/master/newrelic/newrelic/hooks/framework_tornado_r4/httpclient.py#L122
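The monkey-patching alternative can be sketched as wrapping a whitelisted set of methods with a metrics-recording closure, without touching their source. Everything here is hypothetical: the handler class is a stand-in for the real tornado handler, and the metrics store is a plain dict rather than a real backend.

```python
import functools
import time

# Illustrative metrics store; a real agent would report to a backend.
CALL_COUNTS = {}

def instrument(cls, method_names):
    """Wrap the named methods of cls with call counting and latency timing."""
    for name in method_names:
        original = getattr(cls, name)

        @functools.wraps(original)
        def wrapper(self, *args, _orig=original, _name=name, **kwargs):
            start = time.perf_counter()
            try:
                return _orig(self, *args, **kwargs)
            finally:
                CALL_COUNTS[_name] = CALL_COUNTS.get(_name, 0) + 1
                _ = time.perf_counter() - start  # latency; would be reported

        setattr(cls, name, wrapper)

# Stand-in for the real tornado handler being patched.
class ContentsHandler:
    def get(self, path=""):
        return f"contents of {path}"

instrument(ContentsHandler, ["get"])
```

The trade-off versus decorators in the source is the usual one: no changes across packages, but the instrumentation is invisible at the call site and fragile against upstream refactors, as the New Relic tornado hooks linked above illustrate.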

A few more thoughts:

I'm excited to see the responses so far on this issue and looking forward to collaborating!

jaipreet-s commented 5 years ago

All - great discussion on this thread!

I do think there is a need for telemetry data coming from the browser itself, especially for JupyterLab. A lot of web analytics libraries send clickstream data directly from the browser - Google Analytics, Azure, AWS Amplify. In JupyterLab, the telemetry framework can subscribe to platform events and send them down to the configured event sinks, along with an interface for reporting custom events. @dleen's code examples do a great job of illustrating what this would look like at a single Jupyter server.

I think there is a lot of value in taking this incrementally. At least for JupyterLab, we can get started by building off of Ian Rose's prototype and have a way to capture platform events and an interface for extensions to publish custom events. There are a few high-level design questions to answer, which can be done on the repo instead:

  1. Event reporting interface and schema
  2. Deployment configuration for Admins
  3. Customer opt-in / opt-out via Settings
  4. Mechanism for connecting one or more event sinks
  5. Event filtering, aggregation

ellisonbg commented 5 years ago

@ian-r-rose has transferred his initial JupyterLab telemetry repo over to the jupyterlab org:

https://github.com/jupyterlab/jupyterlab-telemetry

Let's continue the discussion there...

yuvipanda commented 5 years ago

@dleen we already collect RED (request, error, duration) metrics for all endpoints in notebook and JupyterHub. See https://github.com/jupyter/notebook/pull/3490 for more information. We use this heavily in many places. You can see the various useful visualizations at grafana.mybinder.org.

yuvipanda commented 5 years ago

I spent a bunch of time on this today, and here are some results.

Here is a prototype demo of eventlogging where we capture all commands executed in lab in a schema-conformant, type-safe(ish) way, and configurably log them (in this case) to a file. Alongside it is a ~2000-word strawman design document that I hope will help discussion.

Would love for the conversations to happen on those PRs! However, if this isn't the process y'all prefer for this, we can find some other way to do this.