jupyterlab / frontends-team-compass

A repository for team interaction, syncing, and handling meeting notes across the JupyterLab ecosystem.
https://jupyterlab-team-compass.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

JupyterLab telemetry extension #4

Open ellisonbg opened 5 years ago

ellisonbg commented 5 years ago

Background

For the past ~2 years we have had a number of conversations about building a telemetry system for JupyterLab. By a telemetry system, we mean a system for collecting data about what actions users are performing in JupyterLab, how they are taking those actions (mouse, keyboard shortcut, button), and when. Telemetry data serves a number of purposes for organizations deploying Jupyter: i) operational and security monitoring, ii) intrusion detection, iii) compliance auditing, and iv) a posteriori analysis of the platform's usage and design, i.e., as a UX research tool.

Tenets

There are certainly ethical and legal questions around telemetry systems. To address these, I propose the following tenets of the telemetry system:

Current status

@ian-r-rose has created an exploratory JupyterLab telemetry extension here:

https://github.com/ian-r-rose/jupyterlab-telemetry

This extension shows some of the major pieces required:

It may also make sense to build differential privacy into our telemetry collection code to protect users. This isn't appropriate for all deployments, but in many cases it would be extremely helpful.

Next steps

@Zsailer and @jaipreet-s have cycles in the coming months to work on the telemetry system. I propose that we create a new repo in the JupyterLab org for this work, possibly seeding it with Ian's previous work. We would love to see others contributing to this work as well–especially folks from organizations that 1) need to collect telemetry and 2) are experienced with doing so in a responsible and legal manner. I will also post on the discourse channel to let more people know about this.

Full disclosure - I work for AWS now, so it is useful to sync on how AWS thinks about user data. Customer data privacy and protection is a top priority - more details can be found here:

https://aws.amazon.com/compliance/data-privacy-faq/

yuvipanda commented 5 years ago

This is awesome!

On BinderHub, we have for the last 6 months been collecting telemetry on the repositories that have been launched, in a structured and privacy-conscious way. We publish these automatically at https://archive.analytics.mybinder.org/. The backend code in BinderHub that collects, validates, and emits this is written to be agnostic as to where the events go, and is independent of BinderHub itself. You can see it here: https://github.com/jupyterhub/binderhub/blob/master/binderhub/events.py. Events must conform to a defined and documented schema, which helps end users understand what is being collected, helps analysts understand exactly what data they are working with, and helps system operators redact or anonymize data where needed.
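To make the idea concrete, here is a minimal, hypothetical sketch of schema-conformant event emission in the spirit of the BinderHub approach. The schema registry, field names, and validation logic are all illustrative and much simpler than the real `events.py`; it only shows the shape of "validate against a registered schema, then write one JSON record per line".

```python
import json
import sys
from datetime import datetime, timezone

# Hypothetical schema registry: (schema name, version) -> required fields.
# Real implementations would use full JSON Schema documents.
SCHEMAS = {
    ("binderhub.launch", 1): {"required": {"provider", "spec", "status"}},
}

def emit(name, version, event, sink=sys.stdout):
    """Validate an event against its registered schema, then write it as one JSON line."""
    schema = SCHEMAS.get((name, version))
    if schema is None:
        raise ValueError(f"No schema registered for {name} v{version}")
    missing = schema["required"] - set(event)
    if missing:
        raise ValueError(f"Event missing required fields: {missing}")
    record = {
        "schema": name,
        "version": version,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **event,
    }
    sink.write(json.dumps(record) + "\n")
```

Because every event names its schema and version, a downstream consumer can validate, redact, or drop events per schema without knowing anything about the emitting application.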

https://github.com/jupyterhub/binderhub/pull/679 and the issues linked from there contain more information.

A lot of this is based on the privacy-conscious, transparent, and (IMO) well-designed Wikimedia telemetry system. You can read more about it at https://m.mediawiki.org/wiki/Extension:EventLogging/Guide. That page is highly recommended.

I would love to be involved! I also need this inside JupyterHub for pedagogical and billing purposes, so I'll probably be working on this soon anyway.

Zsailer commented 5 years ago

@yuvipanda Thanks for sharing all these great resources—this is awesome! It would be great to work on this stuff together.

ellisonbg commented 5 years ago

@yuvipanda thank you for all of the resources, will be wonderful to work with you on this stuff.

I think it is important to have a documented and user inspectable schema - great point.

There is also a broader story around telemetry for JupyterHub that this JupyterLab work will obviously play into - @Zsailer has funding to work on the JupyterHub side as well.

yuvipanda commented 5 years ago

https://www.imsglobal.org/activity/caliper is the standard that seems to integrate with LMSes like Canvas for student analytics. Berkeley is currently using it. It would also be interesting to see how we can integrate with it. Ideally, we'd have a translator that can receive logs from us, reshape them for Caliper, and send them off. This gives us a lot of flexibility in integrating with various standards as needed in various industries.

yuvipanda commented 5 years ago

@ellisonbg awesome! IMO, this is easier to start in JupyterHub first, since it's purely serverside. Doing this serverside (notebook) and clientside (JupyterLab) will be slightly more complicated, although it will give a lot more detail. What do you all think?

davclark commented 5 years ago

I've got a thread going on the Jupyter Discourse where this issue is included in a broader discussion about user research: https://discourse.jupyter.org/t/potential-collaboration-on-user-research/866/8

I want to keep that thread updated because even our own UX designer is more comfortable navigating discourse than GitHub - so I assume it'll keep things accessible for others as well.

But closer to the technical effort, is telemetry being discussed in regular Jupyter meetings? I'm just getting started orienting to Jupyter community practice - but please don't hesitate to point me to the things I should have already read.

I work for Gigantum - so I'm a biased voice, but I'll point out also that Gigantum keeps a record of all activity from the kernel perspective, so for things that involve executing cells, we have a good tool for inspecting that - you just need to get the Gigantum-managed Git repository from your user. If you want to know details, @dkleissa wrote a medium post about it. Please keep us in mind as a research tool, and we'd be happy to help make it work better for that purpose.

More broadly, we would like to contribute to the community, so if this ain't it, please be in touch about other approaches that could be useful for the Jupyter project.

ellisonbg commented 5 years ago

@yuvipanda I think that it will be helpful to work in parallel on both the telemetry data sources (JupyterLab, notebook server, etc.) and the collection APIs. I think we will need that to ensure that the schema both captures what needs to be collected from the sources and is friendly to the backend storage services.

@davclark telemetry isn't actively being discussed in any in person meetings. Things like this are challenges because they span multiple orgs/repos (lab, hub, notebook server, kernels, etc.). I like the idea of using discourse to coordinate and discuss. This particular issue was more meant to focus on the JupyterLab part of the picture. Maybe we should use discourse to organize the broader effort?

yuvipanda commented 5 years ago

I've started a PR with an initial implementation to discuss this on JupyterHub: https://github.com/jupyterhub/jupyterhub/pull/2542. Would love for people to take a look and engage there!

chaki commented 5 years ago

This is cool stuff - reminiscent of my (totally old) world of sensor networks... I tend to agree with @yuvipanda re: treating each JupyterLab/Hub instance like a sensor node -- it just keeps broadcasting / sending common events per the public schemas it supports, at regular intervals, whether anyone is listening or not. And let the data consumers (admins, product managers, Jupyter teams, and academic researchers) create sinks to gather the data streams of interest and store them.
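The "sensor node" model described above can be sketched as a tiny publish/subscribe bus: the instance emits unconditionally, and consumers register sinks for only the streams they care about. All class and schema names here are hypothetical.

```python
# Minimal illustration of the sensor-node model: emitters broadcast events
# per schema name; registered sinks receive them, and events with no
# listener are simply dropped.
class EventBus:
    def __init__(self):
        self._sinks = {}  # schema name -> list of sink callables

    def subscribe(self, schema_name, sink):
        self._sinks.setdefault(schema_name, []).append(sink)

    def emit(self, schema_name, event):
        # Broadcast regardless of whether anyone is listening.
        for sink in self._sinks.get(schema_name, []):
            sink(event)

bus = EventBus()
seen = []
bus.subscribe("jupyterlab.command", seen.append)
bus.emit("jupyterlab.command", {"command": "notebook:run-cell"})
bus.emit("jupyterhub.login", {"user": "anon-123"})  # no listener; dropped
```

The point of the decoupling is that the emitting code never changes when an admin, product manager, or researcher adds a new sink.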

Wondering About:

dleen commented 5 years ago

In terms of the single-user notebook server there are multiple benefits from having telemetry:

In terms of log format, it is important that the system be flexible. This is easily accomplished by decoupling the "reporter" - the class responsible for writing events to the log file. I imagine we can accomplish this in the usual manner when launching the server, e.g.:

jupyter notebook --NotebookApp.telemetry_reporter_class=mypackage.reporters.EventLoggingReporter

By being flexible here we allow easy integration with various monitoring backends like New Relic, Azure Log Analytics, AWS Cloudwatch etc.
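Resolving a dotted class path like the one in the command above could look something like the following. This is a minimal sketch under stated assumptions: the real notebook server would resolve the class through its traitlets configuration machinery, and `mypackage.reporters.EventLoggingReporter` is a hypothetical name.

```python
import importlib

# Illustrative resolution of a dotted path such as
# "mypackage.reporters.EventLoggingReporter" into a class object,
# the way a --NotebookApp.telemetry_reporter_class flag might be handled.
def load_reporter(dotted_path):
    module_path, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# reporter_cls = load_reporter("mypackage.reporters.EventLoggingReporter")
# reporter = reporter_cls()
```

Because only a string crosses the configuration boundary, a deployment can point the server at a New Relic, Cloudwatch, or file-based reporter without any code change in the server itself.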

Telemetry can be pretty invasive in the code base. The notebook server API events are emitted to the server log today, e.g. [I 12:34:56.789 NotebookApp] 302 GET / (10.0.0.116) 0.53ms, but parsing these can be cumbersome, and, as mentioned above, there is no accountability for what is being instrumented: [I 12:34:56.789 NotebookApp] Saving file at /medical_records/sensitive_patient_name.ipynb. We may want structured events, which would require something like:

# file: https://github.com/jupyter/notebook/blob/master/notebook/services/contents/handlers.py

class ContentsHandler(APIHandler):

    @metrics.latency       # response time
    @metrics.count         # track number of times this handler is called
    @metrics.availability  # log 500s vs 200s
    @web.authenticated
    @gen.coroutine
    def get(self, path=''):
        ...

Doing this obviously requires a large amount of changes across multiple packages. By default the metrics could be no-op, and it would be up to extension packages to actually provide an implementation, the annotations would just provide a hook. An alternative approach which would require no modifications would use monkey-patching to wrap a whitelisted set of methods. This is similar to how the New Relic python agent automatically instruments tornado: https://github.com/edmorley/newrelic-python-agent/blob/master/newrelic/newrelic/hooks/framework_tornado_r4/httpclient.py#L122
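The monkey-patching alternative can be sketched as wrapping a whitelisted set of methods with a metrics-recording closure, without touching their source. Everything here is hypothetical: the handler class is a stand-in for the real tornado handler, and the metrics store is a plain dict rather than a real backend.

```python
import functools
import time

# Illustrative metrics store; a real agent would report to a backend.
CALL_COUNTS = {}

def instrument(cls, method_names):
    """Wrap the named methods of cls with call counting and latency timing."""
    for name in method_names:
        original = getattr(cls, name)

        @functools.wraps(original)
        def wrapper(self, *args, _orig=original, _name=name, **kwargs):
            start = time.perf_counter()
            try:
                return _orig(self, *args, **kwargs)
            finally:
                CALL_COUNTS[_name] = CALL_COUNTS.get(_name, 0) + 1
                _ = time.perf_counter() - start  # latency; would be reported

        setattr(cls, name, wrapper)

# Stand-in for the real tornado handler being patched.
class ContentsHandler:
    def get(self, path=""):
        return f"contents of {path}"

instrument(ContentsHandler, ["get"])
```

The trade-off versus decorators in the source is the usual one: no changes across packages, but the instrumentation is invisible at the call site and fragile against upstream refactors, as the New Relic tornado hooks linked above illustrate.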

A few more thoughts:

I'm excited to see the responses so far on this issue and looking forward to collaborating!

jaipreet-s commented 5 years ago

All - great discussion on this thread!

I do think there is a need for telemetry data coming from the browser itself, especially for JupyterLab. A lot of web analytics libraries send clickstream data directly from the browser - Google Analytics, Azure, AWS Amplify. In JupyterLab, the telemetry framework can subscribe to platform events and send them down to the configured event sinks, along with an interface for reporting custom events. @dleen's code examples do a great job of illustrating what this would look like at a single Jupyter server.

I think there is a lot of value in taking this incrementally. At least for JupyterLab, we can get started by building off of Ian Rose's prototype and have a way to capture platform events and an interface for extensions to publish custom events. There are a few high-level design questions to answer, which can be done on the repo instead:

  1. Event reporting interface and schema
  2. Deployment configuration for Admins
  3. Customer opt-in / opt-out via Settings
  4. Mechanism for connecting one or more event sinks
  5. Event filtering, aggregation

ellisonbg commented 5 years ago

@ian-r-rose has transferred his initial JupyterLab telemetry repo over to the jupyterlab org:

https://github.com/jupyterlab/jupyterlab-telemetry

Let's continue the discussion there...

yuvipanda commented 5 years ago

@dleen we already collect RED (request, error, duration) metrics for all endpoints in notebook and JupyterHub. See https://github.com/jupyter/notebook/pull/3490 for more information. We use this heavily in many places. You can see the various useful visualizations at grafana.mybinder.org.

yuvipanda commented 5 years ago

I spent a bunch of time on this today, and here are some results.

Here is a prototype demo of eventlogging where we capture all commands executed in lab in a schema-conformant, type-safe(ish) way, and configurably log them (in this case) to a file. Alongside it is a ~2000-word strawman design document that I hope will help discussion.

Would love for the conversations to happen on those PRs! However, if this isn't the process y'all prefer for this, we can find some other way to do this.