kedro-org / kedro-devrel

Kedro developer relations team use this for content creation ideation and execution
Apache License 2.0

Blog post: The story of `kedro-telemetry` - from start to now #125

Open yetudada opened 5 months ago

yetudada commented 5 months ago

Introduction

I thought it'd be cool to share a detailed story of our kedro-telemetry journey – where we started, the challenges we faced, and how far we've come. It’s been quite a ride, and I think it’s essential for all of us to understand the backstory, especially as we keep improving this tool.

What were the early days?

Remember how nervous we were about starting telemetry? We all saw those threads on Reddit and Hacker News about other open-source projects getting heat for how they handled user data. Plus, being privacy nerds ourselves, we wanted to ensure we were doing right by our users. And, of course, there was the added pressure of Kedro being an enterprise-owned open-source project back then; we didn’t want any missteps to affect our reputation.

So we held serious brainstorming sessions with our InfoSec and Legal teams to make sure we were GDPR compliant. This was a challenge because we had to interpret the law ourselves and work out how it applied to us. The legal team that we work with now is in LF AI & Data.

GDPR will always apply to us because we have users in the EU ✅

What design and architectural decisions did we make?

  1. User consent and transparency: We emphasised an opt-in/opt-out mechanism for telemetry via the CLI and recorded the user's decision in a .telemetry file that was not committed to git. Users were asked to opt in to kedro-telemetry; if they said yes, the decision applied only to the project where .telemetry was present, and it covered the Kedro CLI, Kedro-Viz CLI and Kedro-Viz UI.
  2. Scope of data collection: We deliberately limited data to project, user, and feature statistics, avoiding personal user data. This included anonymising project and user data with hashing.
  3. Directional insights over exact figures: Given the opt-in nature, we aimed for broad trends rather than precise user data. We learned this from the Great Expectations product team, who also struggle to derive exact figures.
  4. Internal user identification: We developed a methodology for identifying internal users while respecting their autonomy in opting for telemetry. It used a hashmap of hashed usernames, which could identify internal users only, because internal usernames were the only ones we could hash ourselves. This methodology is inactive now - talk to @datajoely.
  5. Separation from Kedro Framework: To ensure users could remove telemetry without impacting their core experience with Kedro.
  6. Documentation: Allowing users to access detailed documentation on data collection by reading our data collection methodology.
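
To make the consent decision in point 1 concrete, here is a minimal sketch of how a per-project decision could be recorded and read back. Only the `.telemetry` file name comes from the write-up above; the `consent: true` line format and the function names `read_consent`, `write_consent` and `telemetry_enabled` are assumptions for illustration, not the plugin's actual code.

```python
from pathlib import Path
from typing import Optional

CONSENT_FILE = ".telemetry"  # file name from the write-up; the contents are assumed


def read_consent(project_dir: Path) -> Optional[bool]:
    """Return the recorded decision, or None if the user has not been asked yet."""
    path = project_dir / CONSENT_FILE
    if not path.is_file():
        return None
    for line in path.read_text().splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "consent":
            return value.strip().lower() == "true"
    return None


def write_consent(project_dir: Path, consent: bool) -> None:
    """Record the user's answer in the project directory (hypothetical format)."""
    (project_dir / CONSENT_FILE).write_text(f"consent: {str(consent).lower()}\n")


def telemetry_enabled(project_dir: Path) -> bool:
    """Opt-in semantics: only an explicit 'yes' enables collection."""
    return read_consent(project_dir) is True
```

Because a missing file is treated as "not asked yet" rather than "yes", nothing is collected until the user gives an explicit answer, and the answer stays scoped to the one project directory that holds the file.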

Opt-in/opt-out workflow of kedro-telemetry

[Image: Telemetry Workflow - Creating a new project]

What data do we collect?

kedro-telemetry has evolved to collect more data as we have had more questions about our users. It's easiest to see aspects of this as a table and describe additional collection points. When users opt in, kedro-telemetry collects project and user metadata, records usage of the Kedro Framework and Kedro-Viz CLI, and tracks all feature usage of the Kedro-Viz UI. Identifying metadata for the project (project name and package name) and the user (computer name and username) is hashed to keep it anonymous.

| Description | Example input | What we receive |
| --- | --- | --- |
| CLI command (masked arguments) | `kedro run --pipeline=ds --env=test` | `kedro run --pipeline ***** --env *****` |
| (Hashed) Package name | my-project | 1c7cd944c28cd888904f3efc2345198507... |
| (Hashed) Project name | my_project | a6392d359362dc9827cf8688c9d634520e... |
| (Hashed) Username | my_username | ec3759e2c570d302e65ea20a7d985... |
| kedro project version | 0.17.6 | 0.17.6 |
| kedro-telemetry version | 0.1.2 | 0.1.2 |
| Python version | 3.8.10 (default, Jun 2 2021, 10:49:15) | 3.8.10 (default, Jun 2 2021, 10:49:15) |
| Operating system used | darwin | darwin |
| Number of datasets | 7 | 7 |
| Number of pipelines | 2 | 2 |
| Number of nodes | 12 | 12 |
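
The masking and hashing rows in the table can be sketched roughly as follows. This is an illustration under stated assumptions, not the plugin's implementation: SHA-256 is an assumed hash function (the write-up does not name one), and `hash_identifier`, `mask_command` and the `known_words` allow-list are hypothetical names.

```python
import hashlib
from typing import Iterable, List, Set

MASK = "*****"


def hash_identifier(value: str) -> str:
    """One-way hash of an identifying string (SHA-256 is an assumption)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def mask_command(args: Iterable[str], known_words: Set[str]) -> List[str]:
    """Keep only command and option names on the allow-list; mask everything else."""
    masked: List[str] = []
    for arg in args:
        if arg.startswith("-"):
            # Split "--option=value" so the value can be masked separately.
            option, sep, _value = arg.partition("=")
            masked.append(option if option in known_words else MASK)
            if sep:
                masked.append(MASK)
        else:
            masked.append(arg if arg in known_words else MASK)
    return masked


known = {"kedro", "run", "--pipeline", "--env"}
print(" ".join(mask_command("kedro run --pipeline=ds --env=test".split(), known)))
# prints: kedro run --pipeline ***** --env *****
```

Working from an allow-list means any value the tool does not recognise as a command or option name is masked by default, so user-supplied pipeline or environment names never leave the machine in clear text.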

What was the original data collection strategy for kedro-telemetry?

Here's what the first version of kedro-telemetry proposed doing:

[Image: Telemetry Workflow - What data do we want to collect]

What analytics tools does kedro-telemetry integrate with?

To facilitate in-depth data analysis, kedro-telemetry employs Heap Analytics and Snowflake databases as data stores. This integration allows us to process complex datasets and glean valuable insights into how users interact with Kedro, influencing our development strategies.

yetudada commented 5 months ago

@idanov had a great point about reading more into GDPR to understand the design of kedro-telemetry.

There's one important thing though: we follow opt-in consent because of GDPR. Here are the differences: https://termly.io/resources/articles/opt-in-vs-opt-out/

And here's a nice table comparing both: https://seersco.com/articles/opt-in-vs-opt-out-consent/

Opt-in flow in the context of data collection means that the user has to explicitly give their consent before we start collecting any data.

Opt-out flow means that the user has the right to withdraw their consent at any time, but we might still start collecting data by default, even without their initial consent.

GDPR requires that users must be given the option to enable cookies out of their free will. Since there are various types of cookies serving different purposes, such as advertising cookies and analytics cookies, the user must have separate opt-in checkboxes for different cookie categories based on their purposes. In short, the GDPR requires consent to be opt-in.

GDPR defines consent as “freely given, specific, informed and unambiguous” given by a “clear affirmative action.” It is not acceptable to assign consent through the data subject’s silence or by supplying “pre-ticked boxes.”

stichbury commented 5 months ago

I'm really pleased to see this, thanks @yetudada for the writeup! I spotted a gap for some telemetry content last year (https://github.com/kedro-org/kedro-devrel/issues/59) and figured we could rank strongly for it if we write something, so this is ideal. I'm reading/reviewing today 👀

stichbury commented 5 months ago

Moved this into the kedro-devrel repo so I can use it to form the basis of a blog post.