yetudada opened 6 months ago
Status of personal data collection and consent in adjacent products:

| Project name | Tracks personal data | Uses opt-in consent | Opt-out mechanism | Telemetry collection mechanism is an optional dependency | Tracks individual users | Documentation | Comments |
|---|---|---|---|---|---|---|---|
| Prefect | No :x: | No :x: | Environment variable | No :-1: | No :-1: they have a `session_id` instead | https://docs.prefect.io/latest/api-ref/prefect/settings/?h=prefect_server_analytics_enabled#prefect.settings.PREFECT_SERVER_ANALYTICS_ENABLED | `PREFECT_SERVER_ANALYTICS_ENABLED` is `True` by default |
| Great Expectations | No :x: | No :x: | Project settings + Global settings + Environment variable | No :-1: | Yes :+1: they write an `oss_id` to `~/.great_expectations/great_expectations.conf` | https://docs.greatexpectations.io/docs/reference/learn/usage_statistics/ | Full schemas in https://github.com/great-expectations/great_expectations/blob/develop/great_expectations/core/usage_statistics/schemas.py |
| DVC | No :x: | No :x: | Project settings + Global settings + Environment variable | No :-1: | Yes :+1: they store a `user_id` in `~/.config/iterative/telemetry` using pypi/iterative-telemetry | https://dvc.org/doc/user-guide/analytics#anonymized-usage-analytics | "This does not allow us to track individual users"; uses https://pypi.org/project/iterative-telemetry/ |
| Evidently | No :x: | No :x: | Environment variable | No :-1: | Yes :+1: they store a `user_id` in `~/.config/evidentlyai/telemetry` using pypi/iterative-telemetry | https://docs.evidentlyai.com/support/telemetry | "We only collect anonymous usage data. We DO NOT collect personal data"; uses https://pypi.org/project/iterative-telemetry/ |
| Homebrew | No :x: | No :x: | Global settings + Environment variable | No :-1: | No :-1: they used to, by storing a UUID in a user-wide git-like config file, but removed user tracking a year ago | https://docs.brew.sh/Analytics | All stats are public: https://formulae.brew.sh/analytics/ |
| LangChain | Unclear :question: | No :question: | Undocumented | Not applicable 🚫 | ? | No docs, only a mention in https://blog.langchain.dev/langchain-state-of-ai-2023/ | LangSmith is a commercial platform, not an open source component |
| Reflex | No :x: | No :x: | Project settings | No :-1: | Yes :+1: they generate an `installation_id` in `~/.local/share/reflex` | https://reflex.dev/docs/getting-started/configuration/#anonymous-usage-statistics | |
| Streamlit (OSS) | No :x: | No :x: | Project settings | No :-1: | Unclear ❓ they use front-end (rather than back-end) analytics powered by the Segment SDK (Analytics.js) | https://docs.streamlit.io/library/advanced-features/configuration#telemetry | The Privacy Notice covers both the open source library ("the Software") and Streamlit Cloud ("the Service"), and the latter does collect personal data: https://streamlit.io/privacy-policy |
| dbt | No :x: | No :x: | Project settings + Environment variable | No :-1: | Unclear ❓ they have the concept of `active_user`, but it looks like the open source code is not setting it | https://docs.getdbt.com/reference/global-configs/usage-stats | |
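Several of the projects above expose an environment-variable opt-out (e.g. Prefect, DVC, Homebrew). A minimal sketch of how such a check is commonly implemented, using a hypothetical `MYTOOL_TELEMETRY_ENABLED` variable (each project uses its own name, such as Prefect's `PREFECT_SERVER_ANALYTICS_ENABLED`):

```python
import os


def telemetry_enabled(env_var: str = "MYTOOL_TELEMETRY_ENABLED") -> bool:
    """Return False when the user has opted out via an environment variable.

    `MYTOOL_TELEMETRY_ENABLED` is a hypothetical name for illustration.
    An unset variable defaults to enabled, mirroring the opt-out (rather
    than opt-in) model used by the projects in the table above.
    """
    value = os.environ.get(env_var)
    if value is None:
        return True  # opt-out model: enabled unless explicitly disabled
    # Treat common "falsy" spellings as an opt-out.
    return value.strip().lower() not in {"0", "false", "no", "off"}
```

Note the asymmetry this creates: telemetry runs by default, which is exactly why the "Uses opt-in consent" column above is "No" for all of these projects.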
Added Streamlit OSS (also does not collect personal data), thanks @Joseph-Perkins!
Added dbt
Pending: Add another column that shows whether the systems track individual users or not
Done 👍🏽
There are a couple of things in this issue. On one hand, we compiled a list of similar libraries as a reference for how other projects do telemetry, and we also asked for legal advice. That is already done: https://github.com/kedro-org/kedro-plugins/issues/510#issuecomment-1885237973
On the other hand, there's the list of use cases @yetudada created in https://github.com/kedro-org/kedro-plugins/issues/510#issue-2070700422. Before getting to those, we want to simplify our data collection process (https://github.com/kedro-org/kedro-plugins/issues/375), for which we want to address #333 (done) and #507 (in progress).
For now this issue is blocked; for clarity, I'm removing it from the current sprint and focusing on #507.
Regardless, it's a good moment to make a release of kedro-telemetry, cc @merelcht
## Introduction

Analytics play a critical role in product management. As Marty Cagan highlights, analytics are essential for understanding user behaviour, measuring product progress, validating product ideas, informing decisions, and inspiring product work. In the context of Kedro, we have telemetry tools that help us qualitatively understand our users, namely:

- `kedro-telemetry`, which gives insight into the feature usage and user adoption of the CLI in Kedro Framework and the CLI and UI of Kedro-Viz

`kedro-telemetry` is the focus of this GitHub issue.

## What principles should we adopt to govern the improvements of `kedro-telemetry`?

With all of these potential changes to `kedro-telemetry`, I thought it would be helpful to ground our work in certain principles that affect our users and our team. Therefore, I propose we adopt the following principles when improving `kedro-telemetry`:

- The metrics collected by `kedro-telemetry` are reliable and accurate. Team members should have full confidence in the data they're using to make decisions.
- Users are fully aware of `kedro-telemetry`, including its activation process, ensuring informed consent and understanding.
- Design `kedro-telemetry` to provide insights that are directly applicable to product improvement strategies.

## How was `kedro-telemetry` designed?

We have detailed some of the ways that `kedro-telemetry` was designed in a separate GitHub issue (#506).

## What are the current challenges with its implementation?
There is room for improvement in the current implementation of `kedro-telemetry`. I've tried to capture all known issues here, but let me know if I'm missing any and I'll update the details here.

- Ensure `kedro-telemetry` does not interrupt the CI/CD workflow; right now users have to check the documentation to find out when `kedro-telemetry` will interrupt their workflow
- Make `kedro-telemetry` a mandatory dependency, meaning that users will have `kedro-telemetry` packaged in Kedro and it will no longer be part of the requirements of the starters
- Check that `kedro-telemetry` works with Databricks
- Explain `kedro-telemetry` to our users
- We collect `package_name` and `project_name`; investigate why `project_name` is a blank field in our data
- The number of times the `kedro viz` CLI command runs differs from the number of users of Kedro-Viz according to Heap Analytics

## What else could we learn from our users?
I'll always be forward-looking on how we could continue to learn more about our users and even improve our existing metrics. I'd like to use a key to detail the status of each metric.

Status of metric:

- `kedro-telemetry` collects and hashes the computer's username upon user consent for user ID generation and counting.
- Compare PyPI downloads with `kedro-telemetry` user data, i.e. if `kedro-telemetry` user data declines but PyPI downloads increase then that might be a sign.
- `kedro-telemetry` hashes `package_name` and `project_name` for project ID generation and counting.
- `kedro-datasets`
- If `kedro-telemetry` is active, a hook counts this figure.
- If `kedro-telemetry` is active, a hook counts this figure.
- If `kedro-telemetry` is active, a hook counts this figure.
- `kedro-telemetry` versions
- If `kedro-telemetry` is active, a hook reads this figure from their project.
- If `kedro-telemetry` is active, a hook reads this figure from their project.
- If `kedro-telemetry` is active, a hook reads this figure from their project.
- If `kedro-telemetry` is active, a hook counts this figure.

## What are other projects that we can be inspired by?
I'm just going to list them and not detail what they're about and what we could learn:

- `telemetry-python`
- `iterative-telemetry` (by DVC)
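On the "user ID generation" metric above: hashing the computer's username gives a stable, pseudonymous identifier without transmitting the raw name. A minimal sketch of that pattern follows; this is an illustration only, not `kedro-telemetry`'s actual implementation (its hash choice, salting, and consent handling may differ), and `hashed_user_id` is a hypothetical name.

```python
import getpass
import hashlib


def hashed_user_id() -> str:
    """Derive a pseudonymous user ID by hashing the OS username.

    Hypothetical sketch of the "hash the computer's username" approach
    described above; consent must be checked before this is ever called.
    """
    try:
        username = getpass.getuser()
    except (KeyError, OSError, ImportError):
        # No username available in this environment (e.g. stripped-down CI).
        username = "unknown"
    # A one-way hash maps the same machine to the same ID on every run,
    # allowing user counting without sending the raw username anywhere.
    return hashlib.sha512(username.encode("utf-8")).hexdigest()
```

The same idea extends to the "project ID" metric, where `package_name` and `project_name` are hashed instead of the username.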