yetudada opened 6 months ago
Status of personal data collection and consent in adjacent products:

| Project name | Tracks personal data | Uses opt-in consent | Opt-out mechanism | Telemetry collection mechanism is an optional dependency | Tracks individual users | Documentation | Comments |
|---|---|---|---|---|---|---|---|
| Prefect | No :x: | No :x: | Environment variable | No :-1: | No :-1: they have a `session_id` instead | https://docs.prefect.io/latest/api-ref/prefect/settings/?h=prefect_server_analytics_enabled#prefect.settings.PREFECT_SERVER_ANALYTICS_ENABLED | `PREFECT_SERVER_ANALYTICS_ENABLED` is `True` by default |
| Great Expectations | No :x: | No :x: | Project settings + Global settings + Environment variable | No :-1: | Yes :+1: they write an `oss_id` to `~/.great_expectations/great_expectations.conf` | https://docs.greatexpectations.io/docs/reference/learn/usage_statistics/ | Full schemas in https://github.com/great-expectations/great_expectations/blob/develop/great_expectations/core/usage_statistics/schemas.py |
| DVC | No :x: | No :x: | Project settings + Global settings + Environment variable | No :-1: | Yes :+1: they store a `user_id` in `~/.config/iterative/telemetry` using pypi/iterative-telemetry | https://dvc.org/doc/user-guide/analytics#anonymized-usage-analytics | "This does not allow us to track individual users"; uses https://pypi.org/project/iterative-telemetry/ |
| Evidently | No :x: | No :x: | Environment variable | No :-1: | Yes :+1: they store a `user_id` in `~/.config/evidentlyai/telemetry` using pypi/iterative-telemetry | https://docs.evidentlyai.com/support/telemetry | "We only collect anonymous usage data. We DO NOT collect personal data"; uses https://pypi.org/project/iterative-telemetry/ |
| Homebrew | No :x: | No :x: | Global settings + Environment variable | No :-1: | No :-1: they used to, by storing a UUID in a user-wide git-like config file, but removed user tracking a year ago | https://docs.brew.sh/Analytics | All stats are public: https://formulae.brew.sh/analytics/ |
| LangChain | Unclear :question: | No :question: | Undocumented | Not applicable 🚫 | ? | No docs, only a mention in https://blog.langchain.dev/langchain-state-of-ai-2023/ | LangSmith is a commercial platform, not an open source component |
| Reflex | No :x: | No :x: | Project settings | No :-1: | Yes :+1: they generate an `installation_id` in `~/.local/share/reflex` | https://reflex.dev/docs/getting-started/configuration/#anonymous-usage-statistics | |
| Streamlit (OSS) | No :x: | No :x: | Project settings | No :-1: | Unclear ❓ they use front-end (rather than back-end) analytics powered by the Segment SDK (Analytics.js) | https://docs.streamlit.io/library/advanced-features/configuration#telemetry | The Privacy Notice covers both the open source library ("the Software") and Streamlit Cloud ("the Service"), and the latter does collect personal data: https://streamlit.io/privacy-policy |
| dbt | No :x: | No :x: | Project settings + Environment variable | No :-1: | Unclear ❓ they have the concept of `active_user`, but it looks like the open source code is not setting it | https://docs.getdbt.com/reference/global-configs/usage-stats | |
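Several of the projects above expose an environment-variable opt-out (e.g. Prefect, DVC, Homebrew). A minimal sketch of how such a check is commonly implemented, using a hypothetical `MYTOOL_TELEMETRY_ENABLED` variable (each project uses its own name, such as Prefect's `PREFECT_SERVER_ANALYTICS_ENABLED`):

```python
import os


def telemetry_enabled(env_var: str = "MYTOOL_TELEMETRY_ENABLED") -> bool:
    """Return False when the user has opted out via an environment variable.

    `MYTOOL_TELEMETRY_ENABLED` is a hypothetical name for illustration.
    An unset variable defaults to enabled, mirroring the opt-out (rather
    than opt-in) model used by the projects in the table above.
    """
    value = os.environ.get(env_var)
    if value is None:
        return True  # opt-out model: enabled unless explicitly disabled
    # Treat common "falsy" spellings as an opt-out.
    return value.strip().lower() not in {"0", "false", "no", "off"}
```

Note the asymmetry this creates: telemetry runs by default, which is exactly why the "Uses opt-in consent" column above is "No" for all of these projects.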
Added Streamlit OSS (also does not collect personal data), thanks @Joseph-Perkins!
Added dbt
Pending: Add another column that shows whether the systems track individual users or not
Done 👍🏽
There are a couple of things in this issue. On one hand, we compiled a list of similar libraries as a reference for how other projects do telemetry, and we also asked for legal advice. That is already done: https://github.com/kedro-org/kedro-plugins/issues/510#issuecomment-1885237973
On the other hand, there's the list of use cases @yetudada created in https://github.com/kedro-org/kedro-plugins/issues/510#issue-2070700422. Before getting to those, we want to simplify our data collection process (https://github.com/kedro-org/kedro-plugins/issues/375), for which we want to address #333 (done) and #507 (in progress).
For now this issue is blocked; for clarity, I'm removing it from the current sprint and focusing on #507.
Regardless, it's a good moment to make a release of kedro-telemetry, cc @merelcht
## Introduction

Analytics play a critical role in product management. As Marty Cagan highlights, analytics are essential for understanding user behaviour, measuring product progress, validating product ideas, informing decisions, and inspiring product work. In the context of Kedro, we have telemetry tools that help us qualitatively understand our users, namely:

- `kedro-telemetry`, which gives insight into the feature usage and user adoption of the CLI in Kedro Framework and the CLI and UI of Kedro-Viz

`kedro-telemetry` is the focus of this GitHub issue.

## What principles should we adopt to govern the improvements of `kedro-telemetry`?

With all of these potential changes to `kedro-telemetry`, I thought it would be helpful to ground our work in certain principles that affect our users and our team. Therefore, I propose we adopt the following principles when improving `kedro-telemetry`:

- The metrics collected by `kedro-telemetry` are reliable and accurate. Team members should have full confidence in the data they're using to make decisions.
- Users are fully aware of `kedro-telemetry`, including its activation process, ensuring informed consent and understanding.
- Design `kedro-telemetry` to provide insights that are directly applicable to product improvement strategies.

## How was `kedro-telemetry` designed?

We have detailed some of the ways that `kedro-telemetry` was designed in a separate GitHub issue (#506).

## What are the current challenges with its implementation?
There is room for improvement in the current implementation of `kedro-telemetry`. I've tried to capture all known issues here, but let me know if I'm missing any and I'll update the details here.

- Ensure `kedro-telemetry` does not interrupt the CI/CD workflow; right now users have to check the documentation to find out when `kedro-telemetry` will interrupt their workflow
- Make `kedro-telemetry` a mandatory dependency, meaning that users will have `kedro-telemetry` packaged in Kedro and it will no longer be part of the requirements of the starters
- Check that `kedro-telemetry` works with Databricks
- Explain `kedro-telemetry` to our users
- We collect `package_name` and `project_name`; investigate why `project_name` is a blank field in our data
- The number of times the `kedro viz` CLI command runs differs from the number of users of Kedro-Viz according to Heap Analytics

## What else could we learn from our users?
I'll always be forward-looking on how we could continue to learn more about our users and even improve our existing metrics. I'd like to use a key to detail the status of each metric.

Status of metric:

- `kedro-telemetry` collects and hashes the computer's username upon user consent for user ID generation and counting.
- Compare PyPI downloads with `kedro-telemetry` user data, i.e. if `kedro-telemetry` user data declines but PyPI downloads increase then that might be a sign.
- `kedro-telemetry` hashes `package_name` and `project_name` for project ID generation and counting.
- `kedro-datasets`
- If `kedro-telemetry` is active, a hook counts this figure.
- If `kedro-telemetry` is active, a hook counts this figure.
- If `kedro-telemetry` is active, a hook counts this figure.
- `kedro-telemetry` versions
- If `kedro-telemetry` is active, a hook reads this figure from their project.
- If `kedro-telemetry` is active, a hook reads this figure from their project.
- If `kedro-telemetry` is active, a hook reads this figure from their project.
- If `kedro-telemetry` is active, a hook counts this figure.

## What are other projects that we can be inspired by?
I'm just going to list them and not detail what they're about and what we could learn:

- `telemetry-python`
- `iterative-telemetry` (by DVC)
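On the "user ID generation" metric above: hashing the computer's username gives a stable, pseudonymous identifier without transmitting the raw name. A minimal sketch of that pattern follows; this is an illustration only, not `kedro-telemetry`'s actual implementation (its hash choice, salting, and consent handling may differ), and `hashed_user_id` is a hypothetical name.

```python
import getpass
import hashlib


def hashed_user_id() -> str:
    """Derive a pseudonymous user ID by hashing the OS username.

    Hypothetical sketch of the "hash the computer's username" approach
    described above; consent must be checked before this is ever called.
    """
    try:
        username = getpass.getuser()
    except (KeyError, OSError, ImportError):
        # No username available in this environment (e.g. stripped-down CI).
        username = "unknown"
    # A one-way hash maps the same machine to the same ID on every run,
    # allowing user counting without sending the raw username anywhere.
    return hashlib.sha512(username.encode("utf-8")).hexdigest()
```

The same idea extends to the "project ID" metric, where `package_name` and `project_name` are hashed instead of the username.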