kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0

Check Telemetry data accuracy and add example_pipeline information #543

Closed DimedS closed 7 months ago

DimedS commented 8 months ago

Description

  1. During the telemetry collection process, we gather data on the number of datasets, nodes, and pipelines within a project, as well as the tools selected.

After executing kedro new, selecting tools 1-5, and setting example=y, I visited the Heap website to review the collected data for that session. The data retrieved included three lines:

Screenshot 2024-02-07 at 12 37 43

but it lacked information regarding the tools I had chosen. Additionally, the reported numbers appeared to be inaccurate; the spaceflights-pandas starter actually contains 2 pipelines, 7 datasets, and 6 nodes:

Screenshot 2024-02-07 at 12 38 06
  2. Additionally, we should include the example_pipeline option in our data collection (already stored in pyproject.toml) to gather information alongside the selected tools.

Your Environment

Screenshot 2024-02-07 at 12 36 09
astrojuanlu commented 8 months ago

> Additionally, the reported numbers appeared to be inaccurate; the spaceflights-pandas starter actually contains 2 pipelines, 7 datasets, and 6 nodes:

I suspect

As for __default__, it's unclear what we should do here. I think it's correct to include it; after all, users could just as well delete the default code in pipeline_registry.py. As for the params: entries being counted as datasets, they probably should not be counted.
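
A rough sketch of the counting rule discussed here, assuming parameter entries can be recognised by the name `parameters` or a `params:` prefix. The function name, the `include_default` flag, and the sample names are hypothetical and are not taken from the telemetry plugin's code.

```python
def count_project_entities(dataset_names, pipeline_names, *, include_default=True):
    """Count datasets and pipelines, skipping parameter entries.

    Assumption: parameters show up in the catalog as "parameters" or
    as names starting with "params:".
    """
    real_datasets = [
        name
        for name in dataset_names
        if name != "parameters" and not name.startswith("params:")
    ]
    pipelines = [
        name
        for name in pipeline_names
        if include_default or name != "__default__"
    ]
    return {"datasets": len(real_datasets), "pipelines": len(pipelines)}


# Hypothetical names, for illustration only.
dataset_names = [
    "companies",
    "shuttles",
    "parameters",
    "params:model_options",
    "params:model_options.test_size",
]
pipeline_names = ["__default__", "data_processing", "data_science"]

with_default = count_project_entities(dataset_names, pipeline_names)
without_default = count_project_entities(
    dataset_names, pipeline_names, include_default=False
)
print(with_default, without_default)
```

Keeping `__default__` behind a flag leaves the open question above undecided while making the effect of either choice easy to compare.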

> it lacked information regarding the tools I had chosen

I recall doing an experiment early on and seeing them. Let's look into it.

> Additionally, we should include the example_pipeline option in our data collection (already stored in pyproject.toml) to gather information alongside the selected tools.

Agreed 👍🏽

DimedS commented 7 months ago

We should also update the Kedro docs: add example_pipeline to the Telemetry collection chapter.