Check Telemetry data accuracy and add example_pipeline information

DimedS commented 8 months ago

Description

During the telemetry collection process, we gather data on the number of datasets, nodes, and pipelines within a project, as well as the tools selected.

After executing kedro new, selecting tools 1-5, and setting example=y, I visited the Heap website to review the collected data for that session. The data retrieved included three lines:

but it lacked information regarding the tools I had chosen. Additionally, the reported numbers appeared to be inaccurate; the spaceflights-pandas starter actually contains 2 pipelines, 7 datasets, and 6 nodes:

Additionally, we should include the example_pipeline option in our data collection (already stored in pyproject.toml) to gather information alongside the selected tools.

Your Environment

astrojuanlu commented 8 months ago

> Additionally, the reported numbers appeared to be inaccurate; the spaceflights-pandas starter actually contains 2 pipelines, 7 datasets, and 6 nodes:

I suspect two things:

- We're counting __default__ as an extra pipeline.
- We're counting params: entries as datasets.

About __default__, it's unclear what we should do here. I think it's correct to include it; after all, users could just as well delete the default code in pipeline_registry.py. As for params: entries being counted as datasets, they probably should not be counted.
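A rough sketch of the counting rule discussed above, using plain Python structures rather than Kedro objects. The function name, the input shapes, and the sample names are all hypothetical; treating the bare `parameters` entry like a `params:` entry is also an assumption.

```python
def count_project_items(pipelines, dataset_names, include_default=True):
    """Count pipelines, datasets and nodes for a telemetry payload.

    pipelines: dict mapping pipeline name -> list of node names
    dataset_names: iterable of dataset names, possibly mixed with parameters
    """
    # Optionally drop the registry's __default__ entry from the pipeline count.
    names = [n for n in pipelines if include_default or n != "__default__"]
    # Parameters are not datasets, so skip "parameters" and "params:..." entries.
    datasets = [
        d for d in dataset_names
        if d != "parameters" and not d.startswith("params:")
    ]
    # Use a set so nodes shared between __default__ and named pipelines count once.
    nodes = {node for name in names for node in pipelines[name]}
    return {"pipelines": len(names), "datasets": len(datasets), "nodes": len(nodes)}


# Illustrative input only, not the real spaceflights-pandas contents.
example_pipelines = {
    "__default__": ["clean", "merge", "train"],
    "data_processing": ["clean", "merge"],
    "data_science": ["train"],
}
example_datasets = ["companies", "model_input_table", "params:model_options", "parameters"]

with_default = count_project_items(example_pipelines, example_datasets)
without_default = count_project_items(example_pipelines, example_datasets, include_default=False)
print(with_default, without_default)
```

The `include_default` flag leaves the open question about __default__ as a switch instead of hard-coding either answer.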

> it lacked information regarding the tools I had chosen

I recall doing an experiment early on and seeing them. Let's look into it.

> Additionally, we should include the example_pipeline option in our data collection (already stored in pyproject.toml) to gather information alongside the selected tools.

Agreed 👍🏽

DimedS commented 7 months ago

We should also update the Kedro docs: add example_pipeline to the Telemetry collection chapter.

kedro-org / kedro-plugins

Check Telemetry data accuracy and add example_pipeline information #543
