kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.

https://kedro.org

Apache License 2.0

Improve documentation on how to configure the dataset #3919

Closed ElenaKhaustova closed 5 days ago

ElenaKhaustova commented 1 month ago

Description

Users struggle to understand how to configure datasets properly, which leads to frustration. They are often unaware that the Kedro-Datasets component exists, and the Kedro documentation does not make it clear to them how to set up dataset parameters.

We propose adding a configuration example with a reference to https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1, showing specifically how to set up Kedro-related and dataset-related parameters.

Documentation page (if applicable)

https://docs.kedro.org/en/stable/data/data_catalog.html

Context

"They tend not to know the underlying library connected to the datasets. They need to be redirected to the right place in the documentation (e.g. pandas.CSVDataset API doc)" (C)

astrojuanlu commented 3 weeks ago

Is better documentation enough to address this though?

For example, this was the first comment a user made when joining our Slack:

Hey folks! Just started using kedro. Is there any kedro command to import datasets from a path into my data directory in the project?

(https://linen-slack.kedro.org/t/9703502/hey-folks-just-started-using-kedro-is-there-any-kedro-comman#296704bb-7be1-419c-94b2-2429086acbea, cc @juanmarin00)

In the same way we have kedro pipeline create, we could have kedro dataset import /tmp/my_data.csv or something like that, populating the catalog for you.
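As a rough sketch of what such a command might generate, assuming it copies the file into the project and infers the dataset type from the file suffix: the function name `catalog_entry_for`, the suffix mapping, and the `data/01_raw` target directory are all hypothetical and not part of any existing Kedro CLI.

```python
from pathlib import PurePosixPath

# Assumed mapping from file suffix to a kedro-datasets type string.
# A real command would need a richer, configurable mapping.
SUFFIX_TO_TYPE = {
    ".csv": "pandas.CSVDataset",
    ".parquet": "pandas.ParquetDataset",
    ".json": "pandas.JSONDataset",
    ".xlsx": "pandas.ExcelDataset",
}


def catalog_entry_for(source_path: str, data_dir: str = "data/01_raw") -> dict:
    """Build a catalog entry (as a dict) for a file imported into data_dir."""
    path = PurePosixPath(source_path)
    try:
        dataset_type = SUFFIX_TO_TYPE[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"No known dataset type for suffix {path.suffix!r}") from None
    # The entry name is the file stem; the filepath points inside the project.
    return {
        path.stem: {
            "type": dataset_type,
            "filepath": f"{data_dir}/{path.name}",
        }
    }


# Hypothetical result of `kedro dataset import /tmp/my_data.csv`.
entry = catalog_entry_for("/tmp/my_data.csv")
```

The resulting dict has the same shape as a catalog.yml entry, so it could be serialised to YAML and appended to the catalog.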

astrojuanlu commented 3 weeks ago

It's also unclear whether this is related to the DataCatalog API itself; it seems more of a Kedro DX thing in general.

merelcht commented 3 weeks ago

I'd be curious to know what's really meant by "configuring" a dataset. We have a large number of YAML examples in the docs: https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html, but if that's not what users are looking for, then what is it they'd like to see?

ElenaKhaustova commented 1 week ago

@astrojuanlu, @merelcht, what we learned from the interviews is that less experienced users miss the connection between the DataCatalog, the Dataset, and the underlying Python package that a specific dataset implementation wraps (pandas, for example). When users add a dataset configuration to catalog.yml, it is not obvious to some of them that the top-level configuration parameters (filepath, load_args, etc.) are defined by the dataset implementation, whereas the contents of load_args, for example, are defined by the underlying library, such as pandas.

We can add a small example to the docs to clarify the dependency DataCatalog -> Dataset -> underlying library.
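To make that dependency concrete, here is a toy sketch that mimics the chain using only the Python standard library. `ToyCatalog` and `ToyCSVDataset` are hypothetical stand-ins, not Kedro classes, and `csv.DictReader` plays the role of the underlying library. In a real project the entry would use `type: pandas.CSVDataset`, which passes `load_args` on to `pandas.read_csv`.

```python
import csv
import os
import tempfile


class ToyCSVDataset:
    """Hypothetical stand-in for a dataset class such as pandas.CSVDataset."""

    def __init__(self, filepath, load_args=None):
        # `filepath` and `load_args` are defined by the dataset implementation.
        self.filepath = filepath
        # The keys *inside* load_args are defined by the underlying library:
        # here csv.DictReader, in pandas.CSVDataset it is pandas.read_csv.
        self.load_args = load_args or {}

    def load(self):
        with open(self.filepath, newline="") as f:
            return list(csv.DictReader(f, **self.load_args))


# Maps the `type` string of a catalog entry to a dataset class.
DATASET_TYPES = {"toy.CSVDataset": ToyCSVDataset}


class ToyCatalog:
    """Hypothetical stand-in for DataCatalog: builds datasets from config."""

    def __init__(self, datasets):
        self._datasets = datasets

    @classmethod
    def from_config(cls, config):
        datasets = {}
        for name, entry in config.items():
            entry = dict(entry)  # copy so the caller's config is not mutated
            dataset_cls = DATASET_TYPES[entry.pop("type")]
            # Everything except `type` goes to the dataset constructor.
            datasets[name] = dataset_cls(**entry)
        return cls(datasets)

    def load(self, name):
        return self._datasets[name].load()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cars.csv")
    with open(path, "w", newline="") as f:
        f.write("name;speed\nbeetle;120\n")

    # Dict equivalent of a catalog.yml entry.
    config = {
        "cars": {
            "type": "toy.CSVDataset",           # resolved by the catalog
            "filepath": path,                   # defined by the dataset
            "load_args": {"delimiter": ";"},    # defined by the csv module
        }
    }
    catalog = ToyCatalog.from_config(config)
    rows = catalog.load("cars")
```

The point of the sketch is where each key is documented: `type` is resolved by the catalog, `filepath` and `load_args` belong to the dataset class, and the options inside `load_args` must be looked up in the underlying library's own documentation.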