kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.89k stars, 897 forks

Added 'data_dict' attribute (DataDictDataset) to AbstractVersionedDataset #3737

Closed noamgoldberg closed 3 months ago

noamgoldberg commented 6 months ago

Description

Development notes

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

astrojuanlu commented 6 months ago

Hi @noamgoldberg, thanks for your PR! Could you explain the rationale behind this? What problem does it solve?

noamgoldberg commented 6 months ago

> Hi @noamgoldberg, thanks for your PR! Could you explain the rationale behind this? What problem does it solve?

I use kedro a lot for personal projects, and it's helpful to have a data dictionary attached to large datasets. For example, I like to create a data_dict.yml with the feature descriptions, ranges, and general source information, to be referenced in Jupyter notebooks and used dynamically in the code (e.g. visualizations, reports). The DataDictDataset class is a rather straightforward AbstractDataset, but the unique and helpful change in this PR is that it lets you attach an instance of DataDictDataset to other datasets inheriting from AbstractVersionedDataset (e.g. pandas.CSVDataSet). For example, this would enable the following entry in catalog.yml:

    stocks_data:
        type: pandas.CSVDataSet
        filepath: data/01_raw/stocks.csv
        data_dict:
            dataset: yaml.YAMLDataSet
            filepath: data/01_raw/data_dict.yml

This would create a dataset stocks_data with an attached data dictionary.

merelcht commented 6 months ago

@noamgoldberg so this data_dict basically contains metadata about the dataset?

noamgoldberg commented 6 months ago

@merelcht yes :) I mainly use it for feature definitions and basic dataset information (e.g. author, source, location/date created)

astrojuanlu commented 5 months ago

Hi @noamgoldberg, sorry it took us so long to get back to you.

IIUC, the data_dict you propose here already exists and it's called metadata. See an example here:

https://docs.kedro.org/projects/kedro-viz/en/latest/kedro-viz_visualisation.html#visualise-layers

Please confirm if that would suit your needs. Arguably we could do a better job at documenting it, most likely here: https://docs.kedro.org/en/stable/data/data_catalog.html
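
For illustration, a minimal sketch of how the data dictionary from the earlier catalog entry might be expressed through the `metadata` key instead. It assumes, as the linked documentation suggests, that `metadata` is a free-form mapping that Kedro itself does not interpret. The `kedro-viz` layer entry follows the linked example; the `data_dict` block and every key under it are hypothetical placeholders, not a Kedro schema.

```yaml
stocks_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/stocks.csv
  metadata:
    # Consumed by Kedro-Viz, per the linked "visualise layers" example
    kedro-viz:
      layer: raw
    # Hypothetical free-form keys standing in for the data dictionary
    data_dict:
      source: "placeholder source description"
      features:
        close:
          description: "placeholder feature description"
```

Whether this fully replaces the proposed `data_dict` attribute depends on whether the dictionary needs to live in its own file, which this sketch does not cover.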

ankatiyar commented 3 months ago

Hey @noamgoldberg, thanks for this PR. Just wanted to check if the metadata feature that @astrojuanlu linked is sufficient for your use case? We'll close this PR if so!

ankatiyar commented 3 months ago

I'll close this for now, @noamgoldberg. Do reach out to us and/or open an issue if the above-mentioned feature is not sufficient!