kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Access dataset filepath via public API for file-backed datasets  #3929

Open ElenaKhaustova opened 3 weeks ago

ElenaKhaustova commented 3 weeks ago

Description

Users have trouble accessing and managing dataset filepaths. AbstractDataset has no mandatory filepath attribute, and there is no standard API for accessing dataset metadata, so users cannot reliably retrieve a dataset's filepath or tell which dataset version was loaded. APIs also differ across dataset types, which forces users to write custom logic for dataset access and metadata retrieval.
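To illustrate the kind of custom logic this leads to, here is a minimal sketch of a workaround that probes private attributes. The two dataset classes are hypothetical stand-ins rather than real Kedro datasets, and the attribute names `_filepath` and `_path` are assumptions about where an implementation might keep its path.

```python
from pathlib import PurePosixPath
from typing import Any, Optional


class FakeCSVDataset:
    """Hypothetical stand-in for a file-backed dataset with a private path."""

    def __init__(self, filepath: str) -> None:
        self._filepath = PurePosixPath(filepath)


class FakeSQLDataset:
    """Hypothetical stand-in for a dataset that has no file behind it."""

    def __init__(self, table_name: str) -> None:
        self._table_name = table_name


def guess_filepath(dataset: Any) -> Optional[PurePosixPath]:
    """Probe assumed private attribute names, since no public accessor exists."""
    for attr in ("_filepath", "_path"):
        value = getattr(dataset, attr, None)
        if value is not None:
            return PurePosixPath(str(value))
    # Nothing matched: the dataset is not file-backed, or uses another name.
    return None
```

A helper like this breaks silently whenever a dataset renames its private attribute, which is the fragility a public API would remove.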

We propose to:

  1. Explore the feasibility of implementing a file-backed AbstractDataset with a mandatory filepath attribute, so that users have a consistent and reliable way to access dataset filepaths.
  2. Develop a standard API for accessing metadata across dataset types, and decide what the standard metadata should include for each dataset type.
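One possible shape for both points is sketched below. This is only an illustration under stated assumptions, not a proposed or existing Kedro API: `FileBackedDataset`, `DatasetMetadata`, `metadata()`, and `TextDataset` are hypothetical names, and the metadata fields shown are a guess at what a standard set might contain.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Any, Optional


@dataclass(frozen=True)
class DatasetMetadata:
    """Hypothetical standard metadata record for a file-backed dataset."""

    filepath: PurePosixPath
    load_version: Optional[str] = None


class FileBackedDataset(ABC):
    """Hypothetical base class that makes the filepath mandatory and public."""

    def __init__(self, filepath: str, load_version: Optional[str] = None) -> None:
        if not filepath:
            raise ValueError("filepath is mandatory for file-backed datasets")
        self._filepath = PurePosixPath(filepath)
        self._load_version = load_version

    @property
    def filepath(self) -> PurePosixPath:
        """Public, read-only access to the dataset's path."""
        return self._filepath

    def metadata(self) -> DatasetMetadata:
        """Return the same metadata structure for every file-backed dataset."""
        return DatasetMetadata(self._filepath, self._load_version)

    @abstractmethod
    def load(self) -> Any:
        """Concrete datasets decide how the file is read."""


class TextDataset(FileBackedDataset):
    """Hypothetical concrete dataset that reads a UTF-8 text file."""

    def load(self) -> str:
        with open(str(self.filepath), encoding="utf-8") as handle:
            return handle.read()
```

With an interface along these lines, a plugin could read `dataset.filepath` and `dataset.metadata().load_version` without knowing the concrete dataset type.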

Relates to https://github.com/kedro-org/kedro/issues/1936

Context

https://github.com/Galileo-Galilei/kedro-mlflow/blob/64b8e94e1dafa02d979e7753dab9b9dfd4d7341c/kedro_mlflow/io/artifacts/mlflow_artifact_dataset.py#L48

https://kedro-mlflow.readthedocs.io/en/stable/source/07_python_objects/01_DataSets.html
