kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Typically expect catalog entries to have unique filepaths, protecting against overwrite #3993

Open david-stanley-94 opened 5 days ago

david-stanley-94 commented 5 days ago

Description

Data has been accidentally overwritten in the past after copy-pasting a catalog entry to derive a new one and forgetting to change the filepath. It would be useful to protect against this kind of situation by expecting catalog entries to have unique filepaths by default, and throwing an error when this is not the case, with certain sensible opt-outs the user / developer can add.

Context

This would prevent some accidental overwriting of data by users, while still allowing unchanged functionality for when catalog entries are expected to share filepaths (e.g. SQLDatasets, transcoded entries).

Possible Implementation

By default, check for duplicate filepaths across the entire catalog and throw an error when one is found, with the following exceptions: transcoded entries (e.g. `@pandas` / `@spark` pairs), dataset types that are expected to share a location (e.g. SQLDatasets), and entries explicitly opted out via an `overwrite` flag.

So for a catalog.yml with:

my_first_csv_dataset:
  type: pandas.CSVDataset
  filepath: path/to/csv

my_first_edited_csv_dataset:
  type: pandas.CSVDataset
  filepath: path/to/csv

my_first_alt_edited_csv_dataset:
  type: pandas.CSVDataset
  filepath: path/to/csv
  overwrite: True

my_second_csv_dataset@pandas:
  type: pandas.CSVDataset
  filepath: path/to/second/csv

my_second_csv_dataset@spark:
  type: spark.SparkDataset
  filepath: path/to/second/csv

my_sql_dataset:
  type: SQLDataset
  filepath: path/to/table

my_edited_sql_dataset:
  type: SQLDataset
  filepath: path/to/table

my_alt_edited_sql_dataset:
  type: SQLDataset
  filepath: path/to/table
  overwrite: False

There would be errors thrown for `my_first_edited_csv_dataset` (duplicate filepath with no opt-out) and `my_alt_edited_sql_dataset` (the opt-out explicitly disabled), while the transcoded pair, the `overwrite: True` entry, and the remaining SQL entries would pass.
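The default check described above could be sketched as a small helper over the parsed catalog.yml dict. The function name is hypothetical, and the `overwrite` key is the opt-out flag proposed in this issue, not an existing Kedro option; type-based exceptions (e.g. SQLDatasets) are left out of the sketch for brevity.

```python
from collections import defaultdict


def find_duplicate_filepaths(catalog_entries):
    """Return {filepath: [dataset names]} for filepaths claimed by more than
    one entry that did not opt out.  `catalog_entries` is the dict parsed
    from catalog.yml; `overwrite` is the proposed opt-out flag."""
    by_path = defaultdict(list)
    for name, entry in catalog_entries.items():
        filepath = entry.get("filepath")
        if filepath is None or entry.get("overwrite", False):
            # No filepath (e.g. table-based datasets) or explicitly opted out.
            continue
        # Transcoded entries like "ds@pandas" / "ds@spark" describe one
        # logical dataset, so collapse them to the base name before comparing.
        base_name = name.split("@", 1)[0]
        if base_name not in by_path[filepath]:
            by_path[filepath].append(base_name)
    return {path: names for path, names in by_path.items() if len(names) > 1}


entries = {
    "my_first_csv_dataset": {"type": "pandas.CSVDataset", "filepath": "path/to/csv"},
    "my_first_edited_csv_dataset": {"type": "pandas.CSVDataset", "filepath": "path/to/csv"},
    "my_first_alt_edited_csv_dataset": {
        "type": "pandas.CSVDataset", "filepath": "path/to/csv", "overwrite": True,
    },
    "my_second_csv_dataset@pandas": {"type": "pandas.CSVDataset", "filepath": "path/to/second/csv"},
    "my_second_csv_dataset@spark": {"type": "spark.SparkDataset", "filepath": "path/to/second/csv"},
}
print(find_duplicate_filepaths(entries))
# Only the first two entries collide: the third opted out, and the
# transcoded pair collapses to a single logical dataset.
```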

Possible Alternatives

Add a flag for running with no duplicate filepaths expected: throw an error if duplicates are detected, otherwise don't. This could become the default behaviour at a later date if it sees popular use. However, it is not a versatile solution, as some pipelines may have a mixture of catalog entries they would and would not expect to be overwritten.

datajoely commented 5 days ago

I'm trying to think about how this could work - as part of @ElenaKhaustova and @iamelijahko 's excellent DataCatalog research (https://github.com/kedro-org/kedro/issues/3934) there is now an initiative to make a consistent API for datasets to expose the file path as a public method: https://github.com/kedro-org/kedro/issues/3929

I think once the public API ticket is in, it would be really easy to write some sort of after_catalog_created validation hook where you just collect all the filepath attributes and throw an error if you see more than one instance. The only complication I can see with this pattern is ensuring we validate the rendered filepath at runtime rather than any templated / factory filepaths, which are expressed differently at rest.
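A minimal sketch of such a hook, under stated assumptions: `named_datasets()` is a hypothetical accessor yielding `(name, dataset)` pairs, and the private `_filepath` fallback stands in for the public filepath API proposed in #3929. In a real project the method would also carry Kedro's `@hook_impl` decorator from `kedro.framework.hooks`, omitted here to keep the sketch dependency-free.

```python
class DuplicateFilepathHook:
    """Sketch of an after_catalog_created validation hook (hypothetical names)."""

    def after_catalog_created(self, catalog, **kwargs):
        seen = {}  # rendered filepath -> first dataset that claimed it
        # `named_datasets()` is an assumed accessor; once a public filepath
        # API exists it would replace the `_filepath` fallback below.
        for name, dataset in catalog.named_datasets():
            filepath = getattr(dataset, "_filepath", None)
            if filepath is None:
                continue  # e.g. table-based datasets with no file on disk
            base = name.split("@", 1)[0]  # transcoded entries share one file
            first = seen.setdefault(str(filepath), base)
            if first != base:
                raise ValueError(
                    f"Datasets {first!r} and {base!r} share filepath {filepath!r}"
                )
```

Validating the rendered filepath rather than the templated one falls out naturally here, since the hook runs after the catalog (and its resolved entries) has been created.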

Davide-Ragazzon commented 5 days ago

Maybe these checks could be performed by a separate, optional function that does catalog validation.

E.g. a common way to update datasets in Kedro is to define an "input_dataset" and an "updated_dataset" pointing to the same file, so you can have a node that takes the input and writes the updated version back to the same path.