kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Typically expect catalog entries to have unique filepaths, protecting against overwrite #3993

Open david-stanley-94 opened 5 days ago

david-stanley-94 commented 5 days ago

Description

Data has been accidentally overwritten in the past after copy-pasting a catalog entry to derive a new one and forgetting to change the filepath. It would be useful to protect against this kind of situation by expecting catalog entries to have unique filepaths by default, and throwing an error when this is not the case, with certain sensible opt-outs the user / developer can add.

Context

This would prevent some accidental overwriting of data by users, while still allowing unchanged functionality for when catalog entries are expected to share filepaths (e.g. SQLDatasets, transcoded entries).

Possible Implementation

By default, check for duplicate filepaths across the entire catalog and throw an error when one is found, with the following exceptions: transcoded entries (e.g. `@pandas` / `@spark` pairs), dataset types that are expected to share a location (e.g. SQLDatasets), and entries explicitly opted out via an `overwrite` flag.

So for a catalog.yml with:

my_first_csv_dataset:
  type: pandas.CSVDataset
  filepath: path/to/csv

my_first_edited_csv_dataset:
  type: pandas.CSVDataset
  filepath: path/to/csv

my_first_alt_edited_csv_dataset:
  type: pandas.CSVDataset
  filepath: path/to/csv
  overwrite: True

my_second_csv_dataset@pandas:
  type: pandas.CSVDataset
  filepath: path/to/second/csv

my_second_csv_dataset@spark:
  type: spark.SparkDataset
  filepath: path/to/second/csv

my_sql_dataset:
  type: SQLDataset
  filepath: path/to/table

my_edited_sql_dataset:
  type: SQLDataset
  filepath: path/to/table

my_alt_edited_sql_dataset:
  type: SQLDataset
  filepath: path/to/table
  overwrite: False

There would be errors thrown for `my_first_edited_csv_dataset` (duplicate filepath with no opt-out) and `my_alt_edited_sql_dataset` (the opt-out explicitly disabled), while the transcoded pair, the `overwrite: True` entry, and the remaining SQL entries would pass.
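The default check described above could be sketched as a small helper over the parsed catalog.yml dict. The function name is hypothetical, and the `overwrite` key is the opt-out flag proposed in this issue, not an existing Kedro option; type-based exceptions (e.g. SQLDatasets) are left out of the sketch for brevity.

```python
from collections import defaultdict


def find_duplicate_filepaths(catalog_entries):
    """Return {filepath: [dataset names]} for filepaths claimed by more than
    one entry that did not opt out.  `catalog_entries` is the dict parsed
    from catalog.yml; `overwrite` is the proposed opt-out flag."""
    by_path = defaultdict(list)
    for name, entry in catalog_entries.items():
        filepath = entry.get("filepath")
        if filepath is None or entry.get("overwrite", False):
            # No filepath (e.g. table-based datasets) or explicitly opted out.
            continue
        # Transcoded entries like "ds@pandas" / "ds@spark" describe one
        # logical dataset, so collapse them to the base name before comparing.
        base_name = name.split("@", 1)[0]
        if base_name not in by_path[filepath]:
            by_path[filepath].append(base_name)
    return {path: names for path, names in by_path.items() if len(names) > 1}


entries = {
    "my_first_csv_dataset": {"type": "pandas.CSVDataset", "filepath": "path/to/csv"},
    "my_first_edited_csv_dataset": {"type": "pandas.CSVDataset", "filepath": "path/to/csv"},
    "my_first_alt_edited_csv_dataset": {
        "type": "pandas.CSVDataset", "filepath": "path/to/csv", "overwrite": True,
    },
    "my_second_csv_dataset@pandas": {"type": "pandas.CSVDataset", "filepath": "path/to/second/csv"},
    "my_second_csv_dataset@spark": {"type": "spark.SparkDataset", "filepath": "path/to/second/csv"},
}
print(find_duplicate_filepaths(entries))
# Only the first two entries collide: the third opted out, and the
# transcoded pair collapses to a single logical dataset.
```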

Possible Alternatives

Add a flag for running with no duplicate filepaths expected: throw an error if duplicates are detected, otherwise don't. This could become the default behaviour at a later date if it sees popular use. However, it is not a versatile solution, as some pipelines may have a mixture of catalog entries they would and would not expect to be overwritten.

datajoely commented 5 days ago

I'm trying to think about how this could work - as part of @ElenaKhaustova and @iamelijahko 's excellent DataCatalog research (https://github.com/kedro-org/kedro/issues/3934) there is now an initiative to make a consistent API for datasets to expose the file path as a public method: https://github.com/kedro-org/kedro/issues/3929

I think once the public API ticket is in, it would be really easy to write some sort of after_catalog_created validation hook where you just collect all the filepath attributes and throw an error if you see more than one instance. The only complication I can see with this pattern is ensuring we validate the rendered filepath at runtime rather than any templated / factory filepaths, which are expressed differently at rest.
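A minimal sketch of such a hook, under stated assumptions: `named_datasets()` is a hypothetical accessor yielding `(name, dataset)` pairs, and the private `_filepath` fallback stands in for the public filepath API proposed in #3929. In a real project the method would also carry Kedro's `@hook_impl` decorator from `kedro.framework.hooks`, omitted here to keep the sketch dependency-free.

```python
class DuplicateFilepathHook:
    """Sketch of an after_catalog_created validation hook (hypothetical names)."""

    def after_catalog_created(self, catalog, **kwargs):
        seen = {}  # rendered filepath -> first dataset that claimed it
        # `named_datasets()` is an assumed accessor; once a public filepath
        # API exists it would replace the `_filepath` fallback below.
        for name, dataset in catalog.named_datasets():
            filepath = getattr(dataset, "_filepath", None)
            if filepath is None:
                continue  # e.g. table-based datasets with no file on disk
            base = name.split("@", 1)[0]  # transcoded entries share one file
            first = seen.setdefault(str(filepath), base)
            if first != base:
                raise ValueError(
                    f"Datasets {first!r} and {base!r} share filepath {filepath!r}"
                )
```

Validating the rendered filepath rather than the templated one falls out naturally here, since the hook runs after the catalog (and its resolved entries) has been created.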

Davide-Ragazzon commented 5 days ago

Maybe these checks could be performed by a separate, optional function that does catalog validation.

E.g. a common way to update datasets in Kedro is to define an "input_dataset" and an "updated_dataset" pointing to the same file, so you can have a node that takes the input and writes the updated version back to the same path.