kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.53k stars 877 forks source link

Improve MemoryDataset copy_mode selection #3551

Open inigohidalgo opened 6 months ago

inigohidalgo commented 6 months ago

Description

Following from discussion in Slack

The way MemoryDataset decides which of the 3 copy modes to use is very straightforward, but in my opinion is lacking 4 years after its original implementation.

If no parameter is passed, it uses the function _infer_copy_mode

In the case of Pandas dataframes and (I think) Spark dataframes, this is the correct logic, but is too restrictive in its logic to make everything else deepcopy by default.

This is a problem when using Ibis Tables, which are not serializable, as discussed here https://github.com/kedro-org/kedro/issues/2374#issuecomment-1593200758

The solution is simple: to add the dataset to the catalog and manually set it to be a MemoryDataset with copy_mode: assign, but this rapidly becomes cumbersome when passing tables between many nodes, and becomes an unmaintainable hell.

I would like to be able to influence the logic in _infer_copy_mode by either

@noklam correctly pointed out this last one could be done using dataset factories like so

# catalog.yml

{_catch_all_}:
  type: my.own.MemoryDataset # with different _infer_copy_mode()
OR
  copy_mode: "assign"

OR

{table_name#_ibis}: # haven't used factories yet myself, quoting the syntax from memory
  type: MemoryDataset
  copy_mode: "assign"

But we all agreed this wasn't the most ergonomic way to do it, as it forces you to do it in every project.

I would welcome your thoughts on this or any solutions I may have missed (some sort of hook?)

datajoely commented 6 months ago

I'd also highlight this line intended detect if we're talking about a Spark dataframe or not is probably quite fragile and is possibly a relic of assumptions that made sense a few years ago but not anymore...

https://github.com/kedro-org/kedro/blob/e187dca1a5546a0a87f609682e87245f362e65f8/kedro/io/memory_dataset.py#L106

inigohidalgo commented 6 months ago

https://github.com/kedro-org/kedro/blob/e187dca1a5546a0a87f609682e87245f362e65f8/kedro/io/memory_dataset.py#L60-L70

@deepyaman I just realized that subclassing wouldn't be as easy an option as I thought as _infer_copy_mode is a function, not a method, and it's called within the _load and _save methods, so would need to override both of them, unless you preemptively set the copy_mode in the __init__ of the subclass