kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

CachedDataset example usage #3616

Open inigohidalgo opened 5 months ago

inigohidalgo commented 5 months ago

Description

In the process of researching something else, I found the discoverability of the CachedDataset functionality to be quite low: there isn't any reference to it in the "main" documentation.

I understand from @noklam that it's not a particularly widely-used feature though.

Documentation page (if applicable)

Context

yetudada commented 4 months ago

@inigohidalgo Are you using the CachedDataset in your work? And if yes, what are you using it for? We're not sure how often it's used. It's one of the older features of Kedro, so we're always open to understanding how it can still be helpful.

inigohidalgo commented 4 months ago

Hi @yetudada, we use it for one specific pipeline in one project. It's a very niche requirement, so I'm not surprised it isn't more widely used. As I mentioned in Slack, I consider it more of an antipattern we are supporting than anything else. https://linen-slack.kedro.org/t/16408833/hiya-is-https-github-com-deepyaman-kedro-accelerator-still-s

Basically we are running some pipelines which extract some "live" data and append it to a table. We then want to do some further downstream processing with that extracted batch as the input to some market processes, but we don't want to reload from the saved dataset, since we are saving to a big table and our I/O isn't super fast. kedro-accelerator covered part of the same use case but went further.

This pipeline could relatively-trivially be reworked to not have this requirement, but CachedDataset was the exact functionality we needed in that case.
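For reference, a minimal sketch of what this looks like with Kedro's Python API (the dataset type and filepath are illustrative, not our actual catalog entries):

import pandas as pd
from kedro.io import CachedDataset
from kedro_datasets.pandas import ParquetDataset

# Wrap the slow-to-reload table in a CachedDataset.
extracted_batch = CachedDataset(
    dataset=ParquetDataset(filepath="data/02_intermediate/extracted_batch.parquet")
)

df = pd.DataFrame({"price": [1.0, 2.0]})
extracted_batch.save(df)          # persists to parquet and keeps a copy in memory
same_df = extracted_batch.load()  # served from the in-memory copy, no disk read

Downstream nodes therefore get the in-memory object back instead of re-reading the big table.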

If you don't plan on supporting CachedDataset long term, feel free to close the issue; I mostly opened it "for reference", as I mentioned in that Slack thread.

EDIT: The reason I consider it an antipattern is that the object returned by loading the dataset can be totally different from what is returned by the actual node; this is quite common with deferred-loading libraries like ibis and polars. That makes the pipeline structure very brittle.
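A small illustration of that mismatch with polars (file name and data are made up):

import polars as pl

df_eager = pl.DataFrame({"x": [1, 2, 3]})  # what the node returns and saves
df_eager.write_parquet("batch.parquet")
lazy = pl.scan_parquet("batch.parquet")    # what an actual load would return

# A cache passes df_eager (pl.DataFrame) downstream, whereas a real reload
# yields `lazy` (pl.LazyFrame); code written against one breaks on the other.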

EDIT2: maybe related #3578

noklam commented 4 months ago

IMO the use cases for CachedDataset or kedro-accelerator are valid. It speeds up the pipeline, so it's a performance gain without much penalty. The downside, as @inigohidalgo described, is that if it's not used properly it may make the pipeline less reproducible.

Consider this classic example:

import pandas as pd

# A pandas DataFrame `df1` round-tripped through CSV
df1.to_csv("raw.csv")
df2 = pd.read_csv("raw.csv")  # this reload is skipped by CachedDataset or `kedro-accelerator`

Depending on what the dataframe looks like, df1 may not be identical to df2 due to lossy typing; this is a lot less common if you use a strongly typed format like Parquet. So in this case, the cache approach may "hide" the problem until someone tries to put the pipeline into production.
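For instance (made-up data, just to show the dtype drift through CSV):

import pandas as pd

df1 = pd.DataFrame({"ts": pd.to_datetime(["2024-01-01"]), "value": [1.0]})
df1.to_csv("raw.csv", index=False)
df2 = pd.read_csv("raw.csv")

print(df1["ts"].dtype)  # datetime64[ns]
print(df2["ts"].dtype)  # object -- the timestamp comes back as a string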

In a lot of cases it makes sense to use the cache: the data is already in memory, so it doesn't make sense to throw it away and re-load it from disk (which takes a lot of I/O time if the data is big).

inigohidalgo commented 4 months ago

Your example is a less extreme version of the problem I described. In your case you could still write a reproducible pipeline by explicitly casting the data to the correct pandas types.

(For clarity, since there is duplicated terminology: KeDataset means a kedro Dataset implementation, PaDataset means the PyArrow Dataset object.)

We have a KeDataset implementation which takes a pandas dataframe and saves it to parquet using a PaDataset. KeDataset.load returns a PaDataset instance which we filter and finally convert into a pandas dataframe again.

So when loading these KeDatasets, we have some nodes which are meant specifically to filter PaDatasets. If we cached that data, the object being passed into the node would be a pandas dataframe instead of the expected type, which would totally break the pipeline.

So in this case, skipping the load actually hands the node a totally incompatible object.
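A rough sketch of the shape of that dataset, with made-up names rather than our actual implementation, to show where caching breaks:

import pandas as pd
import pyarrow.dataset as ds
from kedro.io import AbstractDataset


class PandasToPyArrowParquet(AbstractDataset[pd.DataFrame, ds.Dataset]):
    def __init__(self, filepath: str):
        self._filepath = filepath

    def _save(self, data: pd.DataFrame) -> None:
        # Upstream nodes produce a pandas DataFrame, persisted as parquet.
        data.to_parquet(self._filepath)

    def _load(self) -> ds.Dataset:
        # Downstream nodes expect a deferred pyarrow Dataset they can filter.
        return ds.dataset(self._filepath, format="parquet")

    def _describe(self) -> dict:
        return {"filepath": self._filepath}

Wrapping this in a CachedDataset would hand downstream nodes the cached pandas DataFrame (the object that was saved) instead of the pyarrow Dataset that _load would return, which is exactly the breakage described above.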