Open inigohidalgo opened 5 months ago
@inigohidalgo Are you using the CachedDataset in your work? And if yes, what are you using it to do? We're not sure how often it's used. It's one of the older features of Kedro, so we're always open to understanding whether and how it's helpful.
Hi @yetudada, we use it for one specific pipeline in one project. Very niche requirement so I'm not surprised it isn't more widely used. I mentioned in Slack I consider it an antipattern we are supporting more than anything else. https://linen-slack.kedro.org/t/16408833/hiya-is-https-github-com-deepyaman-kedro-accelerator-still-s
Basically we are running some pipelines which extract some "live" data and append it to a table, but then we want to do some further downstream processing with that extracted batch as the input to some market processes. We don't want to reload from the saved dataset because we are saving to a big table and our I/O isn't super fast. kedro-accelerator covered part of the same use case but went further.
This pipeline could relatively-trivially be reworked to not have this requirement, but CachedDataset was the exact functionality we needed in that case.
If you don't plan on supporting CachedDataset long-term, feel free to close the issue; I mostly opened it "for reference" as I mentioned in that Slack thread.
EDIT: the reason I consider it an antipattern is that the return of loading the dataset can be totally different from what is returned by the actual node; this is quite common with deferred-loading libraries like Ibis and Polars. This makes the pipeline structure very brittle.
EDIT2: maybe related #3578
IMO the use cases of CachedDataset or kedro-accelerator are valid: they accelerate the pipeline, so it's a performance gain without much penalty. The downside is, as @inigohidalgo described, that if it's not used properly it may make the pipeline less reproducible.
Consider this classic example:
```python
# A pandas dataframe
df1.to_csv("raw.csv")
df2 = pd.read_csv("raw.csv")  # skipped by CachedDataset or `kedro-accelerator`
```
Depending on what the dataframe looks like, `df1` may not be identical to `df2` due to lossy typing. This is a lot less common if one is using a strongly typed format like Parquet. So in this case, the cache approach may "hide" the problem until someone tries to put the pipeline into production.
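A minimal sketch of that typing problem (illustrative data, not from the original pipeline): the CSV round-trip that the cache would skip silently changes dtypes, so `df1` and `df2` are not interchangeable.

```python
import io

import pandas as pd

# df1 has a proper datetime column in memory.
df1 = pd.DataFrame({"when": pd.to_datetime(["2024-01-01", "2024-01-02"])})

# Round-trip through CSV -- this reload is exactly what
# CachedDataset / kedro-accelerator would skip.
buf = io.StringIO()
df1.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf)

print(df1["when"].dtype)  # datetime64[ns]
print(df2["when"].dtype)  # object -- the dates came back as strings
```

With the cache enabled, downstream nodes always see the `datetime64` version and the dtype loss only surfaces once the pipeline runs from disk.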
In a lot of cases it makes sense to use a cache: the data is already in memory, so it doesn't make sense to throw it away and re-load it from disk (which takes a lot of I/O time if the data is big).
Your example is a less extreme version of the problem I described. In your case you could still write a reproducible pipeline by explicitly casting the data to the correct pandas types.
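The explicit-cast fix suggested above could look roughly like this (a hedged sketch with made-up column names, not the actual project code): force the reloaded frame back to the dtypes the nodes expect, so the cached and reloaded objects behave identically.

```python
import io

import pandas as pd

df1 = pd.DataFrame({"when": pd.to_datetime(["2024-01-01"]), "n": [1]})

buf = io.StringIO()
df1.to_csv(buf, index=False)
buf.seek(0)

# Reload with explicit typing: parse the date column and cast the
# rest back to the dtypes the downstream nodes were written against.
df2 = pd.read_csv(buf, parse_dates=["when"]).astype({"n": "int64"})

assert list(df1.dtypes) == list(df2.dtypes)  # reload is now reproducible
```

With the dtypes pinned like this, skipping the reload via a cache no longer changes what the downstream node observes.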
(For clarity, since the terminology is duplicated: `KeDataset` means a Kedro `Dataset` implementation, `PaDataset` means the PyArrow Dataset object.)
We have a `KeDataset` implementation which takes a pandas dataframe and saves it to parquet using a `PaDataset`. `KeDataset.load` returns a `PaDataset` instance, which we filter and finally convert into a pandas dataframe again.
So when loading these `KeDataset`s we have some nodes which are meant specifically to filter `PaDataset`s. If we cached that data, the object passed into the node would be a pandas dataframe instead of the expected type, which would totally break the pipeline.
So in this case, the act of skipping loading the dataset actually returns a totally-incompatible object.
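The save/load type asymmetry can be sketched with toy classes (these are illustrative stand-ins, not the real Kedro `CachedDataset` or the project's `KeDataset`): a naive cache hands back the object that was *saved*, while downstream nodes were written against the object that `load` *returns*.

```python
class LazyReader:
    """Stand-in for a PyArrow Dataset: supports deferred filtering."""
    def __init__(self, records):
        self._records = records

    def filter(self, pred):
        return [r for r in self._records if pred(r)]


class KeDatasetSketch:
    """Toy dataset: save() takes plain records, load() returns a LazyReader."""
    def save(self, records):
        self._stored = list(records)

    def load(self):
        # Note: a *different* type than what save() received.
        return LazyReader(self._stored)


class NaiveCache:
    """CachedDataset-style wrapper: returns the in-memory object as-is."""
    def __init__(self, wrapped):
        self._wrapped, self._cache = wrapped, None

    def save(self, data):
        self._cache = data
        self._wrapped.save(data)

    def load(self):
        return self._cache if self._cache is not None else self._wrapped.load()


ds = KeDatasetSketch()
ds.save([{"x": 1}, {"x": 2}])
assert hasattr(ds.load(), "filter")  # downstream node expects .filter()

cached = NaiveCache(KeDatasetSketch())
cached.save([{"x": 1}, {"x": 2}])
assert not hasattr(cached.load(), "filter")  # cache returned the raw records
```

The second assertion is the failure mode described above: the cached object is whatever the upstream node produced, not what `load` would have returned, so any node relying on the loaded type breaks.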
Description
While researching something, I found the discoverability of the CachedDataset functionality to be a bit low. There isn't any reference to it in the "main" documentation.
I understand from @noklam that it's not a particularly widely-used feature though.