takikadiri / kedro-boot

A kedro plugin that streamlines the integration between Kedro projects and third-party applications, making it easier for you to develop end-to-end production-ready data science applications.
Apache License 2.0

Is dataset caching persistent across runs? #32

Open · charlesbmi opened this issue 1 month ago

charlesbmi commented 1 month ago

Thanks for creating this awesome project! I am excited to use it as a plugin for Kedro pipeline-parameter sweeps (e.g. via Hydra or Optuna).

I was interested in this point in the README:

> you can run the same session multiple times with many speed optimisation (including dataset caching)

but I couldn't find any information about it in the codebase. Is it implemented? If so, is the dataset cached to disk across session runs, or is it just `kedro.io.CachedDataSet` under the hood?

takikadiri commented 1 month ago

Hi charlesbmi, I'm glad you liked the project!

Yes, the cached datasets persist across runs. kedro-boot caches/preloads some datasets as MemoryDataset in order to speed up the runs and achieve low latency. The process of preparing the catalog for multiple runs is called catalog compilation. You can dry-run the compilation with `kedro boot compile --pipeline your_pipeline`; the list of artifact datasets that would be cached is described in the compilation report.
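
For instance (the pipeline name below is just a placeholder):

```bash
# Dry-run the catalog compilation for a given pipeline and review the
# compilation report, which lists the datasets that would be cached.
kedro boot compile --pipeline training
```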

In your use case, you would have a thin application that injects some parameters into your pipelines; kedro-boot would preload all the other datasets as MemoryDataset, since they don't change between runs.
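
As a rough sketch of that pattern (the names here are assumptions rather than the exact kedro-boot API: a `session` object provided by kedro-boot, exposing a `run()` method with an `itertime_params`-style argument for per-run parameter overrides):

```python
# Hypothetical sketch of a thin application doing a parameter sweep on top
# of a compiled kedro-boot session. The session.run() signature and the
# itertime_params keyword are assumptions; check them against the
# kedro-boot version you are using.
def sweep_learning_rate(session, pipeline_name="training"):
    """Run the same compiled session several times, injecting a different
    parameter value on each iteration; all other datasets stay preloaded
    in memory thanks to catalog compilation."""
    results = {}
    for lr in (0.01, 0.05, 0.1):
        run_output = session.run(
            name=pipeline_name,                     # placeholder pipeline name
            itertime_params={"learning_rate": lr},  # per-run parameter override (assumed keyword)
        )
        results[lr] = run_output
    return results
```

An Optuna or Hydra driver would then call something like this once per trial, which is exactly the multiple-run, low-latency case the caching is meant to speed up.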

Let us know if it works for you.