Detailed Description

To speed up training, it could be useful to write some examples or batches to disk so they can be loaded back quickly. There are built-in torchdata datapipes for caching the outputs of other datapipes, so we could cache only certain stages of the pipeline, or just the final torch Tensors.
The OnDiskCacheHolder and EndOnDiskCacheHolder datapipes look like the relevant building blocks: the first checks whether a cached file already exists for an input and skips the intermediate steps if it does, and the second writes newly computed results out to disk.
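The caching idea can be sketched without torchdata at all. The snippet below is a minimal stand-in, not the real implementation: pickle stands in for torch.save/torch.load, and produce_batch is a hypothetical placeholder for the slow Zarr-backed pipeline; in the real datapipe chain, OnDiskCacheHolder and EndOnDiskCacheHolder would handle the existence check and the write step.

```python
import os
import pickle
import tempfile

# Minimal sketch of on-disk caching of pipeline outputs.
# produce_batch is a hypothetical stand-in for the expensive
# Zarr-backed datapipe; pickle stands in for torch.save/torch.load.
CACHE_DIR = tempfile.mkdtemp()

def produce_batch(idx):
    # Hypothetical expensive step, e.g. reading and transforming Zarr data.
    return {"idx": idx, "values": [idx * 10, idx * 10 + 1]}

def cached_batch(idx):
    path = os.path.join(CACHE_DIR, f"batch_{idx}.pkl")
    if os.path.exists(path):
        # Cache hit: load straight from disk, skipping the expensive step.
        with open(path, "rb") as f:
            return pickle.load(f)
    # Cache miss: compute the batch, then persist it for next time.
    batch = produce_batch(idx)
    with open(path, "wb") as f:
        pickle.dump(batch, f)
    return batch

first = cached_batch(0)   # computed and written to disk
second = cached_batch(0)  # read back from disk
```

The second call never touches the expensive pipeline, which is where the speed-up comes from.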
Context
We still have an issue with the loading speed of Zarr-based data, and caching could help in a few ways. One would be to generate a dataset to initially train models on, and then switch to training from the full pipeline for fine-tuning, even though it is slower. Another would be to mix loading examples off disk with loading them from the raw data.
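The mixing option above can also be sketched in plain Python. Everything here is illustrative rather than part of any library (load_from_disk, load_from_raw, and the disk_fraction knob are hypothetical names): a seeded random draw decides, per example, whether to take the fast cached path or exercise the full raw-data pipeline.

```python
import random

# Hypothetical stand-ins for the two loading paths.
def load_from_disk(idx):
    return ("disk", idx)

def load_from_raw(idx):
    return ("raw", idx)

def mixed_loader(indices, disk_fraction=0.5, seed=0):
    # For each example, draw once to choose the fast (disk) or
    # slow (raw pipeline) path; disk_fraction tunes the balance.
    rng = random.Random(seed)
    for idx in indices:
        if rng.random() < disk_fraction:
            yield load_from_disk(idx)
        else:
            yield load_from_raw(idx)

batches = list(mixed_loader(range(8), disk_fraction=0.5))
```

Lowering disk_fraction shifts more of the load back onto the raw pipeline, which could be useful once the model moves to fine-tuning on fresh data.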
Possible Implementation