pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
BSD 3-Clause "New" or "Revised" License

Add cache invalidation to OnDiskCacheHolder #1111

Open bilelomrani1 opened 1 year ago

bilelomrani1 commented 1 year ago

🚀 The feature

Add the ability to automatically invalidate a cached sub-graph when the remote files change after being cached locally.

Motivation, pitch

Say I have multiple files stored in remote object storage. These files are fed into a datapipe using FSSpecFileLister and cached locally using .on_disk_cache. I want the cache to be invalidated and the datapipe re-computed when one or more of the remote files change, probably detected by comparing their hashes.

Alternatives

No response

Additional context

This feature request originated from this conversation on the PyTorch forum.

NivekT commented 1 year ago

Note that I believe you can currently use extra_check_fn within .on_disk_cache to re-compute the hash and flag any difference, but that will not automatically delete or re-download the files.
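
To illustrate the workaround above, here is a minimal sketch of a hash-based check function. It assumes that extra_check_fn is called with the local cached file path and should return True when the cached copy is still valid; that calling convention should be verified against the torchdata version in use. The names file_sha256, make_check_fn and expected_hashes are hypothetical helpers introduced for illustration, not part of torchdata, and the temporary file only stands in for a cached file.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_check_fn(expected_hashes):
    """Build a check function from a {local_path: expected_hex_digest} mapping.

    `expected_hashes` is a hypothetical mapping that the caller must build,
    for example from checksums published alongside the remote files.
    """
    def check_fn(filepath):
        expected = expected_hashes.get(filepath)
        if expected is None:
            return False  # no known hash for this file: treat the cache as stale
        return file_sha256(filepath) == expected

    return check_fn


# Small demonstration with a temporary file standing in for a cached file.
with tempfile.TemporaryDirectory() as tmp:
    cached = os.path.join(tmp, "part-0.bin")
    with open(cached, "wb") as f:
        f.write(b"version 1")

    # Case 1: the remote file is unchanged, so the expected hash matches.
    check_fn = make_check_fn({cached: hashlib.sha256(b"version 1").hexdigest()})
    fresh = check_fn(cached)

    # Case 2: the remote file changed, so the expected hash no longer matches.
    check_fn = make_check_fn({cached: hashlib.sha256(b"version 2").hexdigest()})
    stale = check_fn(cached)
```

A function built this way could then be passed as extra_check_fn=check_fn in the .on_disk_cache call. As noted above, a mismatch is only flagged; removing or re-downloading the stale file would still need to be handled separately.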