I think it also makes sense to revisit caching and storing objects in a follow-up. The image and audio objects are heavy and the approach in this PR is suboptimal:
- storing them as bytes in the warehouse takes up a lot of storage and may be slow to read/write
- this goes against the `datachain` philosophy of leaving heavy objects in place and reading directly from them as needed
- this also goes against the HF `datasets` philosophy of saving to and reading from memory-mapped arrow files
There are caches available in both datachain and datasets, so we should consider how to use these effectively as an alternative to storing the bytes in the warehouse (for example, we could load these objects directly from cached files if we know the file, row, and column to access the object).
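
To make the parenthetical idea concrete, here is a minimal sketch of storing a small (file, row, column) reference instead of the object's bytes and resolving it lazily. Everything named here (`CachedObjectRef`, `load_object`, `read_arrow_cell`, `read_jsonl_cell`) is hypothetical and not part of datachain or datasets. The Arrow reader assumes the cached file is in Arrow IPC stream format and that `pyarrow` is installed; the JSON Lines reader is only a stand-in so the sketch runs without extra dependencies.

```python
from __future__ import annotations

import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable


@dataclass(frozen=True)
class CachedObjectRef:
    """Hypothetical pointer stored in the warehouse in place of raw bytes."""

    file: str    # path to the cached file that already holds the object
    row: int     # zero-based row index within that file
    column: str  # name of the column holding the object


def read_arrow_cell(ref: CachedObjectRef) -> Any:
    """Read one cell from a memory-mapped Arrow file.

    Assumption: the cache file uses the Arrow IPC stream format.
    """
    import pyarrow as pa  # imported lazily so the sketch loads without pyarrow

    with pa.memory_map(ref.file, "r") as source:
        table = pa.ipc.open_stream(source).read_all()
        return table.column(ref.column)[ref.row].as_py()


def read_jsonl_cell(ref: CachedObjectRef) -> Any:
    """Stand-in reader for a JSON Lines file, used only for the demo."""
    with open(ref.file, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i == ref.row:
                return json.loads(line)[ref.column]
    raise IndexError(f"row {ref.row} not found in {ref.file}")


def load_object(
    ref: CachedObjectRef,
    reader: Callable[[CachedObjectRef], Any] = read_arrow_cell,
) -> Any:
    """Resolve a reference to the object it points at, on demand."""
    return reader(ref)


def demo() -> Any:
    """Write a tiny cache file, then load a single cell through a reference."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "cache.jsonl"
        path.write_text('{"image": "row0"}\n{"image": "row1"}\n', encoding="utf-8")
        ref = CachedObjectRef(file=str(path), row=1, column="image")
        return load_object(ref, reader=read_jsonl_cell)
```

With this shape, the warehouse row would hold only the three small fields of the reference, and the heavy image or audio payload would stay in the cache file until something actually asks for it. Whether the existing caches expose stable enough file paths for this to work is an open question for the follow-up.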
Originally posted by @dberenbaum in https://github.com/iterative/datachain/issues/311#issuecomment-2313106909