Handle huggingface images/audio objects as files

dberenbaum commented 2 months ago

I think it also makes sense to revisit caching and storing objects in a follow-up. The image and audio objects are heavy and the approach in this PR is suboptimal:

- Storing them as bytes in the warehouse takes up a lot of storage and may be slow to read and write.
- This goes against the datachain philosophy of leaving heavy objects in place and reading directly from them as needed.
- This also goes against the hf datasets philosophy of saving to and reading from memory-mapped Arrow files.

There are caches available in both datachain and datasets, so we should consider how to use these effectively as an alternative to storing the bytes in the warehouse (for example, we could load these objects directly from cached files if we know the file, row, and column to access the object).

Originally posted by @dberenbaum in https://github.com/iterative/datachain/issues/311#issuecomment-2313106909

dberenbaum commented 2 months ago

Part of #236

dberenbaum commented 1 month ago

See #396

Handle huggingface images/audio objects as files #369