AntreasAntoniou opened this issue 1 year ago
Hi!

You want to use `push_to_hub` (creates Parquet files) instead of `save_to_disk` (creates Arrow files) when creating a Hub dataset. Parquet is designed for long-term storage and takes up less space than the Arrow format, and, most importantly, `load_dataset` can parse it, which should fix the viewer.
Regarding the dataset generation, `Dataset.from_generator` with the video data represented as `datasets.Value("binary")` followed by `push_to_hub` should work (if the `push_to_hub` step times out, restart it to resume uploading).
PS: Once the dataset is uploaded, to make working with the dataset easier, it's a good idea to add a transform to the README that shows how to decode the binary video data into something a model can understand. Also, if you get an `ArrowInvalid` error (this can happen when working with large binary data) in `Dataset.from_generator`, reduce the value of `writer_batch_size` (the default is 1000) to fix it.
One issue here is that `Dataset.from_generator` can work well for the non-'infinite sampling' version of the dataset. The training set, for example, is often sampled dynamically from the video files that I have uploaded. I worry that storing the video data as binary means that I'll end up duplicating a lot of the data. Furthermore, storing the video data as anything but .mp4 would quickly balloon the dataset from 1.9 TB to 1 PB.
> storing video data as anything but .mp4
What I mean by storing as `datasets.Value("binary")` is embedding raw MP4 bytes in the Arrow table, but, indeed, this would waste a lot of space if there are duplicates.
So I see two options: … (a `map` in the README that samples the video data to "unpack" the samples).

Also, I misread MP4 as MP3. We need to add a `Video` feature to the `datasets` lib to support MP4 files in the viewer (a bit trickier to implement than the `Image` feature due to the Arrow limitations).
I'm transferring this issue to the `datasets` repo, as it's not related to `huggingface_hub`.
@mariosasko Right. If I want my dataset to be streamable, what are the necessary requirements to achieve that in the context of .mp4 binaries like we have here? I guess your second point would not support that, right?
The streaming would work, but the video paths would require using `fsspec.open` to get the content.
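A small sketch of the `fsspec.open` mechanics, demonstrated on a local file since fetching from the Hub needs network access; the dataset name and the `video_path` column in the commented part are hypothetical.

```python
# Sketch of fsspec.open mechanics on a local file; for a Hub-hosted
# video you would pass its resolved URL instead of a local path.
import os
import tempfile

import fsspec

# Stand-in for a video file referenced by path in a streamed dataset
tmp = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False)
tmp.write(b"fake mp4 bytes")
tmp.close()

with fsspec.open(tmp.name, "rb") as f:
    content = f.read()

os.unlink(tmp.name)

# With streaming=True, each example would hold a path/URL rather than bytes
# (dataset name and column name below are hypothetical):
# ds = load_dataset("user/my_video_dataset", streaming=True, split="train")
# for ex in ds:
#     with fsspec.open(ex["video_path"], "rb") as f:
#         video_bytes = f.read()
```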
Are there any plans to make video playable on the hub?
Not yet. The (open-source) tooling for video is not great in terms of ease of use/performance, so we are discussing internally the best way to support it (one option is creating a new library for video IO, but this would require a lot of work).
True. I spent a good four months just mixing and matching existing solutions so I could get performance that would not leave my model training IO-bound.
This is what I ended up with, in case it's useful.
**Is your feature request related to a problem? Please describe.**
I recently chose to use the Hugging Face Hub as the home for a large multimodal dataset I've been building: https://huggingface.co/datasets/Antreas/TALI
It combines images, text, audio, and video. Now, I could very easily upload a dataset made via `datasets.Dataset.from_generator`, as long as it did not include video files. I found that including .mp4 files in the entries would not auto-upload those files.
Hence, I tried to upload them myself. I quickly found out that uploading many small files is a very bad way to use Git LFS and would take ages, so I resorted to using 7z to pack them all up. But then I had a new problem.
My dataset had a size of 1.9 TB. Trying to upload such a large file with the default `huggingface_hub` API always resulted in timeouts, so I decided to split the large files into chunks of 5 GB each and re-upload them.
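The chunking step described above can be sketched with the standard library alone; the 4-byte chunk size in the demo is a stand-in for the real 5 GB (`5 * 1024**3`).

```python
# Stdlib-only sketch of splitting a large archive into fixed-size chunks.
import tempfile


def split_file(path: str, chunk_size: int) -> list[str]:
    """Split `path` into numbered parts of at most `chunk_size` bytes each."""
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            part_path = f"{path}.part{index:04d}"
            with open(part_path, "wb") as dst:
                dst.write(chunk)
            parts.append(part_path)
            index += 1
    return parts


# Demo with a 10-byte file and a 4-byte chunk size (stand-in for 5 GB)
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"0123456789")
tmp.close()

parts = split_file(tmp.name, chunk_size=4)
```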
So, eventually it all worked out. But now the dataset can't be properly and natively used by the `datasets` API because of all the needed preprocessing, and furthermore the Hub is unable to visualize things.
**Describe the solution you'd like**
A native way to upload large datasets that include .mp4 or other video types.
**Describe alternatives you've considered**
Already explained earlier.
**Additional context**
https://huggingface.co/datasets/Antreas/TALI