huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
18.8k stars 2.6k forks

A way to upload and visualize .mp4 files (millions of them) as part of a dataset #5888

Open AntreasAntoniou opened 1 year ago

AntreasAntoniou commented 1 year ago

Is your feature request related to a problem? Please describe. I recently chose to use the Hugging Face Hub as the home for a large multimodal dataset I've been building. https://huggingface.co/datasets/Antreas/TALI

It combines images, text, audio and video. Now, I could very easily upload a dataset made via datasets.Dataset.from_generator, as long as it did not include video files. I found that including .mp4 files in the entries would not auto-upload those files.

Hence I tried to upload them myself. I quickly found out that uploading many small files is a very bad way to use Git LFS and would take ages, so I resorted to using 7z to pack them all up. But then I had a new problem.

My dataset had a size of 1.9TB. Trying to upload such a large file with the default huggingface_hub API always resulted in timeouts, so I decided to split the large files into chunks of 5GB each and re-upload.

So, eventually it all worked out. But now the dataset can't be properly and natively used by the datasets API because of all the needed preprocessing -- and furthermore the hub is unable to visualize things.

Describe the solution you'd like A native way to upload large datasets that include .mp4 or other video types.

Describe alternatives you've considered Already explained earlier

Additional context https://huggingface.co/datasets/Antreas/TALI

mariosasko commented 1 year ago

Hi!

You want to use push_to_hub (creates Parquet files) instead of save_to_disk (creates Arrow files) when creating a Hub dataset. Parquet is designed for long-term storage and takes less space than the Arrow format, and, most importantly, load_dataset can parse it, which should fix the viewer.

Regarding the dataset generation, Dataset.from_generator with the video data represented as datasets.Value("binary") followed by push_to_hub should work (if the push_to_hub step times out, restart it to resume uploading)

PS: Once the dataset is uploaded, to make working with the dataset easier, it's a good idea to add a transform to the README that shows how to decode the binary video data into something a model can understand. Also, if you get an ArrowInvalid error (can happen when working with large binary data) in Dataset.from_generator, reduce the value of writer_batch_size (the default is 1000) to fix it.

AntreasAntoniou commented 1 year ago

One issue here is that Dataset.from_generator works well for the non-'infinite sampling' version of the dataset. The training set, for example, is often sampled dynamically from the video files that I have uploaded. I worry that storing the video data as binary means that I'll end up duplicating a lot of the data. Furthermore, storing video data as anything but .mp4 would quickly grow the dataset from 1.9TB to 1PB.

mariosasko commented 1 year ago

storing video data as anything but .mp4

What I mean by storing as datasets.Value("binary") is embedding raw MP4 bytes in the Arrow table, but, indeed, this would waste a lot of space if there are duplicates.

So I see two options:

Also, I misread MP4 as MP3. We need to add a Video feature to the datasets lib to support MP4 files in the viewer (a bit trickier to implement than the Image feature due to the Arrow limitations).

mariosasko commented 1 year ago

I'm transferring this issue to the datasets repo, as it's not related to huggingface_hub.

AntreasAntoniou commented 1 year ago

@mariosasko Right. If I want my dataset to be streamable, what are the necessary requirements to achieve that within the context of .mp4 binaries like we have here? I guess your second point here would not support that, right?

mariosasko commented 1 year ago

The streaming would work, but the video paths would require using fsspec.open to get the content.
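A hedged sketch of that access pattern; the local file below stands in for a remote URL, and the `video_path` column name (plus the commented-out streaming loop) is a hypothetical, not a confirmed schema for this dataset:

```python
import os
import tempfile

import fsspec

# Local stand-in for a remote video URL that a streamed dataset row would hold.
tmpdir = tempfile.mkdtemp()
local_path = os.path.join(tmpdir, "clip.mp4")
with open(local_path, "wb") as f:
    f.write(b"fake-mp4-bytes")

# fsspec.open handles local paths, https://, s3://, hf:// and more uniformly.
with fsspec.open(local_path, "rb") as f:
    video_bytes = f.read()

# Against a streamed Hub dataset this would look roughly like
# ("video_path" is a hypothetical column name):
#   ds = load_dataset("Antreas/TALI", split="train", streaming=True)
#   for example in ds:
#       with fsspec.open(example["video_path"], "rb") as f:
#           video_bytes = f.read()
print(len(video_bytes))
```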

AntreasAntoniou commented 1 year ago

Are there any plans to make video playable on the hub?

mariosasko commented 1 year ago

Not yet. The (open source) tooling for video is not great in terms of ease of use/performance, so we are discussing internally the best way to support it (one option is creating a new library for video IO, but this will require a lot of work).

AntreasAntoniou commented 1 year ago

True. I spent a good 4 months just mixing and matching existing solutions so I could get performance that would not leave my model training IO-bound.

This is what I ended up with, in case it's useful

https://github.com/AntreasAntoniou/TALI/blob/045cf9e5aa75b1bf2c6d5351fb910fa10e3ff32c/tali/data/data_plus.py#L85