huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

new dataset type: single-label and multi-label video classification #5032

Open fcakyon opened 2 years ago

fcakyon commented 2 years ago

Is your feature request related to a problem? Please describe.

In my research, I am dealing with multi-modal (audio + text + frame sequence) video classification. It would be great if the datasets library supported generating multi-modal batches from a video dataset.

Describe the solution you'd like

Assume I have video files with one or more labels each, and I want to train a single-label or multi-label video classification model. I would like datasets to support generating multi-modal batches (audio + frame sequence) from video files. The audio waveform and frame sequence can be extracted from each video clip; I can then use any audio, image, or video model from the transformers library to extract features, which will be fed into my model.
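To make the request a bit more concrete, here is a minimal pure-Python sketch of two pieces such a pipeline would likely need: turning per-video label lists into multi-hot targets, and picking evenly spaced frame indices from a clip. The function names `multi_hot` and `uniform_frame_indices` and the example class names are hypothetical and only for illustration; they are not part of datasets or pytorchvideo.

```python
from typing import List, Sequence


def multi_hot(labels: Sequence[str], class_names: Sequence[str]) -> List[int]:
    """Encode the labels of one video as a 0/1 vector over class_names.

    A single-label video is simply the special case with one active entry.
    """
    index = {name: i for i, name in enumerate(class_names)}
    target = [0] * len(class_names)
    for label in labels:
        target[index[label]] = 1
    return target


def uniform_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick num_samples frame indices spread evenly over a clip.

    The clip is split into num_samples equal segments and the frame at the
    center of each segment is selected (integer arithmetic only).
    """
    if num_frames <= 0 or num_samples <= 0:
        return []
    return [((2 * i + 1) * num_frames) // (2 * num_samples) for i in range(num_samples)]


# Hypothetical class names, used only for this example.
classes = ["cooking", "dance", "music"]
print(multi_hot(["dance", "music"], classes))  # [0, 1, 1]
print(uniform_frame_indices(10, 5))            # [1, 3, 5, 7, 9]
```

Center-of-segment sampling is just one possible choice here; random sampling within each segment is a common alternative during training.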

Describe alternatives you've considered

Currently, I am using the https://github.com/facebookresearch/pytorchvideo dataloaders. There do not seem to be many alternatives.

Additional context

I am willing to open a PR but don't know where to start.

lhoestq commented 2 years ago

Hi! You can see in the features folder how we implemented the audio and image feature types.

We can have something similar for videos. What we need to decide:

also cc @nateraw who also took a look at what we can do for video

sayakpaul commented 2 years ago

@lhoestq @nateraw is there any progress on adding video classification datasets?

lhoestq commented 2 years ago

Hi! I think we're just missing a decision on which lib we're going to use to decode the videos, plus which parameters must go in the Video type.

sayakpaul commented 2 years ago

Hmm. decord could be nice but it's no longer maintained it seems.

fcakyon commented 2 years ago

pytorchvideo uses pyav as the default decoder: https://github.com/facebookresearch/pytorchvideo/blob/c8d23d8b7e597586a9e2d18f6ed31ad8aa379a7a/pytorchvideo/data/labeled_video_dataset.py#L37

It would also be great if audio could optionally be decoded from the video, as in pytorchvideo: https://github.com/facebookresearch/pytorchvideo/blob/c8d23d8b7e597586a9e2d18f6ed31ad8aa379a7a/pytorchvideo/data/labeled_video_dataset.py#L35

Here are the other decoders supported in pytorchvideo: https://github.com/facebookresearch/pytorchvideo/blob/c8d23d8b7e597586a9e2d18f6ed31ad8aa379a7a/pytorchvideo/data/encoded_video.py#L17
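For illustration, a rough sketch of decoding frames and, optionally, audio with PyAV directly could look like the following. This is not how pytorchvideo or datasets implement it; `decode_video` and its parameters are hypothetical names, and the sketch assumes the file has a video stream at index 0 and reads everything into memory, which would not scale to long videos.

```python
from typing import Any, Dict, Optional


def decode_video(
    path: str,
    decode_audio: bool = True,
    max_frames: Optional[int] = None,
) -> Dict[str, Any]:
    """Decode RGB frames and (optionally) the audio waveform of one video file."""
    # Imported lazily so that defining the function does not require PyAV or NumPy.
    import av
    import numpy as np

    frames = []
    with av.open(path) as container:
        for frame in container.decode(video=0):
            frames.append(frame.to_ndarray(format="rgb24"))  # (H, W, 3) uint8
            if max_frames is not None and len(frames) >= max_frames:
                break

    audio, sample_rate = None, None
    if decode_audio:
        # Re-open the file so audio decoding starts from the beginning.
        with av.open(path) as container:
            if container.streams.audio:
                sample_rate = container.streams.audio[0].codec_context.sample_rate
                chunks = [f.to_ndarray() for f in container.decode(audio=0)]
                if chunks:
                    # Each chunk is 2-D with samples along axis 1.
                    audio = np.concatenate(chunks, axis=1)

    return {
        "video": np.stack(frames) if frames else None,  # (T, H, W, 3)
        "audio": audio,
        "sample_rate": sample_rate,
    }
```

A call such as `decode_video("clip.mp4", max_frames=16)` (with a hypothetical local file) would return a dict whose "video" entry stacks up to 16 frames, and whose "audio" entry is None when the file has no audio stream.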

nateraw commented 2 years ago

@sayakpaul I did do quite a bit of work on this PR a while back to add a video feature. It's outdated, but uses my encoded_video package under the hood, which is basically a wrapper around PyAV stolen from pytorchvideo that gets rid of the torch dependency.

It would be really great to get something like this in... it's just a really tricky and time-consuming feature to add.