Open fcakyon opened 2 years ago
Hi! You can see in the `features` folder how we implemented the audio and image feature types. We can have something similar for videos. What we need to decide: what the `Video()` feature type needs. Also cc @nateraw, who took a look at what we can do for video.
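To make the discussion concrete, here is a minimal sketch of what a `Video` feature type could look like, loosely modeled on the existing `Audio`/`Image` features. All field names here are hypothetical illustrations of the parameters being discussed, not the actual `datasets` API:

```python
from dataclasses import dataclass
from typing import ClassVar, Optional

# Hypothetical sketch of a Video feature type, loosely modeled on the
# Audio/Image features in datasets. Field names are illustrative only.
@dataclass
class Video:
    decode: bool = True            # decode to frames, or keep raw bytes/path
    fps: Optional[float] = None    # optionally resample to a fixed frame rate
    with_audio: bool = False       # optionally also decode the audio track
    id: Optional[str] = None
    # datasets features declare a storage dtype and a type name
    dtype: ClassVar[str] = "dict"
    _type: ClassVar[str] = "Video"

# Example: a feature that decodes clips at 25 fps along with their audio
video_feature = Video(fps=25.0, with_audio=True)
```

The main design question is exactly which of these knobs (decoding backend, frame rate, audio) belong on the feature itself versus on the decoding call.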
@lhoestq @nateraw is there any progress on adding video classification datasets?
Hi! I think we're just missing which lib we're going to use to decode the videos, plus which parameters must go in the `Video` type.
pytorchvideo uses pyav as the default decoder: https://github.com/facebookresearch/pytorchvideo/blob/c8d23d8b7e597586a9e2d18f6ed31ad8aa379a7a/pytorchvideo/data/labeled_video_dataset.py#L37
Also, it would be great if, optionally, the audio could be decoded from the video as well, as in pytorchvideo: https://github.com/facebookresearch/pytorchvideo/blob/c8d23d8b7e597586a9e2d18f6ed31ad8aa379a7a/pytorchvideo/data/labeled_video_dataset.py#L35
Here are the other decoders supported in pytorchvideo: https://github.com/facebookresearch/pytorchvideo/blob/c8d23d8b7e597586a9e2d18f6ed31ad8aa379a7a/pytorchvideo/data/encoded_video.py#L17
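For reference, a decoding step along these lines might look like the sketch below: a pure helper that picks which frame indices to keep, plus a PyAV read. PyAV is treated as an assumed optional dependency (as in pytorchvideo), and the function names are hypothetical:

```python
# Sketch of frame decoding with PyAV (an assumed optional dependency,
# mirroring pytorchvideo's default decoder). Names are illustrative.
try:
    import av  # PyAV
except ImportError:
    av = None

def sample_frame_indices(total_frames: int, num_samples: int) -> list:
    """Uniformly pick `num_samples` frame indices out of `total_frames`."""
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

def decode_frames(path: str, num_samples: int = 8):
    """Decode a video file into a list of HxWx3 RGB numpy arrays."""
    if av is None:
        raise ImportError("PyAV is required: pip install av")
    with av.open(path) as container:
        stream = container.streams.video[0]
        keep = set(sample_frame_indices(stream.frames, num_samples))
        return [
            frame.to_ndarray(format="rgb24")
            for i, frame in enumerate(container.decode(stream))
            if i in keep
        ]
```

Whether sampling parameters like `num_samples` live on the `Video` type or on the decode call is exactly the kind of decision mentioned above.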
@sayakpaul I did quite a bit of work on this PR a while back to add a video feature. It's outdated, but it uses my encoded_video package under the hood, which is basically a wrapper around PyAV stolen from pytorchvideo that gets rid of the torch dependency.
It would be really great to get something like this in... it's just a really tricky and time-consuming feature to add.
Is your feature request related to a problem? Please describe. In my research, I am dealing with multi-modal (audio+text+frame sequence) video classification. It would be great if the datasets library supported generating multi-modal batches from a video dataset.
Describe the solution you'd like Assume I have video files having single/multiple labels. I want to train a single/multi-label video classification model. I want datasets to support generating multi-modal batches (audio+frame sequence) from video files. Audio waveform and frame sequence can be extracted from each video clip then I can use any audio, image and video model from transformers library to extract features which will be fed into my model.
Describe alternatives you've considered Currently, I am using the https://github.com/facebookresearch/pytorchvideo dataloaders. There don't seem to be many alternatives.
Additional context I am willing to open a PR but don't know where to start.
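As a rough starting point for the multi-modal batching described in this request, here is a hedged sketch of collating already-decoded examples, each holding a frame sequence and an audio waveform as numpy arrays, into a single batch. All names are hypothetical, and it assumes clips of equal length with variable-length audio:

```python
import numpy as np

# Hypothetical collate step for multi-modal video examples: each example
# holds a (num_frames, H, W, 3) frame array and a 1-D audio waveform.
def collate_multimodal(examples):
    """Stack frames and zero-pad waveforms to the longest one in the batch."""
    # assumes every clip was sampled to the same number of frames
    frames = np.stack([ex["frames"] for ex in examples])
    max_len = max(len(ex["audio"]) for ex in examples)
    audio = np.zeros((len(examples), max_len), dtype=np.float32)
    for i, ex in enumerate(examples):
        audio[i, : len(ex["audio"])] = ex["audio"]
    labels = np.array([ex["label"] for ex in examples])
    return {"frames": frames, "audio": audio, "label": labels}
```

The resulting `frames` and `audio` arrays can then be fed to any image/video and audio model from transformers to extract features for the downstream classifier.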