huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Add video feature #5225

Open nateraw opened 1 year ago

nateraw commented 1 year ago

Feature request

Add a Video feature to the library so folks can include videos in their datasets.

Motivation

Being able to load Video data would be quite helpful. However, there are some challenges when it comes to videos:

  1. Videos, unlike images, can end up being extremely large files
  2. Often, when training video models, you need to do very specific sampling; videos may need to be broken down into X clips used for training/inference
  3. Videos often carry an additional audio stream, which must be accounted for
  4. The feature needs to be able to encode/decode videos (with the right video settings) from bytes.
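
On point 2, the kind of clip sampling that's usually needed can be sketched in a few lines. This is purely illustrative (the function name and parameters are hypothetical, not proposed API): pick uniformly spaced, fixed-length clips from a video's frames.

```python
def sample_clip_indices(num_frames, num_clips, frames_per_clip):
    """Return frame indices for `num_clips` uniformly spaced clips.

    Each clip is `frames_per_clip` consecutive frames; clip starts are
    spread evenly across the video. A real Video feature would likely
    also handle fps resampling and audio alignment.
    """
    if num_frames < frames_per_clip:
        raise ValueError("video too short for requested clip length")
    max_start = num_frames - frames_per_clip
    if num_clips == 1:
        starts = [max_start // 2]  # single clip: take it from the center
    else:
        step = max_start / (num_clips - 1)
        starts = [round(i * step) for i in range(num_clips)]
    return [list(range(s, s + frames_per_clip)) for s in starts]
```

e.g. `sample_clip_indices(100, 3, 16)` gives clips starting at frames 0, 42, and 84.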

Your contribution

I did work on this a while back in this (now closed) PR. It used a library I made called encoded_video, which is basically the utils from pytorchvideo, but without the torch dep. It included the ability to read/write from bytes, as we need to do here. We don't want to be using a sketchy library that I made as a dependency in this repo, though.

Would love to use this issue as a place to:

CC @sayakpaul @mariosasko @fcakyon

mariosasko commented 1 year ago

@NielsRogge @rwightman may have additional requirements regarding this feature.

When adding a new (decodable) type, the hardest part is choosing the right decoding library. What I mean by "right" here is that it has all the features we need and is easy to install (with GPU support?).

Some candidates/options:

And the last resort is building our own library, which is the most flexible solution but also requires the most work.

PS: I'm adding a link to an article that compares various video decoding libraries: https://towardsdatascience.com/lightning-fast-video-reading-in-python-c1438771c4e6

nateraw commented 1 year ago

@mariosasko is GPU decoding a hard requirement here? Do we really need it? (I don't know)

Something to consider with decord is that it doesn't (AFAIK) support writing videos, so you'd still need something else for that. Also, I've noticed issues with decord's ability to decode stereo audio streams alongside the video (which you don't run into with PyAV).


I think PyAV should be able to do the job just fine to start. If we write the video io utilities as their own functions, we can hot swap them later if we find/write a different solution that's faster/better.
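
The hot-swap idea could look roughly like this: keep decoding behind a small module-level indirection so the backend (PyAV first, but potentially decord or ffmpeg later) can be replaced without touching callers. All names below are a hypothetical sketch, not actual `datasets` API:

```python
import io

# Registry of decoding backends; keeping the import inside the backend
# function lets the decoding library stay an optional dependency.
_VIDEO_BACKENDS = {}

def register_video_backend(name, reader):
    """Register a callable mapping raw bytes -> decoded video object."""
    _VIDEO_BACKENDS[name] = reader

def read_video_from_bytes(data, backend="pyav"):
    """Decode a video from raw bytes using the selected backend."""
    try:
        reader = _VIDEO_BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown video backend: {backend!r}")
    return reader(data)

def _pyav_reader(data):
    import av  # deferred import keeps PyAV optional at install time
    return av.open(io.BytesIO(data))

register_video_backend("pyav", _pyav_reader)
```

Swapping in a faster decoder later would then be a one-line `register_video_backend` call rather than a refactor.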

rwightman commented 1 year ago

Video is still a bit of a mess, but I'd say PyAV is likely the best approach (or supporting all three via pytorchvideo, but that adds a middleman dependency).

Being able to decode on the GPU, into memory that could be passed off to a Tensor in whatever framework is being used, would be the dream; I don't think there is any interop of that nature working right now. The number of decoder instances per GPU is limited, so it's not clear whether balancing load between GPU decoders and CPUs would be needed in, say, large-scale video training.

Any of these solutions is less than ideal due to the nature of video: with a simple Python interface (video / start -> end) you end up with lots of extra memory use, since you need to decode the whole range of the clip into a buffer before using anything. Any scalable video system would be streaming on the fly, issuing frames via callbacks as soon as the stream is far enough along to have re-ordered the frames and synced audio + video + other metadata (sensors, CC, etc.).
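
The buffering-vs-streaming distinction can be made concrete with a generator: instead of decoding an entire start -> end range into a buffer, frames are yielded as soon as they're decoded, so the consumer only ever holds one frame at a time. A toy sketch (the iterator here is a stand-in for a real demux/decode loop such as PyAV's `container.decode(video=0)`):

```python
def stream_frames(decode_iter, start, end):
    """Lazily yield (index, frame) pairs for frames in [start, end).

    `decode_iter` is any iterator producing decoded frames in order.
    Frames outside the range are skipped without being buffered, and
    decoding stops early once the range is exhausted.
    """
    for i, frame in enumerate(decode_iter):
        if i >= end:
            return  # stop pulling from the decoder instead of draining the file
        if i >= start:
            yield i, frame
```

A real implementation would also seek to the nearest keyframe before `start` rather than decoding from frame 0, but the memory profile is the point here.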

sayakpaul commented 1 year ago

For standalone usage, decoding on the GPU could be ideal, but isn't asynchronously processing inputs on CPUs while keeping the accelerator busy with training the de facto approach? Of course, I am aware of other advanced mechanisms such as CPU offloading, but I think my point is conveyed.

nateraw commented 1 year ago

Here's a minimal implementation of the helper functions we'd need from PyAV, much of which I borrowed from pytorchvideo, stripping out the torch-specific stuff:

Open In Colab

It's not too much code...@mariosasko we could probably just maintain these helper fns within the datasets library, right?

nateraw commented 1 year ago

Also wanted to note that I added a PR for video classification in transformers here, which uses decord. It's still open... should we make a decision now to align the libraries we are using between datasets and transformers? (CC @Narsil)

https://github.com/huggingface/transformers/pull/20151

Narsil commented 1 year ago

Fully agree on at least trying to unite things.

Making clear function boundaries to help us change dependency if needed seems like a good idea since there doesn't seem to be a clear winner.

I also happen to like calling ffmpeg directly. For some reason it was a lot faster than PyAV.
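
For reference, "calling ffmpeg directly" usually means piping raw frames over stdout and slicing the byte stream into fixed-size frames. A sketch of just the command construction (function name and parameters are illustrative; it only builds the argv and doesn't assume ffmpeg is installed):

```python
def build_ffmpeg_decode_cmd(path, width, height, start=None, duration=None):
    """Build an ffmpeg argv that writes raw RGB24 frames to stdout.

    Each frame on the pipe is then width * height * 3 bytes, so the
    caller can carve the stream into arrays without re-parsing anything.
    """
    cmd = ["ffmpeg", "-nostdin", "-loglevel", "error"]
    if start is not None:
        cmd += ["-ss", str(start)]  # seeking before -i uses fast keyframe seek
    cmd += ["-i", path]
    if duration is not None:
        cmd += ["-t", str(duration)]
    cmd += [
        "-f", "rawvideo",        # no container, just concatenated frames
        "-pix_fmt", "rgb24",
        "-s", f"{width}x{height}",
        "pipe:1",                # write to stdout
    ]
    return cmd
```

The argv would then be handed to `subprocess.Popen(..., stdout=subprocess.PIPE)`; the speed difference likely comes from ffmpeg doing the demux/decode/scale pipeline in one native process.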