nateraw opened this issue 1 year ago
@NielsRogge @rwightman may have additional requirements regarding this feature.
When adding a new (decodable) type, the hardest part is choosing the right decoding library. What I mean by "right" here is that it has all the features we need and is easy to install (with GPU support?).
Some candidates/options:
- `decord`: no longer maintained, not trivial to install with GPU support
- `pyAV`: used for CPU decoding in `torchvision`; GPU decoding not supported if I'm not mistaken, otherwise probably the best candidate
- `video_reader`: used for GPU decoding in `torchvision`; depends on `torch`
- `ffmpeg`: used directly for video decoding under the hood

And the last resort is building our own library, which is the most flexible solution but also requires the most work.
PS: I'm adding a link to an article that compares various video decoding libraries: https://towardsdatascience.com/lightning-fast-video-reading-in-python-c1438771c4e6
@mariosasko is GPU decoding a hard requirement here? Do we really need it? (I don't know)
Something to consider with `decord` is that it doesn't (AFAIK) support writing videos, so you'd still need something else for that. Also, I've noticed issues with decord's ability to decode stereo audio streams alongside the video (which you don't run into with PyAV).
I think PyAV should be able to do the job just fine to start. If we write the video IO utilities as their own functions, we can hot-swap them later if we find/write a different solution that's faster/better.
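A minimal sketch of what such hot-swappable IO boundaries could look like (all names here are hypothetical, not actual `datasets` API):

```python
from typing import Callable, Dict, List

# Hypothetical registry of decoding backends; this only illustrates the
# "own functions" boundary, it is not real datasets code.
_DECODERS: Dict[str, Callable[[str], List]] = {}

def register_decoder(name: str, fn: Callable[[str], List]) -> None:
    """Register a backend, e.g. a PyAV- or ffmpeg-based decoder."""
    _DECODERS[name] = fn

def read_video(path: str, backend: str = "pyav") -> List:
    """Single entry point: swapping the decoder later only touches
    the registry, not every call site."""
    if backend not in _DECODERS:
        raise ValueError(f"unknown video backend: {backend!r}")
    return _DECODERS[backend](path)
```

The point of the indirection is that call sites never name a concrete library, so replacing PyAV with a GPU decoder later is a one-module change.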
Video is still a bit of a mess, but I'd say pyAV is likely the best approach (or supporting all three via pytorchvideo, but that adds a middle-man dependency).
Being able to decode on the GPU, into memory that could be passed off to a Tensor in whatever framework is being used, would be the dream; I don't think there is any interop of that nature working right now. The number of decoder instances per GPU is limited, so it's not clear whether balancing load between GPU decoders and CPUs would be needed in, say, large-scale video training.
Any of these solutions is less than ideal due to the nature of video: exposing a simple Python interface (video / start -> end) results in lots of extra memory use, because you need to decode the whole range of the clip into a buffer before using anything. Any scalable video system would be streaming on the fly, issuing frames via callbacks as soon as the stream is far enough along to have re-ordered the frames and synced audio + video + other metadata (sensors, CC, etc.).
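To make the streaming point concrete, here is a schematic, decoder-agnostic sketch (the function and its `reorder_depth` parameter are illustrative, not any library's API) of emitting frames in presentation order with only a small reorder buffer instead of buffering the whole clip:

```python
import heapq
from typing import Any, Iterable, Iterator, Tuple

def stream_in_pts_order(decoded: Iterable[Tuple[int, Any]],
                        reorder_depth: int = 4) -> Iterator[Tuple[int, Any]]:
    """Decoders emit frames in decode order; yield them in presentation
    (pts) order as soon as it is safe, holding at most reorder_depth
    frames in memory. Schematic only -- a real pipeline would also sync
    audio and other metadata streams."""
    buf: list = []
    for pts, frame in decoded:
        heapq.heappush(buf, (pts, frame))
        if len(buf) > reorder_depth:
            yield heapq.heappop(buf)  # smallest pts is now safe to emit
    while buf:  # flush remaining frames at end of stream
        yield heapq.heappop(buf)
```

Consumers see frames as soon as the reorder window allows, which is the callback-style behavior described above, rather than a decode-everything-then-return interface.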
For standalone usage, decoding on the GPU could be ideal, but isn't async processing of inputs on CPUs, while keeping the accelerator busy with training, the de facto approach? Of course, I am aware of other advanced mechanisms such as CPU offloading, but I think my point is conveyed.
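That async pattern is essentially a prefetch queue; a toy, decoder-agnostic sketch (names are hypothetical) of what framework data loaders do with background workers:

```python
import queue
import threading
from typing import Any, Callable, Iterable, Iterator

def prefetch(items: Iterable[Any], work: Callable[[Any], Any],
             depth: int = 4) -> Iterator[Any]:
    """Run `work` (e.g. CPU video decoding) in a background thread,
    keeping up to `depth` results ready so the consumer (e.g. a GPU
    training step) never waits on decoding. Toy single-worker version."""
    q: "queue.Queue" = queue.Queue(maxsize=depth)
    _END = object()  # sentinel marking end of the input stream

    def producer() -> None:
        for item in items:
            q.put(work(item))  # blocks when `depth` results are pending
        q.put(_END)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        out = q.get()
        if out is _END:
            return
        yield out
```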
Also wanted to note I added a PR for video classification in transformers here, which uses decord. It's still open... should we make a decision now to align the libraries we are using between datasets and transformers? (CC @Narsil)
Fully agree on at least trying to unite things.
Making clear function boundaries to help us change dependency if needed seems like a good idea since there doesn't seem to be a clear winner.
I also happen to like directly calling ffmpeg. For some reason it was a lot faster than pyAV.
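As a rough illustration of the direct-ffmpeg route (assuming the `ffmpeg` binary is on PATH; the helpers are hypothetical, and a real reader would probe width/height with ffprobe rather than take them as arguments):

```python
import subprocess
from typing import List

def ffmpeg_rawvideo_cmd(path: str) -> List[str]:
    # Standard ffmpeg flags: decode to raw RGB24 frames on stdout.
    return ["ffmpeg", "-i", path, "-f", "rawvideo",
            "-pix_fmt", "rgb24", "pipe:1"]

def read_frames(path: str, width: int, height: int) -> List[bytes]:
    """Split ffmpeg's raw stdout into fixed-size frame buffers.
    Requires the ffmpeg binary to be installed."""
    out = subprocess.run(ffmpeg_rawvideo_cmd(path),
                         capture_output=True, check=True).stdout
    frame_size = width * height * 3  # 3 bytes per RGB24 pixel
    return [out[i:i + frame_size]
            for i in range(0, len(out) - frame_size + 1, frame_size)]
```

Piping raw frames avoids Python-level binding overhead, which may be where the speed difference against pyAV comes from.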
Feature request

Add a `Video` feature to the library so folks can include videos in their datasets.

Motivation

Being able to load video data would be quite helpful. However, there are some challenges when it comes to videos:
Your contribution

I did work on this a while back in this (now closed) PR. It used a library I made called encoded_video, which is basically the utils from pytorchvideo, but without the `torch` dep. It included the ability to read/write from bytes, as we need to do here. We don't want to be using a sketchy library that I made as a dependency in this repo, though.

Would love to use this issue as a place to:
CC @sayakpaul @mariosasko @fcakyon
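On the read/write-from-bytes requirement mentioned above: PyAV can operate directly on an in-memory file object, so something along these lines could serve as a starting point (a sketch assuming `av` is installed; the helper name and its `max_frames` cap are illustrative):

```python
import io
from typing import List

def decode_frames_from_bytes(data: bytes, max_frames: int = 16) -> List:
    """Decode up to max_frames RGB frames from raw video bytes with
    PyAV (`pip install av`); returns HxWx3 uint8 arrays."""
    import av  # lazy import so the module loads without PyAV installed
    with av.open(io.BytesIO(data)) as container:
        frames = []
        for frame in container.decode(video=0):
            frames.append(frame.to_ndarray(format="rgb24"))
            if len(frames) >= max_frames:
                break
    return frames
```

Because the input is a `BytesIO` rather than a path, this works with bytes stored in Arrow columns without touching disk.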