[Open] stefanwayon opened this issue 3 years ago
Hi,
Thanks for bringing up this issue!
Our current thinking is that most (if not all?) filters in ffmpeg can be implemented with basic Python / PyTorch / torchvision operators without much loss of speed. As such, there would be limited benefit in packaging the filter logic from ffmpeg in PyTorch, since we would not get GPU / gradient support out of the box.
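As a rough illustration of that claim (this is a hypothetical helper, not an existing torchvision API): the core of an ffmpeg-style `fps` filter reduces to index arithmetic over the decoded frames, which composes naturally with tensor indexing and therefore keeps GPU and autograd support for free.

```python
def fps_filter_indices(n_in, fps_in, fps_out):
    """Map each output frame to a source frame index by timestamp,
    dropping or duplicating frames as needed (mimicking ffmpeg's fps
    filter in plain Python)."""
    n_out = round(n_in / fps_in * fps_out)
    return [min(n_in - 1, int(i * fps_in / fps_out)) for i in range(n_out)]

# 30 fps -> 15 fps on 10 decoded frames: keep every other frame.
print(fps_filter_indices(10, 30, 15))  # [0, 2, 4, 6, 8]
```

Applying the result is then just `frames[indices]` on a `(T, C, H, W)` tensor, whether it lives on CPU or GPU.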
Your points about the speed of resizing are valid though, and I believe this relates to a current limitation of the torchvision resize transform: it converts input tensors to fp32, even if the input is uint8. This means there is a significant cost to performing resize on single frames compared to alternative implementations that work directly on uint8.
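To illustrate the round trip (a sketch using `torch.nn.functional.interpolate` directly rather than the transform; the exact internals of the transform may differ): bilinear interpolation is not implemented for uint8 tensors, so the frames must be converted to fp32, which roughly quadruples the bytes moved, and quantized back afterwards.

```python
import torch
import torch.nn.functional as F

# A batch of 16 uint8 video frames, (N, C, H, W).
frames = torch.randint(0, 256, (16, 3, 480, 854), dtype=torch.uint8)

# Bilinear resize needs floats: convert to fp32 (4x the memory traffic),
# interpolate, then round/clamp back down to uint8.
resized = (
    F.interpolate(frames.float(), size=(240, 427), mode="bilinear",
                  align_corners=False)
    .round().clamp(0, 255).to(torch.uint8)
)
print(resized.shape, resized.dtype)
```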
This is something which we plan to improve in the future, which would bring the speed of resizing frames in torchvision to be similar to ffmpeg / opencv.
For the change of framerate, the results you present are interesting, and I wasn't expecting such a large difference. Could you make the script that you used to obtain those results available so that we can have a look?
Thoughts?
Hi @fmassa,
Thanks for your reply! I put the relevant code into a Colab MWE. The numbers are different compared to the plots above, but they tell the same story.
I agree that the framerate results are a bit strange. The behaviour I would expect is more like the Torchvision curve: a constant time to decode a video clip plus a negligible overhead to duplicate/drop frames, so the fps scales linearly with the output frame rate. I’m not entirely sure why ffmpeg scales the way it does; I’ll take a look and see if I find anything.
Edit: I suppose for the ffmpeg case, dropping/duplicating frames does have some non-negligible overhead, as each frame has to be read over a pipe from the ffmpeg process, even if it is a duplicate. In light of that, the ffmpeg curve makes sense.
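For reference, the per-frame pipe cost can be seen in a minimal ffmpeg-subprocess reader (a sketch with a hypothetical helper name): the `fps` filter emits duplicated frames as full raw frames, so each one still has to cross the pipe.

```python
import subprocess

def iter_raw_frames(path, width, height, fps):
    """Stream raw RGB24 frames from an ffmpeg subprocess.  Frames
    duplicated by the fps filter are written to the pipe in full,
    so reading them is not free."""
    cmd = ["ffmpeg", "-i", path, "-vf", f"fps={fps}",
           "-f", "rawvideo", "-pix_fmt", "rgb24", "pipe:1"]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL)
    frame_bytes = width * height * 3  # rgb24: 3 bytes per pixel
    try:
        while True:
            buf = proc.stdout.read(frame_bytes)
            if len(buf) < frame_bytes:  # EOF or truncated tail
                break
            yield buf
    finally:
        proc.stdout.close()
        proc.wait()
```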
Thanks for the notebooks @SliMM !
I'm still a bit surprised that the 5s resampling example shows FFmpegVideoReader being much faster. Maybe what's going on is that, for that video length and Hz, ffmpeg can jump to keyframes and then decode only a few frames directly, for faster reading? If that's the case, this is not something we currently support in the torchvision video reader, but it's in the plans. @bjuncek, can you have a look to double-check?
Hi @SliMM - thanks a lot for the notebooks, and sorry for the late reply - I've been OOF for the last few days. I'll take a look at this first thing next week.
My initial thought is to test this out a bit further to see if they are doing something different from us, and whether resampling would be beneficial to implement in our low-level API.
Thanks again, Bruno
🚀 Feature
Add support for (basic) FFmpeg filters for faster video pre-processing. In particular, rescaling and changing the frame rate would be useful when feeding in-the-wild videos through a trained model.
Motivation
I am working on a video loader to feed video frames to a model trained on the Kinetics 400 dataset and obtain predictions. The model is trained at a fixed resolution, on videos with a frame rate of 15fps. To support making predictions on videos from various sources, I at least need to resample them at the correct resolution and frame rate.
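For context, the ffmpeg-python side of such a loader can be sketched like this (`load_clip` is a hypothetical helper; `scale` and `fps` are the real ffmpeg filter names):

```python
import ffmpeg  # pip install ffmpeg-python
import numpy as np

def load_clip(path, width, height, fps=15.0):
    """Decode a video, resampling frame rate and resolution inside
    ffmpeg, and return the frames as a (T, H, W, 3) uint8 array."""
    out, _ = (
        ffmpeg
        .input(path)
        .filter("fps", fps=fps)
        .filter("scale", width, height)
        .output("pipe:", format="rawvideo", pix_fmt="rgb24")
        .run(capture_stdout=True, quiet=True)
    )
    return np.frombuffer(out, np.uint8).reshape(-1, height, width, 3)
```

All of the resampling happens inside the ffmpeg process here, operating on uint8 the whole way.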
The current public API only supports decoding video frames and trimming, but no other pre-processing, so I need to do any additional pre-processing in Python/PyTorch. Such an approach is visibly slower than an implementation based on ffmpeg-python, a wrapper around the command-line ffmpeg. For some stats, see Additional context.

Pitch
I would like to start a conversation on how best to bring such functionality to Torchvision. I imagine changing the resolution/fps is a common requirement for making predictions on videos, so I can see it being a useful feature of video I/O. Looking at the C++ code, there is already some support for requesting video frames at a certain resolution [1][2], but this functionality is only exposed in torch.ops.video_reader.read_video_from_file, not the public API. I can’t find anything similar for requesting a certain frame rate.

Is this something that you would want to add to torchvision.io.read_video? What about torchvision.io.VideoReader? More generally, is there a plan to add support for all FFmpeg filters in the future? What would that interface look like?

Additional context
I’ve done some initial comparisons between torchvision.io.VideoReader + changing the frame rate in Python + torch rescaling on batches of 16 frames, versus an ffmpeg-python pipeline with scale and fps filters, on a 854x480@30fps MP4 input video of ~261s. I’ve included the results below.

Decoding the first seconds of a clip (output fps=15, output size=input size):
Decoding 1s of video for given start time (output fps=15, output size=input size):
Changing the framerate for the first 1s of video (output size=input size):
Changing the framerate for the first 5s of video (output size=input size):
Rescaling the first 1s of video (output fps=15):
Rescaling the first 1s of video with the bilinear-fast FFmpeg algorithm (output fps=15):

Rescaling the first 5s of video (output fps=15):
cc @bjuncek