bjuncek opened this issue 4 years ago
Adding the feature request tracker:

**torchvision.io**
- #2778: `next` function as a true iterator
- `anyFrame = false` seek implementation (issue: #3014)

Hi, @bjuncek
Regarding the `seek` function: in the case that I set `any_frame=False` and call it several times to get a sequence of frames, if the target frames are between the same pair of key frames, is it true that the same key frame is being repeatedly decoded?
BTW, any plans for supporting compressed visual features without decoding?
Hi @bryandeng,
At the moment, the `seek` function only implements "precise" seek; that is, `any_frame=False` doesn't exist for now.
Having said that, I believe you are correct. Specifically, the behaviour we're thinking of is the following:
Let's imagine there exists a pair of keyframes `k_1` and `k_2`, with `pts=1.2` and `pts=2.4`. Calling `video_object.seek(2)` will demux the packet that contains `k_1`, regardless of how many times you call it. You can get to `k_2` either a) by iterating past it, calling `next(video_object)` enough times, or b) by seeking past 2.4 with the `any_frame=False` option and then calling `next(video_object)`.
Also, please note that `seek` will never actually return a frame. It initialises the decoder at the appropriate position, so that the following iteration returns the frame we asked for.
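A minimal sketch of this behaviour, using the `VideoReader` API discussed in this thread (the file path and timestamps are placeholders):

```python
from torchvision.io import VideoReader

reader = VideoReader("video.mp4", "video")

# seek() positions the decoder; it does not return a frame itself.
# With precise seek, decoding internally restarts from the last
# keyframe (k_1 above), even if we call seek(2.0) repeatedly.
reader.seek(2.0)

# The *following* call to next() yields the frame at (or just after) pts=2.0.
frame = next(reader)
print(frame["pts"], frame["data"].shape)  # presentation timestamp + tensor
```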
Does that answer your first question?
> BTW, any plans for supporting compressed visual features without decoding?
Not at the moment unfortunately. I'm not ruling it out completely, but it's out of scope for the next one or two releases for sure. Do you think there would be much demand for it?
Best, Bruno
Hi @bjuncek ,
Sorry for the late reply. Thanks for your clear explanation.
Regarding my previous question 1: will a "higher level" multiple-frame read function based on this new video API be provided, which takes a list of frame indices or pts values as input and hides the details of key frames, seeking, and caching from the user? This resembles how the `decord` library is designed.
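For context, decord's index-based access looks roughly like this (a sketch of decord's public API; the path and indices are illustrative):

```python
from decord import VideoReader

vr = VideoReader("video.mp4")
# Fetch an arbitrary list of frames by index in one call;
# key frames, seeking, and caching are hidden from the user.
batch = vr.get_batch([0, 30, 60, 90])
print(batch.shape)  # (4, H, W, C)
```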
And for question 2: personally speaking, compressed visual features are among the top priorities after ordinary visual and acoustic features. And as far as I know, toolkits like MMAction2 are implementing them.
Our team at Tencent uses a home-made, FFmpeg-based video reading library. It supports CPU/GPU decoding and `any_frame=False`-style imprecise seeking. We are motivated to contribute code according to the community's API designs.
@bryandeng Hi,
will a "higher level" multiple-frame read function based on this new video API be provided
The idea is to replace the current implementation of `torchvision.io.read_video`, which takes a `start_pts` / `end_pts` and returns the frames between the two timestamps, with one that uses `VideoReader`.
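A rough, hypothetical sketch of how such a helper could be expressed on top of `VideoReader` (video stream only, no audio handling; `read_video_clip` is an illustrative name, not the actual replacement):

```python
import torch
from torchvision.io import VideoReader

def read_video_clip(path, start_pts=0.0, end_pts=float("inf")):
    # Seek once, then iterate until we pass end_pts.
    reader = VideoReader(path, "video")
    reader.seek(start_pts)
    frames = []
    for frame in reader:
        if frame["pts"] > end_pts:
            break
        frames.append(frame["data"])
    # Assumes at least one frame falls inside [start_pts, end_pts].
    return torch.stack(frames)  # (T, C, H, W) tensor
```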
> which takes a list of frame indices
There is unfortunately no generic and reliable way of figuring out the number of frames in an arbitrary video container. So the approach taken by Decord, which proposes a `__getitem__`-based API, can miss / repeat frames for some codecs.
This means that in order to reliably provide such functionality we would first need to decode the whole video (or get an estimate of the pts for each frame, which might not always be possible), and that would be very slow. I would love to be proven wrong though :-)
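To illustrate the cost, here is a hypothetical way to build a reliable frame index by decoding the whole video once to record every pts (`build_pts_index` is an illustrative helper, not a proposed API):

```python
from torchvision.io import VideoReader

def build_pts_index(path):
    # One full decoding pass just to learn each frame's timestamp.
    return [frame["pts"] for frame in VideoReader(path, "video")]

pts_index = build_pts_index("video.mp4")
# pts_index[i] maps frame index i -> timestamp, enabling precise seeks,
# but the indexing pass alone costs as much as decoding the entire video.
```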
> And for question 2, personally speaking compressed visual features are among the top priorities after ordinary visual and acoustic features. And as far as I know, toolkits like MMAction2 are implementing them
This is currently not in our roadmap, but we could consider implementing this in the future (for torchvision 0.10 or beyond most probably). Can you open a separate issue to discuss this functionality?
> Our team at Tencent uses a home-made video reading library which is also FFmpeg based. It supports CPU/GPU decoding and any_frame=False like imprecise seeking
@bjuncek is currently looking into implementing `any_frame=False` on the current setup, but only for CPU decoding.
GPU decoding for video is something we would love to explore, although I believe only a subset of formats currently supports GPU decoding.
> We are motivated to contribute code according to the community's API designs
That would be great! Can you first open a separate issue for GPU video decoding, so that we can discuss the potential formats that would be supported, etc.?
cc @takatosp1 @tullie for awareness
🚀 Feature
We're proposing to add a lower-level, more flexible, and equally robust API than the one currently existing in `torchvision`. It would be implemented in C++ and be compatible with torchscript. Following the merge of https://github.com/pytorch/vision/pull/2596, it would also be installable via pip or conda.

Motivation
Currently, our API supports returning a tensor of shape `(T, C, H, W)` via the `read_video` abstraction (see here). This can be prohibitive if a user wants to get a single frame or perform operations on a per-frame basis. For example, I've run into multiple issues where I'd want to return a single frame, iterate over frames, or (for example in the EPIC Kitchens dataset) reduce memory usage by transforming the elements before saving them to output tensors.

Pitch
We propose the following style of API. First, we'd have a constructor that would be part of torch's registered C++ classes and would take some basic inputs, as sketched below.
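Illustratively, the constructor could look as follows (a sketch only; the class name, registration path, and exact signature are not final):

```python
import torch
import torchvision  # importing torchvision registers the C++ video classes

# Basic inputs: a path and a default stream specifier.
video_object = torch.classes.torchvision.Video("video.mp4", "video")
```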
Returning a frame is as simple as calling `next` on the container [optionally, we can define the stream from which we'd like to return the next frame]. What a frame is will largely depend on the encoding of the video: for video streams it is almost always an RGB image, whilst for audio it might be a 1024-point sample. In most cases the same temporal timespan is covered by a variable number of frames (1s of a video might contain 30 video frames and 40 audio frames), so returning the presentation timestamp of each returned frame allows for more precise control of the resulting clip.

To get the exact frame that we want, a seek function can be exposed (with an optional stream definition). Seeking is done either to the closest keyframe before the requested timestamp, or to the exact frame if possible.
For example, if we seek to 5s into a video container, the following call to `next()` will return either: 1) the last keyframe before 5s in the video (if `any_frame=False`), 2a) the frame with pts=5.0 (if `any_frame=True` and a frame at 5s exists), or 2b) the first frame after 5s, e.g. with pts=5.03 (if `any_frame=True` and a frame at 5s doesn't exist).

We plan to expose metadata getters, and add additional functionality down the line.
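A sketch of the three outcomes above, assuming an `any_frame` keyword on `seek` (part of this proposal, not yet implemented; the final API may differ):

```python
from torchvision.io import VideoReader

reader = VideoReader("video.mp4", "video")

# Imprecise seek: the next frame is the last keyframe before 5s.
reader.seek(5.0, any_frame=False)
keyframe = next(reader)

# Precise seek: the next frame has pts=5.0, or is the first frame
# after 5s if no frame lands exactly on that timestamp.
reader.seek(5.0, any_frame=True)
frame = next(reader)
```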
Alternatives
In the end, every video decoding library is a tradeoff between speed and flexibility. Libraries that support batch decoding, such as decord, offer greater speed (due to multithreaded Loader objects and/or GPU decoding) at the expense of dataloader compatibility, robustness (in terms of supported formats), or flexibility. Other libraries that offer greater flexibility, such as pyav, opencv, or decord (in sequential reading mode), can sacrifice either speed or ease of use.
We're aiming for this API to be as close in flexibility to pyav as possible, with the same (or better) per-frame decoding speed, all while remaining torchscriptable.
Additional context
Whilst technically this would mean deprecating our current `read_video` API, during a transition period we would still support it through a simple function mimicking the current implementation of `read_video`, with minimal to no performance impact.

cc @bjuncek