pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

New video API Proposal #2660

Open · bjuncek opened 4 years ago

bjuncek commented 4 years ago

πŸš€ Feature

We're proposing to add a lower-level, more flexible API that is as robust as the one currently in torchvision. It would be implemented in C++ and compatible with torchscript. Following the merge of https://github.com/pytorch/vision/pull/2596, it would also be installable via pip or conda.

Motivation

Currently, our API supports returning a tensor of shape (T x C x H x W) via the read_video abstraction (see here). This can be prohibitive if a user wants to get a single frame or perform operations on a per-frame basis. For example, I've run into multiple issues where I wanted to return a single frame, iterate over frames, or (as in the EPIC Kitchens dataset) reduce memory usage by transforming the elements before saving them to output tensors.

Pitch

We propose the following style of API: first, a constructor that would be part of torch's registered C++ classes and would take some basic inputs.

import torch.classes.torchvision as tvcls
vid = tvcls.Video(path, "stream:stream_id")

Returning a frame is as simple as calling next on the container (optionally, we can specify the stream from which we'd like to return the next frame). What a frame is depends largely on the encoding of the video: for video streams it is almost always an RGB image, whilst for audio it might be a 1024-point sample. In most cases the same temporal span is covered by a variable number of frames (1s of video might contain 30 video frames and 40 audio frames), so returning the presentation timestamp alongside each frame allows for more precise control of the resulting clip.

frame, timestamp = vid.next(optional: "stream:stream_id")
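As a sketch of how this enables the per-frame workflow from the motivation section (assuming the API lands as proposed; transform and num_frames are placeholder arguments, and "video:0" is an assumed stream spec):

import torch
import torch.classes.torchvision as tvcls

def read_transformed_clip(path, transform, num_frames):
    vid = tvcls.Video(path, "video:0")
    frames = []
    for _ in range(num_frames):
        frame, pts = vid.next()          # decode exactly one frame and its timestamp
        frames.append(transform(frame))  # e.g. resize/crop each frame before stacking
    return torch.stack(frames)           # memory scales with the transformed frames, not the raw ones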

To get the exact frame that we want, a seek function can be exposed (with an optional stream definition). Seeking is done either to the closest keyframe before the requested timestamp, or to the exact frame if possible.

vid.seek(ts_in_seconds, any_frame=True)

For example, if we seek to 5s in a video container, the following call to next() will return either 1) the last keyframe before 5s (if any_frame=False), 2a) the frame with pts=5.0 (if any_frame=True and a frame at 5s exists), or 2b) the first frame after 5s, e.g. with pts=5.03 (if any_frame=True and a frame at exactly 5s doesn't exist).
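Concretely, both modes would behave as below (reusing the vid object from the constructor example; the exact pts values are illustrative):

vid.seek(5.0, any_frame=False)  # coarse seek: position at the last keyframe before 5s
frame, pts = vid.next()         # pts <= 5.0, and the frame is a keyframe

vid.seek(5.0, any_frame=True)   # precise seek
frame, pts = vid.next()         # pts == 5.0 if that frame exists, else e.g. pts == 5.03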

We plan to expose metadata getters, and add additional functionality down the line.

Alternatives

In the end, every video decoding library is a tradeoff between speed and flexibility. Libraries that support batch decoding, such as decord, offer greater speed (due to multithreaded Loader objects and/or GPU decoding) at the expense of dataloader compatibility, robustness (in terms of supported formats), or flexibility. Libraries that offer greater flexibility, such as pyav, opencv, or decord in sequential reading mode, sacrifice either speed or ease of use.

We're aiming for this API to be as close in flexibility to pyav as possible, with the same (or better) per-frame decoding speed, all while being torchscriptable.

Additional context

Whilst technically this would mean deprecating our current read_video API, during a transition period we would keep supporting it through a simple function that mimics the current read_video implementation, with minimal to no performance impact.
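To make that concrete, here is a minimal sketch of what such a compatibility shim could look like; the function name, the "video:0" stream spec, and the end-of-stream handling are assumptions, not the actual implementation:

import torch
import torch.classes.torchvision as tvcls

def read_video_compat(path, start_pts=0.0, end_pts=float("inf")):
    vid = tvcls.Video(path, "video:0")   # assumed spec for the first video stream
    vid.seek(start_pts, any_frame=True)
    frames = []
    while True:
        try:
            frame, pts = vid.next()
        except StopIteration:            # assumed end-of-stream signal
            break
        if pts > end_pts:
            break
        frames.append(frame)
    return torch.stack(frames)           # (T, C, H, W), as read_video returns today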

cc @bjuncek

bjuncek commented 4 years ago

Adding the feature request tracker:

bryandeng commented 4 years ago

Hi, @bjuncek

Regarding the seek function, in the case that I set any_frame=False and call it several times to get a sequence of frames, if the target frames are between the same pair of key frames, is it true that the same key frame is being repeatedly decoded?

BTW, any plans for supporting compressed visual features without decoding?

bjuncek commented 4 years ago

Hi @bryandeng,

At the moment, the seek function only implements "precise" seek; that is, any_frame=False doesn't exist yet.

Having said that, I believe you are correct. Specifically the behaviour we're thinking of is the following:

Let's imagine there exists a pair of keyframes k_1 and k_2 with pts=1.2 and pts=2.4. Calling video_object.seek(2) will demux the packet that contains k_1 regardless of how many times you call it. You can get to k_2 by either a) iterating past it by calling next(video_object) many times, or b) seeking past 2.4 with any_frame=False option and then calling next(video_object).

Also, please note that seek will never actually return a frame. It initialises the decoder at the appropriate position, so that the following call to next returns the frame we asked for.
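In code, the two ways of reaching k_2 described above would look roughly like this (any_frame=False is still hypothetical at this point, and the file path and "video:0" stream spec are placeholders):

import torch.classes.torchvision as tvcls

vid = tvcls.Video("clip.mp4", "video:0")

vid.seek(2.0, any_frame=False)  # always lands on k_1 (pts=1.2), however many times it's called
frame, pts = vid.next()         # pts == 1.2

vid.seek(2.5, any_frame=False)  # seeking past k_2's pts of 2.4 ...
frame, pts = vid.next()         # ... makes the next decoded frame k_2 (pts == 2.4)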

Does that answer your first question?

BTW, any plans for supporting compressed visual features without decoding?

Not at the moment unfortunately. I'm not ruling it out completely, but it's out of scope for the next one or two releases for sure. Do you think there would be much demand for it?

Best, Bruno

bryandeng commented 4 years ago

Hi @bjuncek ,

Sorry for the late reply. Thanks for your clear explanation.

Regarding my previous question 1, will a "higher level" multiple-frame read function based on this new video API be provided, one which takes a list of frame indices or pts values as input and hides the details of keyframes, seeking and caching from the user? This resembles how the decord library is designed.
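For illustration, a hypothetical shape for such a helper on top of the proposed API (read_frames_at is not a committed name, and a real version would need exactly the keyframe caching this question is about):

import torch
import torch.classes.torchvision as tvcls

def read_frames_at(path, pts_list):
    vid = tvcls.Video(path, "video:0")
    out = []
    for ts in sorted(pts_list):          # sort so we only ever seek forward
        vid.seek(ts, any_frame=True)     # precise seek hides keyframe handling from the caller
        frame, pts = vid.next()
        out.append(frame)
    return torch.stack(out)              # naive: re-decodes from the nearest keyframe on every seek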

And for question 2, personally speaking, compressed visual features are among the top priorities after ordinary visual and acoustic features. And as far as I know, toolkits like MMAction2 are implementing them.

Our team at Tencent uses an in-house video reading library which is also FFmpeg-based. It supports CPU/GPU decoding and any_frame=False-style imprecise seeking. We are motivated to contribute code according to the community's API designs.

fmassa commented 4 years ago

@bryandeng Hi,

will a "higher level" multiple-frame read function based on this new video API be provided

The idea is to replace the current implementation of torchvision.io.read_video, which takes a start_pts / end_pts and returns the frames between those two timestamps, with one that uses VideoReader.

which takes a list of frame indices

There is unfortunately no generic and reliable way of figuring out the number of frames in an arbitrary video container. So the approach taken by Decord, which proposes a __getitem__-based API, can miss / repeat frames for some codecs.

This means that in order to reliably provide such functionality we would first need to decode the whole video (or get an estimate of the pts for each frame, which might not always be possible), and that would be very slow. I would love to be proven wrong though :-)
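To spell out the slow fallback: mapping frame indices to pts reliably would mean scanning the whole container once, roughly as below (the function name and end-of-stream handling are assumptions for illustration):

import torch.classes.torchvision as tvcls

def build_pts_index(path):
    vid = tvcls.Video(path, "video:0")
    index = []
    while True:
        try:
            _, pts = vid.next()   # every frame has to be decoded (or at least demuxed)
        except StopIteration:     # assumed end-of-stream behaviour
            break
        index.append(pts)
    return index                  # index[i] -> pts of frame i; cost grows with video length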

And for question 2, personally speaking compressed visual features are among the top priorities after ordinary visual and acoustic features. And as far as I know, toolkits like MMAction2 are implementing them

This is currently not on our roadmap, but we could consider implementing this in the future (for torchvision 0.10 or beyond, most probably). Can you open a separate issue to discuss this functionality?

Our team at Tencent uses a home-made video reading library which is also FFmpeg based. It supports CPU/GPU decoding and any_frame=False like imprecise seeking

@bjuncek is currently looking into implementing any_frame=False on the current setup, but only for CPU decoding.

GPU decoding for video is something we would love to explore, although I believe that for now only a subset of formats support GPU decoding.

We are motivated to contribute code according to the community's API designs

That would be great! Could you first open a separate issue for GPU video decoding, so that we can discuss the potential formats that would be supported, etc.?

cc @takatosp1 @tullie for awareness