pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Make transforms work on video tensor inputs or batch of images #2583

Closed vfdev-5 closed 3 years ago

vfdev-5 commented 4 years ago

🚀 Feature

Following https://github.com/pytorch/vision/issues/2292#issuecomment-671325017 and discussion with @fmassa and @bjuncek , this feature request is to improve the transforms module to support video inputs as torch.Tensor of shape (C, T, H, W), where C - number of image channels (e.g. 3 for RGB), T - number of frames, H, W - image dimensions.
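A minimal sketch of the proposed input convention (the tensor names and sizes are illustrative, not from torchvision):

```python
import torch

# A video clip as a single tensor: C channels, T frames, H x W spatial size.
video = torch.rand(3, 16, 112, 112)  # (C, T, H, W)

# A batch of images uses the same trailing-dims convention: (N, C, H, W).
batch = torch.rand(8, 3, 112, 112)

# Both share the invariant that the last two dimensions are spatial,
# which is what would let one transform implementation handle both.
assert video.shape[-2:] == batch.shape[-2:] == (112, 112)
```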

Points to discuss:

joakimjohnander commented 4 years ago

The representation (T, C, H, W) is convenient when processing one frame at a time, such as in object tracking. I would therefore advocate its support.

fmassa commented 4 years ago

Convention for geometric transforms: the two last dimensions are H, W?

Yes, I think we should assume that the two last dimensions are H and W. In the future we will leverage memory_format to account for channels-last, but the shape will still be the same.
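A sketch of what this convention buys us, using a horizontal flip implemented with plain torch ops (a hypothetical helper, not the torchvision implementation):

```python
import torch

def hflip(x: torch.Tensor) -> torch.Tensor:
    """Horizontal flip that only touches the last (width) dimension,
    so it works unchanged for (C, H, W) images, (N, C, H, W) batches
    and (C, T, H, W) videos, as long as the last two dims are H, W."""
    return x.flip(-1)

image = torch.rand(3, 4, 5)      # (C, H, W)
video = torch.rand(3, 16, 4, 5)  # (C, T, H, W)
assert hflip(image).shape == image.shape
assert hflip(video).shape == video.shape
# Flipping twice is the identity.
assert torch.equal(hflip(hflip(video)), video)
```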

Convention for color transforms:

My first reaction would be that we should assume that the channels are the dim=-3. This way, we don't need to special-case for videos. And if the user needs their video in a different format (C, T, H, W for example) they can apply a permutation to it.

Another option (which I would let out for now) is to allow for a dim argument to the color transforms, so that the user can specify which dimension we should use for color. But I'm not necessarily in favor of this approach for now, as it might add more complexity to the user, and we should just assume a good default.
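To illustrate the dim=-3 convention, here is a hedged sketch of a grayscale conversion (the helper and the BT.601 luma weights are assumptions for illustration, not torchvision code). A (T, C, H, W) video works as-is; a (C, T, H, W) video must be permuted first, as suggested above:

```python
import torch

def rgb_to_grayscale(x: torch.Tensor) -> torch.Tensor:
    """Assumes channels live at dim=-3. Uses ITU-R BT.601 luma weights."""
    w = torch.tensor([0.299, 0.587, 0.114]).view(3, 1, 1)
    # Broadcast the weights over the channel dimension (-3) and reduce it.
    return (x * w).sum(dim=-3, keepdim=True)

image = torch.rand(3, 8, 8)           # (C, H, W): channels at dim=-3
video_tchw = torch.rand(16, 3, 8, 8)  # (T, C, H, W): channels at dim=-3, works as-is
video_cthw = torch.rand(3, 16, 8, 8)  # (C, T, H, W): needs a permute first

assert rgb_to_grayscale(image).shape == (1, 8, 8)
assert rgb_to_grayscale(video_tchw).shape == (16, 1, 8, 8)
assert rgb_to_grayscale(video_cthw.permute(1, 0, 2, 3)).shape == (16, 1, 8, 8)
```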

adding @takatosp1 @tullie for thoughts

bjuncek commented 4 years ago

My first reaction would be that we should assume that the channels are the dim=-3. This way, we don't need to special-case for videos. And if the user needs their video in a different format (C, T, H, W for example) they can apply a permutation to it.

The annoying part about this is that ToTensorVideo transforms the ffmpeg output (T, H, W, C) to (T, C, H, W), so we should apply some version of ToTensor in the video reading part, because otherwise we'd end up with a sandwich of permutations in the transforms.
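A sketch of the conversion being discussed, as described in this comment (the function is a stand-in, not the private ToTensorVideo itself): ffmpeg-style (T, H, W, C) uint8 frames become a (T, C, H, W) float tensor in [0, 1].

```python
import torch

def to_tensor_video(clip: torch.Tensor) -> torch.Tensor:
    """Convert an ffmpeg-style (T, H, W, C) uint8 clip to a
    (T, C, H, W) float tensor scaled to [0, 1]."""
    return clip.permute(0, 3, 1, 2).float() / 255.0

raw = torch.randint(0, 256, (16, 112, 112, 3), dtype=torch.uint8)
clip = to_tensor_video(raw)
assert clip.shape == (16, 3, 112, 112)
assert 0.0 <= clip.min() and clip.max() <= 1.0
```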

fmassa commented 4 years ago

Everything under _transforms_video.py is private, including ToTensorVideo, so I would not worry about breaking BC for this. We don't actually want to keep those functions around for long.

tullie commented 4 years ago

Your logic for (T, C, H, W) makes sense, particularly for transforms. The problem is that any video model with 3D convolutions is likely going to want T, H, W in the last 3 dims. I guess it depends on whether you're prioritizing the best format for writing transforms or for writing common video models.

haooooooqi commented 4 years ago

I think it depends on what "operations" follow the transforms. The advantage of (T, C, H, W) is that we could reuse the same "image" operations efficiently; the advantage of (C, T, H, W) is that we could do operations like 3D conv more efficiently. Currently I can see operations like spatial-temporal crop ((T, H, W, C) preferred), color norm ((C, T, H, W) preferred), color jitter ((C, T, H, W) preferred) and flip (does not matter), in physical order. From the efficiency perspective, we could probably use the order that provides the most efficient indexing. From the implementation perspective, I generally find it nicer if we don't need a permute at the end.
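The trade-off above can be sketched as a small hypothetical pipeline: per-frame ops are convenient in (T, C, H, W), while nn.Conv3d expects (N, C, T, H, W), so one final permute bridges the two (the sizes and the flip step are illustrative):

```python
import torch
import torch.nn as nn

clip = torch.rand(16, 3, 32, 32)        # (T, C, H, W), convenient for transforms
clip = clip.flip(-1)                    # e.g. a horizontal flip (layout-agnostic)

model_input = clip.permute(1, 0, 2, 3)  # -> (C, T, H, W), the single final permute
model_input = model_input.unsqueeze(0)  # -> (N, C, T, H, W) for 3D convolution

conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
out = conv3d(model_input)
assert out.shape == (1, 8, 16, 32, 32)
```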

fmassa commented 3 years ago

@vfdev-5 I believe this can be closed now?

vfdev-5 commented 3 years ago

@fmassa I thought we'd close it once we updated the video classification reference example, #2935. But as you wish; otherwise I can create another issue for that update.

fmassa commented 3 years ago

Ok, let's wait until we update the reference examples then before closing this one

vfdev-5 commented 3 years ago

Closing the issue as #2935 has been merged.