Closed: vfdev-5 closed this issue 3 years ago
The representation (T, C, H, W) is convenient when processing one frame at a time, such as in object tracking. I would therefore advocate its support.
Convention for geometric transforms: should the last two dimensions be H, W?
Yes, I think we should assume that the last two dimensions are H and W. In the future we will leverage memory_format to account for channels-last, but the shape will still be the same.
Convention for color transforms:
My first reaction would be that we should assume that the channels are at dim=-3. This way, we don't need to special-case videos. And if the user needs their video in a different format ((C, T, H, W), for example), they can apply a permutation to it.
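A minimal sketch of this convention, assuming PyTorch; the `grayscale` function here is a toy illustrative op, not an existing torchvision transform:

```python
import torch

img = torch.rand(3, 32, 32)      # image (C, H, W): channels at dim=-3
vid = torch.rand(8, 3, 32, 32)   # video (T, C, H, W): channels also at dim=-3

def grayscale(x):
    # Averages over the channel dimension; works unchanged for both
    # layouts because channels sit at dim=-3 in each.
    return x.mean(dim=-3, keepdim=True)

print(grayscale(img).shape)  # torch.Size([1, 32, 32])
print(grayscale(vid).shape)  # torch.Size([8, 1, 32, 32])

# A user holding (C, T, H, W) video can permute to (T, C, H, W) first:
vid_cthw = torch.rand(3, 8, 32, 32)
out = grayscale(vid_cthw.permute(1, 0, 2, 3))  # channels back at dim=-3
```

The same transform code then covers both images and videos without any video-specific branching.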
Another option (which I would leave out for now) is to allow a dim argument to the color transforms, so that the user can specify which dimension we should use for color. But I'm not necessarily in favor of this approach for now, as it might add complexity for the user, and we should just assume a good default.
adding @takatosp1 @tullie for thoughts
> My first reaction would be that we should assume that the channels are at dim=-3. This way, we don't need to special-case videos. And if the user needs their video in a different format ((C, T, H, W), for example), they can apply a permutation to it.
I feel like the annoying part about this is that toTensorVideo transforms the ffmpeg output (T, H, W, C) to (T, C, H, W), so we should apply some version of toTensor in the video-reading part, because otherwise we'd have a sandwich of permutations in the transforms.
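For reference, the conversion being discussed can be sketched like this (a hypothetical ToTensorVideo-style helper, not the actual private torchvision implementation):

```python
import torch

# uint8 ffmpeg-style frames (T, H, W, C) -> float (T, C, H, W) in [0, 1]
frames = torch.randint(0, 256, (8, 64, 64, 3), dtype=torch.uint8)  # (T, H, W, C)

video = frames.permute(0, 3, 1, 2).float() / 255.0  # (T, C, H, W)

print(video.shape)  # torch.Size([8, 3, 64, 64])
```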
Everything under _transforms_video.py is private, including ToTensorVideo, so I would not worry about breaking BC for this. We don't actually want to keep those functions around for long anyway.
Your logic for (T, C, H, W) makes sense, particularly for transforms. The problem is that any video model with 3D convolutions is likely going to want T, H, W in the last three dims. I guess it depends on whether you're prioritizing the best format for writing transforms or for writing common video models.
I think it depends on what the "operations" following the transforms are. The advantage of T C H W is that we could reuse the same "image" operations efficiently. The advantage of C T H W is that we could do operations like 3D conv more efficiently. Currently I can see operations like: spatial-temporal crop (THWC preferred), color norm (CTHW preferred), color jitter (CTHW preferred), and flip (does not matter), in physical order. From the efficiency perspective, we could probably use the order that provides the most efficient indexing. From the implementation perspective, I generally find it would be nice if we didn't need a permute at the end.
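As an illustration of the 3D-conv side of this trade-off: torch.nn.Conv3d takes channels-first input of shape (N, C, D, H, W), which for video means (N, C, T, H, W), so a pipeline producing (T, C, H, W) needs a permute before the model. A sketch:

```python
import torch
import torch.nn as nn

# nn.Conv3d expects (N, C, T, H, W): channels-first, T/H/W in the last dims.
conv = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

clip = torch.rand(3, 8, 64, 64)   # one (C, T, H, W) clip
out = conv(clip.unsqueeze(0))     # add batch dim -> (1, 3, 8, 64, 64)
print(out.shape)                  # torch.Size([1, 16, 8, 64, 64])

# A (T, C, H, W) clip must be permuted before feeding a 3D-conv model:
clip_tchw = torch.rand(8, 3, 64, 64)
out2 = conv(clip_tchw.permute(1, 0, 2, 3).unsqueeze(0))
print(out2.shape)                 # torch.Size([1, 16, 8, 64, 64])
```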
@vfdev-5 I believe this can be closed now?
@fmassa I thought to close it once we updated the video classification reference example, #2935. But as you wish; otherwise I can create another issue for that update.
Ok, let's wait until we update the reference examples then before closing this one
Closing the issue, as #2935 has been merged.
🚀 Feature
Following https://github.com/pytorch/vision/issues/2292#issuecomment-671325017 and discussion with @fmassa and @bjuncek, this feature request is to improve the transforms module to support video inputs as torch.Tensor of shape (C, T, H, W), where C is the number of image channels (e.g. 3 for RGB), T the number of frames, and H, W the image dimensions.

Points to discuss:
- (C, T, H, W)?
- (T, C, H, W)?
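For concreteness, moving between the two candidate layouts is a single permute (a sketch, assuming PyTorch), so whichever convention wins, user data in the other layout remains easy to adapt:

```python
import torch

C, T, H, W = 3, 16, 112, 112

vid_cthw = torch.rand(C, T, H, W)        # (C, T, H, W)
vid_tchw = vid_cthw.permute(1, 0, 2, 3)  # (T, C, H, W) view, no copy

print(vid_tchw.shape)            # torch.Size([16, 3, 112, 112])
print(vid_tchw.is_contiguous())  # False: call .contiguous() if a kernel needs it

# Permuting back recovers the original tensor exactly.
assert torch.equal(vid_tchw.permute(1, 0, 2, 3), vid_cthw)
```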