Support `CutMix` for video data

keras-team / keras-cv




Support `CutMix` for video data #2384

Closed. innat closed this issue 7 months ago.

innat commented 8 months ago

Short Description

Same as the 2D CutMix, but for 3D volumes: the requested feature is CutMix for video data.
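A minimal NumPy sketch of what a video CutMix could look like, assuming videos of shape (B, F, H, W, C) and one-hot labels of shape (B, num_classes). `video_cutmix` is a hypothetical helper, not a KerasCV API. It pairs each video with another whole video from the batch, samples one box per video, and pastes that same box on every frame:

```python
import numpy as np


def video_cutmix(videos, labels, alpha=1.0, rng=None):
    """Hypothetical video CutMix sketch (not the KerasCV implementation).

    videos: (B, F, H, W, C); labels: one-hot (B, num_classes).
    One box per video is reused on all F frames.
    """
    rng = np.random.default_rng() if rng is None else rng
    B, _, H, W, _ = videos.shape
    # Pair every video with another whole video from the batch.
    perm = rng.permutation(B)
    # Mixing ratio per video, sampled as in 2D CutMix.
    lam = rng.beta(alpha, alpha, size=B)

    mixed = videos.copy()
    area = np.empty(B)
    for i in range(B):
        cut_h = int(H * np.sqrt(1.0 - lam[i]))
        cut_w = int(W * np.sqrt(1.0 - lam[i]))
        cy, cx = rng.integers(H), rng.integers(W)
        y1, y2 = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, H)
        x1, x2 = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, W)
        # The same spatial box on all frames keeps the pasted patch temporally consistent.
        mixed[i, :, y1:y2, x1:x2, :] = videos[perm[i], :, y1:y2, x1:x2, :]
        area[i] = (y2 - y1) * (x2 - x1)
    # Recompute lambda from the actual (clipped) box area.
    lam_adj = 1.0 - area / (H * W)
    mixed_labels = lam_adj[:, None] * labels + (1.0 - lam_adj[:, None]) * labels[perm]
    return mixed, mixed_labels


rng = np.random.default_rng(0)
videos = rng.standard_normal((4, 8, 32, 32, 3)).astype(np.float32)
labels = np.eye(3, dtype=np.float32)[rng.integers(3, size=4)]
mixed, mixed_labels = video_cutmix(videos, labels, rng=rng)
print(mixed.shape, mixed_labels.shape)  # (4, 8, 32, 32, 3) (4, 3)
```

Sampling the box once per video (rather than once per frame) is the design choice that distinguishes this from running 2D CutMix on flattened frames.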

Papers

https://arxiv.org/abs/1905.04899
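For reference, the paper defines CutMix on two samples $(x_A, y_A)$ and $(x_B, y_B)$ with a binary mask $M$ that is 0 inside a sampled box and 1 elsewhere (the paper uses $\alpha = 1$, i.e. $\lambda \sim \mathrm{Unif}(0,1)$). The last line is an assumption about how a video variant could extend it: the same mask is applied to every frame $t$ of clips with $F$ frames.

```latex
\tilde{x} = M \odot x_A + (\mathbf{1} - M) \odot x_B, \qquad
\tilde{y} = \lambda\, y_A + (1 - \lambda)\, y_B, \qquad
\lambda \sim \mathrm{Beta}(\alpha, \alpha)

r_x \sim \mathrm{Unif}(0, W), \quad r_y \sim \mathrm{Unif}(0, H), \quad
r_w = W\sqrt{1 - \lambda}, \quad r_h = H\sqrt{1 - \lambda}

% Video variant (assumption): x_A, x_B \in \mathbb{R}^{F \times H \times W \times C},
% with one mask M \in \{0,1\}^{H \times W} shared by all frames:
\tilde{x}_t = M \odot x_{A,t} + (\mathbf{1} - M) \odot x_{B,t}, \qquad t = 1, \dots, F
```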

Existing Implementations

Other Information

tirthasheshpatel commented 8 months ago

This should theoretically be supported, since we allow any tensor of shape (..., H, W, C). I am guessing videos are just stacks of image frames with shape (B, NUM_FRAMES, H, W, C). I'm not sure how consistent this support is across all the preprocessing layers, but IMO it should be an easy fix even if it's broken.

tirthasheshpatel commented 8 months ago

I can see an argument for temporal consistency across frames when preprocessing, but that seems like too much of a stretch from what KerasCV is designed to do. If we can treat frames independently, it would be much easier to add and advertise support for video data.
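The difference between treating frames independently and keeping them temporally consistent can be sketched with box masks. `sample_box_masks` below is a hypothetical helper, and its box sizing (uniform height and width) is illustrative only, not CutMix's λ-based sizing:

```python
import numpy as np


def sample_box_masks(batch, frames, height, width, per_frame, rng):
    """Hypothetical helper: boolean masks of shape (batch, frames, height, width).

    per_frame=True: an independent box for every frame (frames as unrelated images).
    per_frame=False: one box per video, repeated on all frames (temporally consistent).
    """
    n_boxes = frames if per_frame else 1
    masks = np.zeros((batch, n_boxes, height, width), dtype=bool)
    for b in range(batch):
        for t in range(n_boxes):
            h = rng.integers(1, height + 1)
            w = rng.integers(1, width + 1)
            y = rng.integers(0, height - h + 1)
            x = rng.integers(0, width - w + 1)
            masks[b, t, y:y + h, x:x + w] = True
    # When per_frame is False, broadcast the single box to every frame.
    return np.broadcast_to(masks, (batch, frames, height, width))


rng = np.random.default_rng(0)
shared = sample_box_masks(2, 5, 16, 16, per_frame=False, rng=rng)
independent = sample_box_masks(2, 5, 16, 16, per_frame=True, rng=rng)
# Temporally consistent masks are identical along the frame axis.
print((shared == shared[:, :1]).all())  # True
```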

innat commented 8 months ago

The GIF above was generated after adjusting some of the computation in the image CutMix layer. If it's wanted, we can send a draft PR for evaluation.

tirthasheshpatel commented 8 months ago

I don't have a strong opinion. At first glance, though, I'd be against adding this, simply because you can always reshape the input tensors to make them work for videos:

```python
import numpy as np
from keras import ops
from keras_cv.layers import CutMix

# A batch of 2 videos, each with 5 frames of 256x256 RGB.
videos = np.random.standard_normal((2, 5, 256, 256, 3)).astype(np.float32)
labels = ((np.random.random((2, 5)) > 0.5) * 1.0).astype(np.float32)

# Fold the frame axis into the batch axis so every frame is treated as an image.
B, F, H, W, C = tuple(videos.shape)
images = ops.reshape(videos, (B * F, H, W, C))
labels = ops.reshape(labels, (B * F,))

augmented = CutMix()({"images": images, "labels": labels})
augmented = augmented["images"]

# CutMix keeps the image shape, so unfold back to (B, F, H, W, C).
augmented = ops.reshape(augmented, (B, F, H, W, C))

augmented  # augmented videos.
```

Does this work for your use case @innat?

innat commented 8 months ago

I tried the reshaping approach at first. The code example above has a few issues.

First, it is limited because it relies on num_frames == num_classes. Second, it adds complexity to the augmented labels. Third, after reshaping, the frames of video_a at a given timestep are cut-mixed with frames from many different video samples, which, IMO, breaks temporal consistency. Instead, cut-mixing video_a with video_b as whole clips makes more sense to me. (The same goes for MixUp.)
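The third point can be illustrated with index bookkeeping alone. This sketch assumes the layer pairs samples by randomly permuting the batch, as the CutMix reference implementation does. It compares which (video, frame) partners each video receives after flattening to B * F images versus pairing whole clips:

```python
import numpy as np

B, F = 4, 6
rng = np.random.default_rng(0)

# Reshape approach: B * F flattened frames, permuted as one batch,
# so every frame gets its own random partner frame.
flat_partner = rng.permutation(B * F)
partner_video, partner_frame = np.divmod(flat_partner.reshape(B, F), F)

# Clip-level approach: each video gets one partner video,
# and frame t is always mixed with frame t of that partner.
video_partner = rng.permutation(B)
clip_partner_video = np.repeat(video_partner[:, None], F, axis=1)
clip_partner_frame = np.tile(np.arange(F), (B, 1))

# Number of distinct partner videos each video is mixed with.
print([len(set(row)) for row in partner_video.tolist()])
print([len(set(row)) for row in clip_partner_video.tolist()])  # [1, 1, 1, 1]
```

With the flattened approach, a single video is typically mixed with several different videos and with frames from other timesteps; with clip-level pairing it always has exactly one partner, aligned frame by frame.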

tirthasheshpatel commented 8 months ago

Again, no strong opinion. If you have a diff, feel free to propose it. It would also be nice to first identify the layers where videos need to be treated differently, like CutMix and MixUp. If you have a list, that'd be really helpful.
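For MixUp, the clip-level variant discussed above can be sketched in the same way. `video_mixup` is a hypothetical helper, not a KerasCV API. It assumes λ ∼ Beta(α, α) as in the MixUp paper, with an illustrative default α = 0.2, and uses one λ per video shared by all of its frames:

```python
import numpy as np


def video_mixup(videos, labels, alpha=0.2, rng=None):
    """Hypothetical video MixUp sketch (not the KerasCV implementation).

    videos: (B, F, H, W, C); labels: one-hot (B, num_classes).
    Each video is blended with one partner video using a single weight.
    """
    rng = np.random.default_rng() if rng is None else rng
    batch = videos.shape[0]
    perm = rng.permutation(batch)
    lam = rng.beta(alpha, alpha, size=batch)
    # Broadcast one weight per video over frames, height, width and channels.
    lam_v = lam.reshape(batch, 1, 1, 1, 1)
    mixed = lam_v * videos + (1.0 - lam_v) * videos[perm]
    mixed_labels = lam[:, None] * labels + (1.0 - lam[:, None]) * labels[perm]
    return mixed.astype(videos.dtype), mixed_labels


rng = np.random.default_rng(0)
# Each toy video is constant (all values equal to its index), which makes the blend easy to inspect.
videos = np.stack([np.full((3, 4, 4, 1), float(i), dtype=np.float32) for i in range(4)])
labels = np.eye(4, dtype=np.float32)
mixed, mixed_labels = video_mixup(videos, labels, rng=rng)
```

Because the weight is shared across frames, each mixed clip stays a consistent blend of exactly two source clips.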

github-actions[bot] commented 8 months ago

This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 28 days. Please reopen if you'd like to work on this further.