Inference on a whole video (or batch of image pairs) efficiently?

hmorimitsu / ptlflow

PyTorch Lightning Optical Flow models, scripts, and pretrained weights.

Apache License 2.0

Inference on a whole video (or batch of image pairs) efficiently? #28

Closed zplizzi closed 2 years ago

zplizzi commented 2 years ago

I would like to process a whole video efficiently, i.e. without calling model.forward() once for every pair of images, and instead batch the pairs together. But I can't quite figure out how to do that with IOAdapter (which I would like to use to ensure, e.g., that I apply the correct padding). Is this possible? I tried formatting my video into a batch of image pairs with shape (batch, 2, h, w, c), but IOAdapter didn't seem to support this.

zplizzi commented 2 years ago

Ok, this feels a lil hacky, but works:

import torch

# `video` has shape [n_images, h, w, c] and must be a numpy array (a torch tensor does not work).
# prepare_inputs applies padding, etc.
inputs = io_adapter.prepare_inputs(video)
# inputs is a dict {'images': torch.Tensor}
# The tensor is 5D with shape BNCHW. In this case, it will have the shape:
# (1, n_images, 3, H, W)

# We want shape (n_images - 1, 2, C, H, W), where the 2 is each pair of consecutive frames
input_images = inputs["images"][0]  # drop the batch dim: (n_images, 3, H, W)
video1 = input_images[:-1]  # first frame of each pair
video2 = input_images[1:]  # second frame of each pair
input_images = torch.stack((video1, video2), dim=1)
inputs["images"] = input_images

predictions = model(inputs)

hmorimitsu commented 2 years ago

Yeah, I think this is the best you can do with the current functions. There is no batch support for the moment.

Adding batch preprocessing would be nice, but there are some tricky things I don't know how to handle for now. For example, if the sequence is too long, we may have to split it into several batches and be able to process all of them. It can be done, but I feel it would make the helper scripts a bit more complicated than they should be.
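The splitting described above could be sketched roughly as follows. This is only an illustration and not part of ptlflow: the helper name `iter_pair_chunks` and the parameter `max_pairs` are hypothetical. It yields frame index ranges so that each chunk holds at most `max_pairs` consecutive pairs, and neighbouring chunks share one frame so that no pair is lost at a chunk boundary.

```python
from typing import Iterator, Tuple


def iter_pair_chunks(n_frames: int, max_pairs: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) ranges; frames[start:stop] holds at most max_pairs consecutive pairs."""
    if max_pairs < 1:
        raise ValueError("max_pairs must be at least 1")
    n_pairs = max(n_frames - 1, 0)
    for first_pair in range(0, n_pairs, max_pairs):
        last_pair = min(first_pair + max_pairs, n_pairs)
        # Pair i uses frames i and i + 1, so the slice needs one extra frame.
        yield first_pair, last_pair + 1


# A 10-frame video has 9 pairs; with at most 4 pairs per chunk this gives 3 chunks.
chunks = list(iter_pair_chunks(n_frames=10, max_pairs=4))
print(chunks)
```

Each range could then be used to slice the video (for example `video[start:stop]`) before applying the pairing trick from the earlier comment, with the per-chunk predictions concatenated afterwards.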

zplizzi commented 2 years ago

Yeah, in an ideal world you'd have a method to automatically find the largest batch size that fits in memory (like they do in https://pytorch-lightning.readthedocs.io/en/latest/advanced/training_tricks.html#auto-scaling-of-batch-size), and chunk the video into that size for processing. But I agree that's probably overkill here. Probably just adding the above trick into the docs somewhere would be plenty.
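For illustration, a batch size search along these lines might look like the sketch below. It is a simple halving strategy, not the implementation used by PyTorch Lightning and not something ptlflow provides. The names `find_fitting_batch_size`, `try_batch`, and `fake_forward` are hypothetical, and the sketch assumes that an out-of-memory failure surfaces as a `RuntimeError`, which is how PyTorch reports CUDA out-of-memory errors.

```python
from typing import Callable


def find_fitting_batch_size(try_batch: Callable[[int], None], start: int = 64) -> int:
    """Halve the batch size until try_batch(batch_size) runs without a RuntimeError."""
    batch_size = start
    while batch_size >= 1:
        try:
            try_batch(batch_size)
            return batch_size
        except RuntimeError:
            # Assumption: the failure was an out-of-memory error.
            # Real code should inspect the exception before retrying.
            batch_size //= 2
    raise RuntimeError("even a batch size of 1 does not fit in memory")


def fake_forward(batch_size: int) -> None:
    # Stand-in for a model call; pretends that more than 10 pairs exhausts memory.
    if batch_size > 10:
        raise RuntimeError("out of memory (simulated)")


print(find_fitting_batch_size(fake_forward, start=64))
```

Because the search only halves, it returns a size that fits rather than the true maximum. In a real setting `try_batch` would run the model on a dummy batch of the given size.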

hmorimitsu commented 2 years ago

Added a small section with a link to this issue to the inference docs in #29.