talmolab / dreem

DREEM Relates Every Entities' Motion (DREEM). Global Tracking Transformers for biological multi-object tracking.
https://dreem.sleap.ai
BSD 3-Clause "New" or "Revised" License

Refactor video loader to use multiprocessing #95

Open shaikh58 opened 1 week ago

shaikh58 commented 1 week ago

Currently, we use the imageio video loader when loading a SleapDataset. Implementation here.

I propose refactoring to the decord library, a thin C++ wrapper (with a Python API) around hardware-accelerated decoders like FFmpeg. It is optimized for shuffling and random access to videos during training, supports 'cutting' a video by frame ids passed into the loader, and is integrated with PyTorch.

It is used by SAM2, and according to its GitHub README it achieves a 2x speedup over OpenCV on sequential reads, 14x on random seeks, and 6x on real-world training loads. It could be useful when training on large amounts of data, e.g. when pretraining on large-scale public datasets.

It's easy to implement:

VideoReader (direct frame access):

import decord  # needed for decord.bridge
from decord import VideoReader
from decord import cpu, gpu

# Decode on CPU; ctx=gpu(0) would use a hardware decoder instead.
vr = VideoReader('examples/flipping_a_pancake.mkv', ctx=cpu(0))
# Return frames as torch tensors rather than decord's native NDArray.
decord.bridge.set_bridge('torch')
frames = vr.get_batch([1, 3, 5, 7, 9])

or the more complete VideoLoader:

VideoLoader (from the GitHub README): "VideoLoader is designed for training deep learning models with tons of video files. It provides smart video shuffle techniques in order to provide high random access performance (we know that seeking in video is super slow and redundant). The optimizations live in the C++ code and are invisible to the user."

import decord  # needed for decord.bridge
from decord import VideoLoader
from decord import cpu, gpu

vl = VideoLoader(
    ['1.mp4', '2.avi', '3.mpeg'],
    ctx=[cpu(0)],
    shape=(2, 320, 240, 3),  # batch of 2 frames, resized to 320x240, 3 channels
    interval=1,
    skip=5,
    shuffle=1,
)
# Return batches as torch tensors rather than decord's native NDArray.
decord.bridge.set_bridge('torch')

print('Total batches:', len(vl))

for batch in vl:
    print(batch[0].shape)
aaprasad commented 6 days ago

We discussed this in #72 and @talmo noted that decord isn't much faster.

In testing decord, I also found the differences in video decoding performance to be pretty negligible, especially if you defer casting the uint8 frames to float32 until after you move the images to the GPU.

We use Lightning to automatically move things to the GPU, but we keep things as float32 throughout the data pipeline, so it could be useful to decode into uint8. Do we know whether Lightning automatically casts things to the correct dtype as well? Otherwise we need to include some logic in the model step to make sure things are converted to float32 first, or use torch.autocast.
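To make the deferred-cast idea concrete, here is a minimal sketch (not DREEM code) using a random uint8 tensor as a stand-in for a decoded batch of frames. The point is that transferring uint8 moves 4x less data over the bus than float32, and the cast plus normalization are cheap once the tensor is on-device; in Lightning this could live in a hook such as `on_after_batch_transfer`, though where exactly to wire it in is an open question here.

```python
import torch

# Stand-in for a decoded batch of frames; imageio (and decord) yield uint8.
frames_u8 = torch.randint(0, 256, (8, 320, 240, 3), dtype=torch.uint8)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Move the compact uint8 tensor first, then cast and normalize on-device.
frames_gpu = frames_u8.to(device, non_blocking=True)
frames_f32 = frames_gpu.float() / 255.0
```

If the model runs under torch.autocast instead, the explicit `.float()` could be dropped and the cast left to the autocast context.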

shaikh58 commented 5 days ago

Edit: renamed issue to make it more relevant based on recent discussions.

We can consider using PyTorch's multiprocessing support in the DataLoader (the default is a single process, num_workers=0). For a map-style dataset this seems easier and safer to implement than for an IterableDataset, and ours can be treated as map-style since we load frames by index. The main process generates the indices that worker processes load via the Sampler, so we can make the Sampler choose temporally consistent indices to hand to the workers. PyTorch recommends keeping all of this on the CPU to avoid CUDA/multiprocessing clashes.
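A rough sketch of that idea, with hypothetical names (`ClipSampler`, `FrameDataset` are illustrations, not DREEM classes): the sampler shuffles at the clip level but yields frame indices sequentially within each clip, so each worker decodes mostly sequential frames instead of doing random seeks.

```python
import torch
from torch.utils.data import DataLoader, Dataset, Sampler


class ClipSampler(Sampler):
    """Yield frame indices in temporally contiguous clips, shuffling only
    the order of the clips themselves."""

    def __init__(self, n_frames: int, clip_len: int):
        self.n_frames = n_frames
        self.clip_len = clip_len

    def __iter__(self):
        n_clips = self.n_frames // self.clip_len
        # Randomize clip order; frames inside each clip stay sequential.
        for clip in torch.randperm(n_clips).tolist():
            start = clip * self.clip_len
            yield from range(start, start + self.clip_len)

    def __len__(self):
        return (self.n_frames // self.clip_len) * self.clip_len


class FrameDataset(Dataset):
    """Stand-in map-style dataset: returns the frame index it was asked for.
    A real dataset would decode frame `idx` from the video here."""

    def __init__(self, n_frames: int):
        self.n_frames = n_frames

    def __len__(self):
        return self.n_frames

    def __getitem__(self, idx):
        return idx


ds = FrameDataset(n_frames=100)
loader = DataLoader(
    ds,
    batch_size=10,
    sampler=ClipSampler(n_frames=100, clip_len=10),
    num_workers=2,  # worker processes receive index batches from the main process
)
batches = list(loader)
```

With batch_size equal to clip_len, each batch a worker receives is one contiguous clip, which is the access pattern sequential decoders are fast at.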

As for data types, imageio's video reader decodes to uint8, and we convert to float32 here.