shaikh58 opened 1 week ago
We discussed this in #72 and @talmo noted that decord isn't much faster.
In testing decord, I also found the differences in video decoding performance to be pretty negligible, especially if you defer casting the uint8 frames to float32 until after you move the images to the GPU.
We use lightning to automatically move things to the GPU, but we keep things as float32 throughout the data pipeline, so it could be useful to decode into uint8. Do we know if lightning automatically casts things to the correct dtype as well? Otherwise, we need to include some logic in the model step to make sure things are converted to float32 first, or use torch.autocast.
Edit: renamed issue to make it more relevant based on recent discussions.
We can consider using PyTorch's multiprocessing as part of the DataLoader (the default is single-process, num_workers=0). For a map-style dataset this seems easier and safer to implement than for an IterableDataset, and ours can be treated as map-style since we load frames based on indices. The main process generates indices via the Sampler for worker processes to load, so we can make the Sampler choose temporally consistent indices to pass to the workers. PyTorch recommends keeping this all on the CPU to avoid CUDA/multiprocessing clashes.
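A sketch of what a temporally consistent Sampler could look like: shuffle the order of fixed-length clips rather than individual frames, so each worker decodes runs of nearby frames instead of doing random seeks. The class name and clip-chunking scheme are hypothetical; in practice this would subclass torch.utils.data.Sampler and be passed as DataLoader(sampler=...), but the index logic is identical and stays on the CPU.

```python
import random

class ClipSampler:
    """Yield dataset indices in temporally contiguous clips.

    Clips (not frames) are shuffled, so batches still vary between
    epochs while each worker reads mostly-sequential frames.
    """

    def __init__(self, num_frames: int, clip_len: int, seed: int = 0):
        self.num_frames = num_frames
        self.clip_len = clip_len
        self.seed = seed

    def __iter__(self):
        starts = list(range(0, self.num_frames, self.clip_len))
        random.Random(self.seed).shuffle(starts)  # shuffle clips, not frames
        for s in starts:
            # emit the clip's frame indices in temporal order
            yield from range(s, min(s + self.clip_len, self.num_frames))

    def __len__(self):
        return self.num_frames
```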
As for data types, imageio's video reader reads frames in as uint8, and we convert to float32 here
Currently, we use the imageio video loader when loading a SleapDataset. Implementation here.
Proposing to refactor to using the decord library, a thin C++ wrapper (with a Python API) around hardware-accelerated decoders like ffmpeg. It's optimized to provide shuffling and random access to videos during training, supports 'cutting' a video using frame ids passed into the loader, and is integrated with PyTorch.
It is used by SAM2, and according to its GitHub, it achieves a 2x speedup vs OpenCV on sequential reads, 14x on random seeks, and 6x on real-world training loading. It could be useful when training on large amounts of data, e.g. when pretraining on large-scale public datasets.
It's easy to implement, via either:
- VideoReader (directly access frames)
- or the more complete VideoLoader:
VideoLoader (from GitHub): "VideoLoader is designed for training deep learning models with tons of video files. It provides smart video shuffle techniques in order to provide high random access performance (We know that seeking in video is super slow and redundant). The optimizations are underlying in the C++ code, which are invisible to user."
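For the simpler VideoReader path, a hedged sketch of what random-access loading could look like on our side. The function name and arguments are hypothetical; decord is imported lazily inside the function so the module still loads where decord isn't installed. Note it returns uint8, deferring the float32 cast per the discussion above.

```python
def load_frames(video_path, frame_ids, width=-1, height=-1):
    """Decode an arbitrary set of frames from one video with decord.

    Returns a uint8 batch of shape (len(frame_ids), H, W, 3).
    width/height of -1 keep the source resolution.
    """
    import decord  # lazy import: optional dependency

    decord.bridge.set_bridge("torch")  # get_batch returns torch tensors
    vr = decord.VideoReader(video_path, width=width, height=height)
    return vr.get_batch(frame_ids)  # random access, no sequential scan
```

Since we already load frames by index for SleapDataset, this should slot in where the imageio reader sits today, with the Sampler supplying the frame_ids.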