mlfoundations / open_flamingo

An open-source framework for training large multimodal models.
MIT License
3.65k stars, 277 forks

[FEATURE REQUEST] Enable Video Training #305

Open simplaj opened 1 month ago

simplaj commented 1 month ago

Is your feature request related to a problem? Please describe. I have been actively using this repository for multimodal training involving images and text. It has been incredibly helpful for my research and development. However, I am interested in expanding the capabilities to include video-based multimodal training. Currently, the repository does not support video inputs, which limits the scope of applications that can be developed.

Describe the workflow you want to enable. I would like to enable a workflow where video data can be seamlessly integrated into the existing multimodal training pipeline. This would involve handling video frames as sequential data and allowing the model to learn from both visual and textual information extracted from videos.

Describe your proposed solution. I propose adding support for video data by extending the current data handling pipeline to process video frames.

Describe alternatives you've considered. An alternative would be to preprocess videos externally into sequences of images and then feed these images into the existing image-based pipeline. However, this approach may not fully leverage the temporal information present in videos, and the preprocessing step adds complexity.
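For reference, the external preprocessing route could start from something like the sketch below, which picks evenly spaced frame indices from a clip. The function name `sample_frame_indices` and the midpoint sampling rule are illustrative assumptions and are not part of OpenFlamingo. Decoding the frames themselves is left to whichever video reader is used.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick up to `num_samples` frame indices spread evenly over a clip.

    Hypothetical helper for illustration only. The clip is split into
    `num_samples` equal-width segments and the middle frame of each
    segment is selected.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    # A short clip cannot supply more frames than it has, so return them all.
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    print(sample_frame_indices(100, 4))
```

Sampling a fixed number of frames keeps the per-video input size constant, but it discards most of the motion between the sampled frames, which is the limitation described above.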

Additional context. Supporting video inputs could significantly enhance the repository's utility for a wider range of applications, such as video captioning, action recognition, and video question answering.

Are you willing to help implement this feature? Yes, I am very keen to contribute to this feature. I have experience in handling video data and training multimodal models. I expect it might take a few weeks to implement and test the feature, depending on the complexity. I would appreciate any guidance or support from the OpenFlamingo team to ensure seamless integration with the existing codebase.

anas-awadalla commented 1 month ago

I agree adding video would be great! While we aren't making major changes to the codebase at the moment, I think you will find this is already partially supported. For instance, the resampler can already take in multiple frames, denoted by F. In the current implementation F is always 1, but if you pass in multiple frames (a video) it will also work. I think you will still need to work on the dataloader etc., though. Hope this is helpful! I am excited to see what you train :).
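As a rough illustration of what passing multiple frames could look like on the data side, here is a minimal sketch that stacks the frames of one clip along a frame axis F. It assumes a vision input laid out as (B, T_img, F, C, H, W); the helper name `pack_clip` is hypothetical, and NumPy stands in for whatever tensor library the dataloader actually uses. Other parts of the model may still assume F = 1, so that is worth checking before relying on this.

```python
import numpy as np


def pack_clip(frames):
    """Stack F frames of shape (C, H, W) into a (1, 1, F, C, H, W) array.

    Hypothetical helper for illustration: one clip fills a single media
    slot (T_img = 1) of a single sample (B = 1), and its frames go on
    the F axis.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    clip = np.stack(frames, axis=0)  # (F, C, H, W)
    return clip[None, None]          # (B=1, T_img=1, F, C, H, W)


if __name__ == "__main__":
    # Four dummy RGB frames of size 8x8 stand in for preprocessed video frames.
    dummy_frames = [np.zeros((3, 8, 8), dtype=np.float32) for _ in range(4)]
    print(pack_clip(dummy_frames).shape)
```

With a single image the same layout reduces to F = 1, so image and video samples could share one code path in the dataloader.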

simplaj commented 4 weeks ago

Thanks for the information! I’ll explore the resampler’s capability to handle multiple frames and will start working on integrating video support into the dataloader. I’ll keep you updated on my progress and let you know if I need any further assistance. Looking forward to contributing to this enhancement!