dongrixinyu opened this issue 1 year ago
Hi, @dongrixinyu
Thank you for your info.
As you know, large language-image pretrained models are a promising direction. I assume mature multimodal models might be open to the public in about 1~2 years.
So, does Salesforce want to continue combining videos, rather than just images, with language in the future? How do you think your multimodal model compares to those of other companies?
BLIP-2 represents the state of the art in multimodal capabilities, as evidenced by the evaluation reported in the paper.
Videos are of interest, and we explored them before in ALPRO, which is also included in LAVIS.
Language-image pretraining and finetuning are promising. I am new to this field and want to learn more about the datasets.
I want to finetune the model in this repo. Is there a convenient way to access these datasets?