salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence
BSD 3-Clause "New" or "Revised" License

I can provide more datasets from industry. #187

Open dongrixinyu opened 1 year ago

dongrixinyu commented 1 year ago

Language-image pretraining and finetuning are promising. I am new to this field and want to learn more about the available datasets.

I want to finetune the model in this repo. Is there a convenient way to get these datasets?

dxli94 commented 1 year ago

Hi, @dongrixinyu

dongrixinyu commented 1 year ago

Thank you for your info.

As you know, large language-image pretrained models are a promising direction. I assume mature multimodal models might be open to the public in about 1 to 2 years.

So, does Salesforce plan to combine videos, rather than images, with language in the future? What do you think of your multimodal model compared with those from other companies?

dxli94 commented 1 year ago

BLIP-2 represents state-of-the-art multimodal capabilities, as evidenced by the evaluations reported in the paper.

Video is of interest, and we explored it previously in ALPRO, which is also included in LAVIS.