haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] I would like to take the pretrained multimodal model and adapt it to a new downstream dataset via instruction fine-tuning. However, the dataset contains both text-only and image-text samples. Is there a way to fine-tune on the text-only samples as well? (Or, alternatively, can the language model inside the multimodal model be fine-tuned independently?) #1730

Open Humble2967738843 opened 1 month ago

Humble2967738843 commented 1 month ago

Question

No response
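For context, LLaVA's fine-tuning data format already allows text-only and image-text samples to coexist in one JSON file (the LLaVA-1.5 training mixture itself includes text-only ShareGPT conversations): entries that carry an "image" key and an "<image>" placeholder in the first human turn are treated as multimodal, while entries that omit both are plain text conversations. Below is a minimal sketch of such a mixed dataset; the ids, image path, and output file name are hypothetical.

```python
import json

# Hypothetical mixed fine-tuning file in LLaVA's conversation format.
# Image-text samples carry an "image" key (a path relative to the
# --image_folder argument) and an "<image>" placeholder in the first
# human turn; text-only samples simply omit both.
mixed_data = [
    {
        "id": "sample-visual-0",  # hypothetical id
        "image": "coco/train2017/000000001234.jpg",  # hypothetical path
        "conversations": [
            {"from": "human", "value": "<image>\nWhat is shown in this image?"},
            {"from": "gpt", "value": "A dog playing fetch in a park."},
        ],
    },
    {
        # Text-only sample: no "image" key, no "<image>" token.
        "id": "sample-text-0",
        "conversations": [
            {"from": "human", "value": "Summarize instruction tuning in one sentence."},
            {"from": "gpt", "value": "Instruction tuning fine-tunes a model on instruction-response pairs so it follows natural-language instructions."},
        ],
    },
]

# Write the mixed dataset to disk for use with the fine-tuning scripts.
with open("mixed_finetune.json", "w") as f:
    json.dump(mixed_data, f, indent=2)
```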