[Question] I want to take a pre-trained multimodal model and use instruction fine-tuning to adapt it to a new dataset for downstream tasks. The dataset contains two kinds of samples: text-only and image-text. Is there no way to fine-tune on the text-only samples? (Or is it possible to fine-tune the language model inside the multimodal model independently?) #1730
Question
No response