Hi,
happy to hear that our work might be useful for your language! Which language is it, if I may ask?
Regarding your questions:
Are there any specific techniques or strategies that can help mitigate the risk of catastrophic forgetting when fine-tuning the model for a single language? For instance, would it be advisable to fine-tune using a mixture of English and the target language?
First, to clarify: are you only interested in image captioning in your language, or do you want to preserve, for example, VQA or instruction-following capabilities, either in your language or in general? If you only care about your language, training on it alone is a valid choice. If you want to preserve skills besides captioning, then training with a mix of data is probably best.
Which parts of the model would be best to freeze during the fine-tuning process to ensure optimal results?
You probably want to follow the setup we use: train the Q-Former and the LoRA adapters in the LLM, while the LLM itself and the ViT stay frozen. Depending on your language, you can also experiment with training the LLM's embeddings if your compute allows it, especially if your language does not use the Latin script; see the sketch below.
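In case it helps, here is a rough sketch of that freezing setup with HuggingFace transformers and peft. This is not our exact training script; the checkpoint name, the LoRA settings, and the T5-style target module names are assumptions you would adapt to your setup.

```python
# Sketch only: freeze the ViT and the LLM, train the Q-Former plus LoRA adapters in the LLM.
import torch
from transformers import Blip2ForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Blip2ForConditionalGeneration.from_pretrained(
    "Gregor/mblip-mt0-xl", torch_dtype=torch.bfloat16  # assumed checkpoint
)

# Freeze the vision encoder (ViT) and the LLM.
for p in model.vision_model.parameters():
    p.requires_grad = False
for p in model.language_model.parameters():
    p.requires_grad = False

# Keep the Q-Former, its query tokens, and the projection into the LLM trainable.
for p in model.qformer.parameters():
    p.requires_grad = True
model.query_tokens.requires_grad = True
for p in model.language_projection.parameters():
    p.requires_grad = True

# Add LoRA adapters to the LLM attention; module names "q"/"v" are typical for
# mT0/T5-style models and are an assumption here.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, target_modules=["q", "v"])
model.language_model = get_peft_model(model.language_model, lora_config)

# Optionally also unfreeze the LLM input embeddings, e.g. for non-Latin scripts.
# for p in model.language_model.get_input_embeddings().parameters():
#     p.requires_grad = True
```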
Do you have any suggestions regarding the type of data that would be most beneficial for the fine-tuning process?
A good start is probably the data we use to train our model, translated to your language. MSCOCO for captions and LLaVA for general instruction data are two quality datasets. VQAv2 is a good VQA dataset, but its answers are short and hard to translate automatically. You can also consider adding GQA to the mix, which has both short VQAv2-like answers and long answers; we did not add it to our mix because of our evaluation setup, but it is a good dataset.
Since you are a native speaker, I strongly suggest you check the translation quality and maybe try different models to see what works best for your language (NLLB or the recent MADLAD-400 are good public MT models); see the example below. The translation quality is probably the most important factor for successful training.
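For illustration, a minimal translation sketch with NLLB via transformers could look like the following. The checkpoint size and the FLORES-200 code for Western Persian ("pes_Arab") are assumptions, so double-check them for your use case and inspect the outputs manually.

```python
# Sketch: translate English captions to Persian with NLLB before fine-tuning.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

mt_name = "facebook/nllb-200-distilled-600M"  # assumed checkpoint; larger ones exist
tokenizer = AutoTokenizer.from_pretrained(mt_name, src_lang="eng_Latn")
mt_model = AutoModelForSeq2SeqLM.from_pretrained(mt_name)

def translate(texts, tgt_lang="pes_Arab", max_length=128):
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = mt_model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate(["A man riding a horse on the beach."]))
```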
Thank you for your quick response!
The language I am referring to is Persian (Farsi). Upon examining the results, it appears that the model maps Persian and English sentence representations into the same space. I have noticed some grammatical errors, which I initially attributed to the translation data. This leads me to believe that if I train the model exclusively on one language, it may forget other tasks and languages.
My goal is to maintain the VQA and Instruction Following capabilities in my language. Are you suggesting that I use a mixture of data tasks exclusively in my target language?
Which model did you try? Unfortunately, Farsi is only tested in XM3600; there the mT0 model gets 0.0 CIDEr while the BLOOMZ model gets at least 13.84, so it is possible that the mT0 model has serious problems with Farsi. I checked some other mT0 models (from the ablation experiments) and some did achieve ~25 CIDEr, so in theory mT0 should be okay for Farsi but it seems really unstable in training.
I have noticed some grammatical errors, which I initially attributed to the translation data.
That is possible. I suggest you take a look at my training data (e.g., https://huggingface.co/datasets/Gregor/mblip-train/blob/main/mscoco/coco_train_mt.json, entries where "context" contains "Persian") to check for mistakes; the snippet below shows one way to do that.
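If useful, a small script along these lines could pull the file and filter the Persian entries. It assumes the file is a JSON list of entries with the "context" field mentioned above; adjust the field names if the structure differs.

```python
# Sketch: spot-check the Persian portion of the mBLIP training data.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Gregor/mblip-train",
    filename="mscoco/coco_train_mt.json",
    repo_type="dataset",
)
with open(path, encoding="utf-8") as f:
    data = json.load(f)

persian = [ex for ex in data if "Persian" in ex.get("context", "")]
print(f"{len(persian)} Persian entries")
for ex in persian[:5]:  # inspect a few examples manually
    print(ex)
```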
This leads me to believe that if I train the model exclusively on one language, it may forget about other tasks and languages.
That is correct, but if you only care about Farsi, then at least forgetting other languages is not a problem.
My goal is to maintain the VQA and Instruction Following capabilities in my language. Are you suggesting that I use a mixture of data tasks exclusively in my target language?
Yes.
I conducted tests on mT0 using the coco-test dataset and manually examined the results to gain a better understanding of its performance. Thank you very much for your assistance.
You're welcome. When you have some results, feel free to share them and maybe even put your model and data on HuggingFace.
Thank you for your great work! I am interested in fine-tuning the mBLIP model for a low-resource language that currently has unsatisfactory performance in tasks such as image captioning. However, I have concerns about the possibility of the model deteriorating due to catastrophic forgetting. I would appreciate some guidance on:
Are there any specific techniques or strategies that can help mitigate the risk of catastrophic forgetting when fine-tuning the model for a single language? For instance, would it be advisable to fine-tune using a mixture of English and the target language?
Which parts of the model would be best to freeze during the fine-tuning process to ensure optimal results?
Do you have any suggestions regarding the type of data that would be most beneficial for the fine-tuning process?