turingmotors / heron

Apache License 2.0

Regarding training your own model #34

Open · Aniketto16 opened this issue 6 months ago

Aniketto16 commented 6 months ago

Hello! Thank you for your great work. I have the following question:

I have my own Elyza7B checkpoint that I want to finetune on a VQA task. If I follow the LLaVA training scheme closely, I think we need to perform projection pretraining and then finetuning on a chat task. From the documentation I don't understand which dataset I should use: should I use llava_ja directly, or first m3it and then llava_ja? Also, what is the difference between the instruct and normal datasets? Could you clarify? That would be really helpful!

Thank you so much again, looking forward to your reply!!

Ino-Ichan commented 5 months ago

@Aniketto16 Hi!

Thank you very much for your interest and for the kind words about our work! Regarding your question about finetuning your Elyza7B checkpoint for the VQA task, here are some clarifications and recommendations.

Firstly, adopting a training scheme similar to LLaVA, where projection pretraining is followed by comprehensive LLM finetuning, could indeed be effective. However, we currently do not have a publicly available Japanese dataset for LLaVA pretraining. This means that directly mimicking the LLaVA training approach is not feasible at the moment (we are working on this, so please stay tuned for future updates).

From our experiments, we've found that finetuning the projection and the LLM together, without separate projection pretraining, can also yield satisfactory results. This approach trains the full parameters of both components during finetuning, and we recommend giving it a try. See here.
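For illustration, here is a minimal PyTorch sketch of that joint setup. The model class and its module names (`vision_encoder`, `visual_projection`, `language_model`) are hypothetical placeholders rather than Heron's actual classes; the point is simply that both the projection and the LLM stay fully trainable while only the vision encoder is frozen:

```python
import torch
import torch.nn as nn

# Toy stand-in for a vision-language model. The module names here
# (vision_encoder / visual_projection / language_model) are
# illustrative placeholders, not Heron's actual attribute paths.
class ToyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(768, 768)      # kept frozen
        self.visual_projection = nn.Linear(768, 4096)  # trained
        self.language_model = nn.Linear(4096, 4096)    # trained

model = ToyVLM()

# Joint finetuning: keep the projection and the LLM fully trainable;
# freeze only the vision encoder.
for name, param in model.named_parameters():
    param.requires_grad = not name.startswith("vision_encoder")

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)
```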

Of course, pretraining on a Japanese Vision-Language dataset before proceeding to finetune the LLM on a specific VQA dataset is another strategy that is likely to be effective.

Regarding the distinction between "normal" and "instruct" datasets, the key difference lies in how the loss is calculated. For "normal" datasets, the loss is calculated across all input text, whereas for "instruct" datasets, it is calculated only on the model's answers. Instruction tuning typically uses "instruct" datasets, since the goal is to refine the model's ability to answer questions. We provide implementations for both types as a reference. Training with "normal" datasets can also be successful, but it may produce a model that imitates the human side of the conversation as well, for example generating questions instead of answers.
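As a rough sketch of this difference, assuming the usual Hugging Face/PyTorch convention of masking labels with `-100` (this is not Heron's exact collator code, and the token IDs are made up):

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # default ignore_index of F.cross_entropy

# Toy token sequence: a user prompt followed by the model's answer.
prompt_ids = torch.tensor([101, 2054, 2003, 2009])  # prompt tokens (made up)
answer_ids = torch.tensor([1037, 4937, 102])        # answer tokens (made up)
input_ids = torch.cat([prompt_ids, answer_ids])

# "normal" dataset: the loss is computed over every token.
labels_normal = input_ids.clone()

# "instruct" dataset: prompt tokens are masked out, so the loss is
# computed over the answer tokens only.
labels_instruct = input_ids.clone()
labels_instruct[: len(prompt_ids)] = IGNORE_INDEX

# F.cross_entropy skips positions whose label is IGNORE_INDEX
# (label shifting for causal LM training is omitted for brevity).
logits = torch.randn(len(input_ids), 32000)  # dummy logits, 32k vocab
loss_normal = F.cross_entropy(logits, labels_normal)
loss_instruct = F.cross_entropy(logits, labels_instruct)  # answer-only
```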

We hope this clarifies your queries and aids in your finetuning endeavors. Please feel free to reach out if you have further questions. We're excited to see the advancements you'll make with your project!