Closed: zwcolin closed this issue 3 months ago
@zwcolin did you manage to get fine-tuning to work with llava 1.6?
Implementation-wise, I think you at least need to (1) build the training pipeline, mostly in train.py, based on the inference pipeline; (2) fine-tune the visual encoder at some point with its own learning rate; and (3) use the correct instruction-tuning template for whichever LLM backbone you use. You might be able to infer some of the parameters from the config.json they put in their model weight folder.
That being said, we can't go much further than that, because we don't have the exact data LLaVA 1.6 was trained on. But if you have your own data, you could implement this and see how it performs. Again, I'm also just a user of this codebase and my views are completely unofficial. To make sure everything is correct, I'd suggest waiting for the authors to release the official training implementation for LLaVA 1.6.
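For (2), here's a minimal sketch (my own guess, not the authors' recipe) of how you could unfreeze the vision tower and give it a separate, smaller learning rate through optimizer parameter groups; the "vision_tower" substring match and the 2e-5 / 2e-6 values are assumptions you'd want to adjust:

```python
import torch

def build_optimizer(model, base_lr=2e-5, vision_lr=2e-6):
    """Unfreeze the vision tower and train it with its own learning rate.

    Assumes the visual encoder's parameters contain "vision_tower" in their
    names; adjust the substring if your checkpoint names the module differently.
    """
    vision_params, other_params = [], []
    for name, param in model.named_parameters():
        if "vision_tower" in name:
            param.requires_grad = True  # the encoder ships frozen by default
            vision_params.append(param)
        elif param.requires_grad:
            other_params.append(param)
    return torch.optim.AdamW([
        {"params": other_params, "lr": base_lr},
        {"params": vision_params, "lr": vision_lr},
    ])
```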
Well, I kind of have the feeling the codebase was put together in haste. You certainly have to prompt the LLM with its respective prompt template, but the provided code is not clear on that: it uses the Llama [INST] prompt template for a Mistral model, which is generally ChatML based...
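Just to make the difference concrete, here is roughly what the two formats look like; which one llava-1.6-mistral actually expects is exactly what's unclear to me, so treat these strings as illustrations rather than the confirmed templates:

```python
# Illustration only: the two chat formats discussed above.

# Llama-2-style [INST] template (what the provided code appears to apply):
LLAMA_INST_PROMPT = "[INST] <image>\nWhat is shown in this image? [/INST]"

# ChatML-style template (used by many Mistral-based chat fine-tunes):
CHATML_PROMPT = (
    "<|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
```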
But looking at the state of train.py and llava_trainer.py, I feel I can get this working with 1.6, given my experience in fine-tuning and training.
I connected on LinkedIn in case you'd like to get into the weeds of things ;)
Question
Hi,
Thanks for the great work! I am doing research in related domains and have some questions about LLaVA 1.6's training details, in particular:
(1) The blog says "It supports three aspect ratios, up to 672x672, 336x1344, 1344x336 resolution", but in config.json it seems that only resolutions up to 1008 are listed. Which one is correct?
(2) Since the visual encoder is kept frozen for both LLaVA and LLaVA 1.5, I assumed it would still be frozen, as the LLaVA-NeXT blog does not mention otherwise. However, I see in config.json that unfreeze_mm_vision_tower is set to true and that there is a learning rate associated with it. If possible, could you provide more details on how you fine-tuned the visual encoder?
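For context, this is roughly how I read those fields out of the released config.json (the key names other than unfreeze_mm_vision_tower are my best guesses at what the file calls them, so adjust if they differ in your copy):

```python
import json

# Inspect the fields relevant to the two questions above.
with open("config.json") as f:
    cfg = json.load(f)

# (1) which input resolutions are actually listed
print("image_grid_pinpoints:", cfg.get("image_grid_pinpoints"))
# (2) whether the vision tower is unfrozen, and its associated learning rate
print("unfreeze_mm_vision_tower:", cfg.get("unfreeze_mm_vision_tower"))
print("mm_vision_tower_lr:", cfg.get("mm_vision_tower_lr"))
```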