haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

Is the vision tower trained during stage 2 (Visual Instruction Tuning)? #1537

Open GoGoJoestar opened 4 weeks ago

GoGoJoestar commented 4 weeks ago

I found @torch.no_grad() on CLIPVisionTower.forward(), so gradients won't flow back to CLIP during training.

https://github.com/haotian-liu/LLaVA/blob/c121f0432da27facab705978f83c4ada465e46fd/llava/model/multimodal_encoder/clip_encoder.py#L45-L57
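For clarity, here is a minimal, self-contained sketch (not LLaVA code; the class and layer names are made up) showing why a @torch.no_grad() decorator on forward() keeps the wrapped encoder frozen: outputs produced under no_grad are detached from the autograd graph, so the encoder's parameters never receive gradients even if requires_grad is True.

```python
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)

    @torch.no_grad()  # same pattern as CLIPVisionTower.forward()
    def forward(self, x):
        return self.proj(x)

encoder = FrozenEncoder()
head = nn.Linear(4, 1)

features = encoder(torch.randn(2, 4))
print(features.requires_grad)        # False: the graph was cut inside no_grad

loss = head(features).sum()
loss.backward()
print(encoder.proj.weight.grad)      # None: no gradient reached the encoder
print(head.weight.grad is not None)  # True: the head still trains
```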

However, the model's config.json contains the key "mm_vision_tower_lr": 2e-06, and the LLaVA-NeXT blog from May 25th says the vision tower is trained during stage 2 with lr=2e-6.

Were the previous models trained with this strategy? Does training CLIP give better results when fine-tuning for a downstream task? (A sketch of what that config key seems to imply is below.)
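This is a hedged sketch, not the released LLaVA-NeXT training code, of what a key like "mm_vision_tower_lr" usually implies: the vision tower's parameters are unfrozen and placed in their own optimizer parameter group with a smaller learning rate than the rest of the model. The toy model, the "vision_tower" name filter, and the base learning rate below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy stand-in for a multimodal model; module names are illustrative only.
model = nn.ModuleDict({
    "vision_tower": nn.Linear(8, 8),
    "language_model": nn.Linear(8, 8),
})

base_lr = 2e-5
mm_vision_tower_lr = 2e-6

# Split parameters into two groups by name.
vision_params, other_params = [], []
for name, param in model.named_parameters():
    (vision_params if "vision_tower" in name else other_params).append(param)

# The vision tower is trained, but with a smaller learning rate.
optimizer = torch.optim.AdamW([
    {"params": other_params, "lr": base_lr},
    {"params": vision_params, "lr": mm_vision_tower_lr},
])
```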

2U1 commented 1 week ago

I think the training code for LLaVA-NeXT hasn't been released yet.

PangziZhang523 commented 1 week ago

I printed out the gradients: even though requires_grad=True, parameter.grad is None. Why?
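A diagnostic snippet for this (it assumes `model` is the loaded LLaVA model and that loss.backward() has already been called on one training batch): requires_grad=True only means autograd *may* track the parameter; .grad stays None if no graph ever connects it to the loss, which is exactly what @torch.no_grad() in the tower's forward() causes.

```python
# Run after loss.backward() to see whether any gradient reached the tower.
for name, param in model.named_parameters():
    if "vision_tower" in name:
        print(f"{name}: requires_grad={param.requires_grad}, "
              f"grad={'None' if param.grad is None else 'populated'}")
```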