I found `@torch.no_grad()` on `CLIPVisionTower.forward()`, so no gradients flow back into CLIP during training:
https://github.com/haotian-liu/LLaVA/blob/c121f0432da27facab705978f83c4ada465e46fd/llava/model/multimodal_encoder/clip_encoder.py#L45-L57
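For reference, here is a minimal sketch (not the actual LLaVA code) of why a `@torch.no_grad()` decorator on `forward()` keeps the vision tower frozen even when its parameters have `requires_grad=True`:

```python
import torch
import torch.nn as nn

class FrozenVisionTower(nn.Module):
    """Simplified stand-in for CLIPVisionTower: forward runs under no_grad."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(8, 8)

    @torch.no_grad()
    def forward(self, x):
        # No autograd graph is built inside here, so no gradients can
        # reach self.proj even though requires_grad is True.
        return self.proj(x)

tower = FrozenVisionTower()
head = nn.Linear(8, 1)

out = head(tower(torch.randn(4, 8)))  # tower output has requires_grad == False
out.sum().backward()

print(tower.proj.weight.grad)          # None: nothing flowed into the tower
print(head.weight.grad is not None)    # True: the head is still trained
```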
However, here is a key: `"mm_vision_tower_lr": 2e-06` in the model's `config.json` file, and according to the LLaVA-NeXT blog post from May 25th, the vision tower is trained during stage-2 with lr=2e-6.

Were the previous models trained according to this strategy? Would training CLIP give better results on downstream tasks?
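My guess is that a config key like `mm_vision_tower_lr` maps to a separate optimizer parameter group with a smaller learning rate for the vision tower. A hypothetical sketch of that setup (module and variable names are illustrative, not the actual LLaVA training code):

```python
import torch
import torch.nn as nn

# Toy model standing in for a LLaVA-style architecture.
class ToyLLaVA(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(8, 8)    # stands in for CLIP
        self.mm_projector = nn.Linear(8, 8)
        self.language_model = nn.Linear(8, 8)

model = ToyLLaVA()
base_lr = 2e-5
mm_vision_tower_lr = 2e-6

# Split parameters so the vision tower gets its own (smaller) learning rate.
vision_params = [p for n, p in model.named_parameters() if "vision_tower" in n]
other_params = [p for n, p in model.named_parameters() if "vision_tower" not in n]

optimizer = torch.optim.AdamW([
    {"params": other_params, "lr": base_lr},
    {"params": vision_params, "lr": mm_vision_tower_lr},
])
```

If that is how it works, unfreezing CLIP at a reduced lr during stage-2 would explain the config entry, but it would also require removing the `@torch.no_grad()` guard shown above.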