I found `@torch.no_grad()` on `CLIPVisionTower.forward()`, so no gradients flow back into CLIP during training:
https://github.com/haotian-liu/LLaVA/blob/c121f0432da27facab705978f83c4ada465e46fd/llava/model/multimodal_encoder/clip_encoder.py#L45-L57
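For reference, here is a minimal sketch (not the actual LLaVA code) of why a `@torch.no_grad()` decorator on `forward()` keeps the vision tower frozen even when its parameters have `requires_grad=True`:

```python
import torch
import torch.nn as nn

class FrozenVisionTower(nn.Module):
    """Simplified stand-in for CLIPVisionTower: forward runs under no_grad."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(8, 8)

    @torch.no_grad()
    def forward(self, x):
        # No autograd graph is built inside here, so no gradients can
        # reach self.proj even though requires_grad is True.
        return self.proj(x)

tower = FrozenVisionTower()
head = nn.Linear(8, 1)

out = head(tower(torch.randn(4, 8)))  # tower output has requires_grad == False
out.sum().backward()

print(tower.proj.weight.grad)          # None: nothing flowed into the tower
print(head.weight.grad is not None)    # True: the head is still trained
```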
However, here is a key: `"mm_vision_tower_lr": 2e-06` in the model's `config.json` file, and according to the LLaVA-NeXT blog post from May 25th, the vision tower is trained during stage-2 with lr=2e-6.

Were the previous models trained according to this strategy? Would training CLIP give better results on downstream tasks?
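My guess is that a config key like `mm_vision_tower_lr` maps to a separate optimizer parameter group with a smaller learning rate for the vision tower. A hypothetical sketch of that setup (module and variable names are illustrative, not the actual LLaVA training code):

```python
import torch
import torch.nn as nn

# Toy model standing in for a LLaVA-style architecture.
class ToyLLaVA(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(8, 8)    # stands in for CLIP
        self.mm_projector = nn.Linear(8, 8)
        self.language_model = nn.Linear(8, 8)

model = ToyLLaVA()
base_lr = 2e-5
mm_vision_tower_lr = 2e-6

# Split parameters so the vision tower gets its own (smaller) learning rate.
vision_params = [p for n, p in model.named_parameters() if "vision_tower" in n]
other_params = [p for n, p in model.named_parameters() if "vision_tower" not in n]

optimizer = torch.optim.AdamW([
    {"params": other_params, "lr": base_lr},
    {"params": vision_params, "lr": mm_vision_tower_lr},
])
```

If that is how it works, unfreezing CLIP at a reduced lr during stage-2 would explain the config entry, but it would also require removing the `@torch.no_grad()` guard shown above.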