haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Feature request] Why torch.no_grad on CLIPVisionTower Forward #305

Open LukeBailey181 opened 11 months ago

LukeBailey181 commented 11 months ago

feature

In llava/model/multimodal_encoder/clip_encoder.py line 39, the forward pass of the vision encoder has a torch.no_grad decorator. I am trying to do some input optimization, and I think this is stopping gradients from being backpropagated to the input image. Is there a reason for this no_grad? Would it be okay to remove it? (I am happy to make a PR if so :) )
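
Here is a minimal, self-contained illustration of what I mean (a toy stand-in module, not the actual CLIPVisionTower code):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the CLIP vision tower; any module shows the effect.
vision_tower = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 16))

@torch.no_grad()
def encode(images):
    # Mirrors the decorated forward: no autograd graph is recorded here.
    return vision_tower(images)

images = torch.randn(1, 3, 224, 224, requires_grad=True)
features = encode(images)
print(features.requires_grad)  # False -> no gradient can ever reach `images`
```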

Thanks in advance for any help with this!

haotian-liu commented 11 months ago

Hi @LukeBailey181

Thanks for the feedback and for your interest in our project. You are right that this torch.no_grad is preventing you from doing input optimization. This may be one of the overly cautious decorators I added to make sure the vision encoder is not modified during pretraining/instruction tuning. Since we already have vision_encoder.requires_grad_(False), removing it should be fine.

It would be great if you could help create a PR for this. We want to make sure that (1) the vision encoder is not modified in any way we do not intend, and (2) gradients do not backpropagate through the vision encoder unnecessarily (for most use cases, including standard pretraining and instruction tuning), unless they are actually needed, as in your input optimization setup.
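
For example, something along these lines should hold once the decorator is removed (a rough sketch with a toy stand-in module, not the actual LLaVA code):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in module; the real code freezes the CLIP tower the same way.
vision_tower = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 16))
vision_tower.requires_grad_(False)  # encoder weights stay frozen

def encode(images):
    # Same forward, but without @torch.no_grad()
    return vision_tower(images)

images = torch.randn(1, 3, 224, 224, requires_grad=True)
loss = encode(images).sum()
loss.backward()

print(images.grad is not None)  # True: gradients reach the input image
print(all(p.grad is None for p in vision_tower.parameters()))  # True: encoder untouched
```

Note that when neither the input images nor the encoder parameters require gradients (the standard pretraining/instruction-tuning setup), autograd does not record a graph for the encoder anyway, so point (2) should still be satisfied without the decorator.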

Thank you!

harrytea commented 4 months ago

How can I optimize the vision encoder? Which code should I modify?