luogen1996 / LLaVA-HR

LLaVA-HR: High-Resolution Large Language-Vision Assistant
Apache License 2.0

Got loss 0 after training with larger input size in stage 2 (sft) #3

Open luohao123 opened 6 months ago

luohao123 commented 6 months ago

Got loss 0, any ideas?

{'loss': 2.4696, 'learning_rate': 1.655629139072848e-08, 'epoch': 0.0}                                                                                                                                                          
{'loss': 0.0, 'learning_rate': 3.311258278145696e-08, 'epoch': 0.0}                                                                                                                                                             
{'loss': 0.0, 'learning_rate': 4.966887417218544e-08, 'epoch': 0.0}                                                                                                                                                             
{'loss': 0.0, 'learning_rate': 6.622516556291392e-08, 'epoch': 0.0}  

I first trained the projector with an input size of 490, and did SFT based on it. I directly interpolated the position embeddings, just like your repo does.
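For context, by "directly interpolated" I mean roughly the following (a rough sketch only; the exact helper in LLaVA-HR may differ, and the function name is illustrative):

```python
import torch
import torch.nn.functional as F

# Rough sketch of "directly interpolating": resize the (1, N, C) patch
# position embeddings to the larger grid implied by the new input size.
# Function name and details are illustrative, not taken from LLaVA-HR.
def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    _, n, c = pos_embed.shape
    old_grid = int(n ** 0.5)  # assumes a square patch grid with no class token
    grid = pos_embed.reshape(1, old_grid, old_grid, c).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode='bicubic', align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, c)
```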

luogen1996 commented 6 months ago

The image resolution should be kept at 384 in the first stage; then you can increase the resolution in the second stage. Notably, the image size should be divisible by 32 to accommodate ConvNeXt's setting.
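For example, you can quickly check which candidate sizes satisfy the divisibility constraint (a throwaway snippet, not from the repo):

```python
# Throwaway check (not from the repo): which candidate input sizes fit
# ConvNeXt's 32x total downsampling. 490 fails, which is one reason it
# cannot be used directly.
candidates = [384, 448, 490, 512, 768, 1024]
print([s for s in candidates if s % 32 == 0])  # -> [384, 448, 512, 768, 1024]
```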

luohao123 commented 6 months ago

From your pretraining script, I see you set the input size to 384, while the original CLIP model uses 336.

So I have a couple of things to clarify:

  1. You enlarged the input size. Is there a limit on how much it can be enlarged in the first stage? If 384 is OK, why not 448?
  2. My CLIP model is Siglip-384, so I enlarged it a little to 490. Which formula should I follow when enlarging the size in the pretraining stage?

thanks for your reply

luogen1996 commented 6 months ago

384 is the resolution of ConvNeXt, and the resolution of CLIP is still 336 in the first stage. See the following code in multipath_encoder_wapper.py, where we downsample the resolution to 336 for CLIP:

    fast_image_size = max(int(self.image_size / 32 * 14), 336)
    y = F.interpolate(x.float(), size=(fast_image_size, fast_image_size), mode='bilinear', align_corners=True).to(dtype=x.dtype)

If you use Siglip-384 with a stride of 16 as the visual backbone, you may need to modify self.image_size / 32 * 14 to self.image_size / 32 * 16. Then you can directly run our script.
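For example, with Siglip-384 (stride 16) the resize would look roughly like this. This is only a sketch: the 14 -> 16 change comes from the comment above, while everything else, including the 336 floor, simply mirrors the original snippet:

```python
import torch
import torch.nn.functional as F

# Sketch of the fast-branch resize with Siglip-384 (stride 16) swapped in.
# Only the 14 -> 16 change comes from the comment above; the rest mirrors
# the original snippet, including the 336 floor.
def resize_for_fast_branch(x: torch.Tensor, image_size: int) -> torch.Tensor:
    fast_image_size = max(int(image_size / 32 * 16), 336)  # was image_size / 32 * 14 for CLIP
    return F.interpolate(
        x.float(),
        size=(fast_image_size, fast_image_size),
        mode='bilinear',
        align_corners=True,
    ).to(dtype=x.dtype)
```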

luogen1996 commented 6 months ago


I don't recommend increasing the resolution in the first stage; it will hurt the final performance.