luogen1996 / LLaVA-HR

LLaVA-HR: High-Resolution Large Language-Vision Assistant
Apache License 2.0

Ablation study on using just a single-path encoder? #1

Open lucasjinreal opened 8 months ago

lucasjinreal commented 8 months ago

What if you didn't use the added ConvNeXt vision encoder?

luogen1996 commented 8 months ago

The model you mentioned with a single visual path is exactly LLaVA-1.5. We have conducted extensive comparisons in Tab. 1 of our paper; check it out.

lucasjinreal commented 8 months ago

I noticed that when you enlarge the input size in LLaVA-1.5, you interpolate the positional embedding after position_ids is calculated.

This would noticeably drop performance, since the model hasn't seen the larger sizes during training.

What I mean is: have you experimented with enlarging the input size by interpolating the position embedding weights once up front, and then training them along with the vision encoder or the full model?

What do you think the difference between these two approaches would be?

(Your interpolated embedding doesn't seem to be a trainable parameter. I didn't see a resize_position_embedding step before training here, just interpolation after position_ids is calculated.)
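A minimal sketch of the resize-then-train idea, assuming a CLIP-style ViT positional embedding with a leading class token (the function name and shapes here are illustrative, not LLaVA-HR's actual code):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Bicubic-resize a (1, 1 + grid**2, dim) positional embedding once,
    keeping the class-token embedding untouched."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_pe.shape[1] ** 0.5)
    dim = patch_pe.shape[2]
    # (1, N, D) -> (1, D, H, W) so F.interpolate can resize spatially
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pe, patch_pe], dim=1)

# Example: grow a 24x24 grid (336px / patch 14) to 48x48 (672px), then
# register the result as a trainable nn.Parameter before fine-tuning.
pe = torch.randn(1, 1 + 24 * 24, 1024)
print(resize_pos_embed(pe, 48).shape)  # torch.Size([1, 2305, 1024])
```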

luogen1996 commented 8 months ago

I see, the approach you mention may well be better. Let me try it.

lucasjinreal commented 8 months ago

Nice, let me know how the two compare after you've tried it.

lucasjinreal commented 8 months ago

@luogen1996 Hello, I am running SFT stage 2 following your code, fine-tuning with ZeRO-3, and got some warnings:

- vision_model.head.layernorm.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.head.layernorm.weight: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.head.mlp.fc1.bias: found shape torch.Size([4304]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.head.mlp.fc1.weight: found shape torch.Size([4304, 1152]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.head.mlp.fc2.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.head.mlp.fc2.weight: found shape torch.Size([1152, 4304]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.head.probe: found shape torch.Size([1, 1, 1152]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.post_layernorm.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.post_layernorm.weight: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated

There are some reports (https://github.com/microsoft/DeepSpeed/issues/3574) indicating this is related to enabling gradient checkpointing and ZeRO-3 at the same time.

Does this affect model training? The loss looks normal.
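For context: under ZeRO-3 the model is typically instantiated with its parameters already partitioned across ranks, so each parameter locally reports shape torch.Size([0]), and the checkpoint loader compares against that empty local shape, which is exactly what the warning prints. A small sketch to confirm a parameter's real shape (assuming `model` is the ZeRO-3-initialized module; the key name is taken from the warning above):

```python
import torch
import deepspeed

def real_shape(model: torch.nn.Module, key: str) -> torch.Size:
    """Gather a ZeRO-3-partitioned parameter and return its full shape.

    Outside the gather context it reports torch.Size([0]) locally,
    because its data is sharded across ranks."""
    param = dict(model.named_parameters())[key]
    with deepspeed.zero.GatheredParameters([param]):
        return torch.Size(param.shape)

# e.g. real_shape(model, "vision_model.post_layernorm.weight")
# -> torch.Size([1152]), matching the checkpoint
```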

luogen1996 commented 8 months ago

I don't see these warnings in my logs. The weights you printed are not used in the model, so you can probably just ignore them.
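If the warnings are noisy, one workaround (a sketch, not part of this repo; the checkpoint path and loading entry point are placeholders) is to drop the unused pooling-head and post-layernorm keys from the state dict before loading, so nothing is compared against them:

```python
import torch

def load_without_unused_head(model: torch.nn.Module, ckpt_path: str):
    """Filter out the unused vision-head keys listed in the warning, then
    load with strict=False so their absence is intentional, not an error."""
    state_dict = torch.load(ckpt_path, map_location="cpu")
    unused = ("vision_model.head.", "vision_model.post_layernorm.")
    kept = {k: v for k, v in state_dict.items() if not k.startswith(unused)}
    model.load_state_dict(kept, strict=False)

# e.g. load_without_unused_head(model, "checkpoints/vision_tower.bin")
```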

BlueBlueFF commented 8 months ago

> The model you mentioned with a single visual path is exactly LLaVA-1.5. We have conducted extensive comparisons in Tab. 1 of our paper; check it out.

In Table 1, is LLaVA-1.5 trained at different resolutions, or only evaluated at different resolutions?

@luogen1996 Thanks~

luogen1996 commented 8 months ago

> In Table 1, is LLaVA-1.5 trained at different resolutions, or only evaluated at different resolutions?

Yes, we train LLaVA-1.5 at different resolutions. The training settings are the same as for LLaVA-HR, which include low-resolution pre-training and high-resolution instruction tuning.

CserDu commented 3 months ago

> @luogen1996 Hello, I am running SFT stage 2 following your code, fine-tuning with ZeRO-3, and got some warnings:
>
> - vision_model.head.layernorm.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.head.layernorm.weight: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.head.mlp.fc1.bias: found shape torch.Size([4304]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.head.mlp.fc1.weight: found shape torch.Size([4304, 1152]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.head.mlp.fc2.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.head.mlp.fc2.weight: found shape torch.Size([1152, 4304]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.head.probe: found shape torch.Size([1, 1, 1152]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.post_layernorm.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.post_layernorm.weight: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
>
> There are some reports (microsoft/DeepSpeed#3574) indicating this is related to enabling gradient checkpointing and ZeRO-3 at the same time.
>
> Does this affect model training? The loss looks normal.

Hello, I also met this problem. Could you please share how you resolved it?