Why choose to train and fine-tune the LLM for the detection task instead of the ViT or other intermediate layers?

shikras / shikra

Why choose to train and fine-tune the LLM for the detection task instead of the ViT or other intermediate layers? #4

Open double-fire-0 opened 1 year ago

zzhanghub commented 11 months ago

We speculate that the ViT from CLIP already generalizes well, and that training it on a small amount of data might lead to catastrophic forgetting. Both the intermediate FC layer and the LLM are trained. That said, the training strategy remains an open question, and we welcome further discussion on it.
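The split described above (vision encoder left untouched, FC layer and LLM trained) can be sketched in plain PyTorch. This is only an illustration under assumptions: `vit`, `projector`, and `llm` are hypothetical stand-in modules, not the names or classes used in this repository, and the loss is a dummy.

```python
import torch
from torch import nn

# Hypothetical stand-ins; tiny linear layers replace the real models.
vit = nn.Linear(16, 8)        # plays the role of the CLIP ViT
projector = nn.Linear(8, 4)   # plays the role of the intermediate FC layer
llm = nn.Linear(4, 2)         # plays the role of the LLM

# Freeze the vision encoder so its pretrained weights cannot drift.
vit.requires_grad_(False)

# Only the FC layer and the LLM are handed to the optimizer.
trainable = [p for m in (projector, llm) for p in m.parameters()]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

vit_before = vit.weight.detach().clone()
proj_before = projector.weight.detach().clone()

# One training step with a dummy loss.
x = torch.randn(3, 16)
loss = llm(projector(vit(x))).pow(2).mean()
loss.backward()
optimizer.step()

vit_unchanged = torch.equal(vit.weight, vit_before)
projector_updated = not torch.equal(projector.weight, proj_before)
```

After the step, `vit_unchanged` and `projector_updated` should both be true: the frozen encoder receives no gradient, while the FC layer and the LLM are updated.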

CYF2000127 commented 2 months ago

May I ask how to fine-tune the ViT in the code? I set the ViT's parameters to require gradients, but it still isn't being fine-tuned, and I don't know why. Thank you very much.
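For reference, two general PyTorch reasons why a module can stay unchanged even with `requires_grad=True` are sketched below: its forward pass runs inside `torch.no_grad()`, or its parameters were never passed to the optimizer. Whether either applies to this repository is an assumption that has not been checked; `encoder` and `head` are hypothetical stand-ins.

```python
import torch
from torch import nn

encoder = nn.Linear(16, 8)   # hypothetical stand-in for the ViT
head = nn.Linear(8, 2)       # hypothetical stand-in for everything downstream
encoder.requires_grad_(True)

x = torch.randn(3, 16)

# Case 1: the encoder forward is wrapped in no_grad, so no graph is
# recorded and the encoder never receives a gradient.
with torch.no_grad():
    feats = encoder(x)
head(feats).sum().backward()
grad_under_no_grad = encoder.weight.grad

# Case 2: the same forward without no_grad populates the gradient.
head.zero_grad()
head(encoder(x)).sum().backward()
grad_normal = encoder.weight.grad

# Case 3: gradients exist, but the optimizer was built without the
# encoder's parameters, so step() would never update them.
optimizer = torch.optim.SGD(head.parameters(), lr=0.1)
in_optimizer = any(
    p is encoder.weight
    for group in optimizer.param_groups
    for p in group["params"]
)
```

Checking `param.grad` after `backward()` and scanning `optimizer.param_groups` in the real training script would show which of these cases, if any, is the cause.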