showlab / Show-o

Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.
https://arxiv.org/abs/2408.12528
Apache License 2.0
1.04k stars · 44 forks

About checkpoints to be used by finetune #44

Open trmzpi02 opened 1 month ago

trmzpi02 commented 1 month ago

Hello! I am very interested in your work, and I see that you released the weights of Show-o from before fine-tuning on the LLaVA instruction-tuning datasets.

I have the following two questions:

  1. The README recommends fine-tuning from the show-o-512x512-wo-llava-tuning checkpoint. Why not fine-tune from show-o-512x512 instead? Is it because performance degrades on certain downstream tasks after fine-tuning on the LLaVA instruction-tuning datasets?

  2. If I want to fine-tune on certain visual downstream tasks, which checkpoint should I use?

Sierkinhane commented 1 month ago

Hi, thanks for your interest in our work. If you'd like to reproduce our results, you can start from the pre-trained checkpoint. Because the final checkpoint was already fine-tuned on the LLaVA data, further fine-tuning on that same data will degrade performance (overfitting). If you have new training data, I think you can directly fine-tune the final checkpoint.
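To make the recommendation concrete, the choice between the two starting points usually comes down to a single checkpoint path in the training config. The sketch below is hypothetical: the key names (`pretrained_model_path`, `dataset`, `optimizer`) are assumptions for illustration and are not taken from the Show-o repo; only the checkpoint names come from this thread.

```yaml
# Hypothetical fine-tuning config excerpt (key names are illustrative, not Show-o's actual schema).
model:
  # Recommended starting point when fine-tuning on LLaVA-style instruction data,
  # since this checkpoint has not yet seen that data:
  pretrained_model_path: "showlab/show-o-512x512-wo-llava-tuning"
  # For new (non-LLaVA) training data, the final checkpoint can be used instead:
  # pretrained_model_path: "showlab/show-o-512x512"

dataset:
  # Placeholder path; substitute your own instruction-tuning data.
  train_data_path: "/path/to/your/finetune_data"

optimizer:
  # A lower learning rate than pre-training is typical for fine-tuning.
  lr: 1.0e-5
```

The key point from the answer above: pick the checkpoint whose training data does not already include your fine-tuning data, otherwise the extra epochs mostly overfit.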