magic-research / PLLaVA

Official repository for the paper PLLaVA

Evaluation bug #22

Closed xumingze0308 closed 6 months ago

xumingze0308 commented 6 months ago

Hi,

When I evaluate the model on videoqabench, the model doesn't generate any answer but only repeats the prompt, as shown here. The bug is very similar to this closed issue, but I still have it after adding your new commits. Can you please take a look at it? Thanks!

xumingze0308 commented 6 months ago

Hi,

I found the error: I had set the wrong model_dir. To double-check with you: in eval.sh, should I set model_dir to the folder of the pretrained llava-v1.6 and weight_dir to my own fine-tuned folder (the LoRA weights)?

Another question: the evaluation gives the warning UserWarning: do_sample is set to False. However, top_p is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p. This is normal and should still reproduce your results, right?

Thank you!

ermu2001 commented 6 months ago

Hi, for the first question, directly passing in MODELS/pllava-7b should be fine, as long as you downloaded it from Hugging Face. The demo and evaluation share a loading function here: https://github.com/magic-research/PLLaVA/blob/fd9194ae55750c2e1ac677056f6286c126eda580/tasks/eval/model_utils.py#L39-L125

So the weights could be loaded from two sources:

  1. model_dir: this should contain weights named as in the original transformers PLLaVA model.
  2. weight_dir: this is loaded after constructing the PeftModel, so it should contain weights named as in the PeftModel.

I think as long as the weights are loaded from one of the two sources above, it should be fine. When loading from the downloaded weights, you should see "" in the terminal.
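
For illustration, here is a minimal sketch of that two-source loading, assuming the standard transformers + peft APIs; the function and model class names are placeholders, not the exact code in tasks/eval/model_utils.py:

```python
# Sketch only: load base weights from model_dir, then optionally wrap with PEFT
# and load the LoRA weights from weight_dir (the fine-tuned folder).
from transformers import AutoModelForCausalLM  # stand-in for the actual PLLaVA class
from peft import PeftModel


def load_from_two_sources(model_dir, weight_dir=None):
    # 1) model_dir: weights named as in the original transformers model
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    if weight_dir is not None:
        # 2) weight_dir: weights named as in the PeftModel (LoRA adapter)
        model = PeftModel.from_pretrained(model, weight_dir)
    return model
```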

For the second question, the UserWarning also appears in our own evaluation, so it's safe so far.
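
A small self-contained example of why the warning is harmless (using gpt2 only as a stand-in model, not the repo's setup): with do_sample=False, generation is greedy and top_p is simply ignored, so the output is unchanged.

```python
# Passing a sampling-only knob (top_p) together with do_sample=False triggers the
# UserWarning, but greedy decoding ignores top_p, so results are unaffected.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Question: what is shown in the video?", return_tensors="pt")
out = model.generate(**inputs, do_sample=False, top_p=0.9, max_new_tokens=16)
# -> UserWarning: `do_sample` is set to `False`. However, `top_p` is set to 0.9 ...
# Either drop top_p or set do_sample=True to silence it; the greedy output is identical.
print(tok.decode(out[0], skip_special_tokens=True))
```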

BTW, I just fixed a response postprocessing bug in https://github.com/magic-research/PLLaVA/commit/fd9194ae55750c2e1ac677056f6286c126eda580, so you might consider updating to the newest code. The former code could leave a leading space in the answer, and the ChatGPT evaluation seems to be sensitive to that leading space in the response (VCG score 3.10 vs. 3.03 for pllava-7b with lora_alpha set to 4).
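
The gist of that kind of fix, as a hedged sketch rather than the exact commit: strip the prompt from the decoded text and remove any leading whitespace before scoring.

```python
# Sketch of the postprocessing described above; function name is illustrative.
def postprocess_response(decoded: str, prompt: str) -> str:
    # Drop the echoed prompt, if present, then strip the leading space so the
    # GPT-based evaluator sees "answer" rather than " answer".
    answer = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    return answer.lstrip()
```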

xumingze0308 commented 6 months ago

Thank you very much for this clarification! I have another question about LoRA alpha. I found that the default training config uses lora_alpha=32 but the evaluation uses 4 instead. After I changed the evaluation's lora_alpha to 32, the performance dropped a lot (e.g., MSVD ~77% -> ~73%). Did you observe the same thing?

valencebond commented 6 months ago

> Thank you very much for this clarification! I have another question about LoRA alpha. I found that the default training config uses lora_alpha=32 but the evaluation uses 4 instead. After I changed the evaluation's lora_alpha to 32, the performance dropped a lot (e.g., MSVD ~77% -> ~73%). Did you observe the same thing?

I noticed the same thing: lora_alpha is not consistent between the training and inference stages. As shown in Fig. 9, the authors claim that using a lower alpha at test time achieves better performance.
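
For intuition, in PEFT the LoRA update is scaled by lora_alpha / r, so lowering alpha at test time down-weights the fine-tuned delta relative to the frozen base weights. A minimal sketch (the r=32 value here is an assumption for illustration, not necessarily the repo's config):

```python
# Changing lora_alpha between training and evaluation changes the effective
# scaling of the LoRA update (scaling = lora_alpha / r).
from peft import LoraConfig

train_cfg = LoraConfig(r=32, lora_alpha=32)  # scaling = 32 / 32 = 1.0 during training
eval_cfg = LoraConfig(r=32, lora_alpha=4)    # scaling = 4 / 32 = 0.125 at test time
# A smaller test-time scaling keeps the model closer to the base weights, which is
# the effect the paper's Fig. 9 reports as helpful for downstream QA scores.
```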