Caixin89 opened this issue 9 months ago
May I know if the results shown in Table 8 above are validation set or test set scores?
Table 8 shows validation results. May I know how many epochs you ran the model for, and which checkpoint you used?
4 epochs for the 2 runs that use the unchanged finetuning script, and 5 epochs when I changed the finetuning script to the paper's settings.
The final epoch count is decided automatically by early stopping, with `early_stopping_patience=20`.
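For context, patience-based early stopping of this kind can be sketched as follows (illustrative only, not the repo's actual implementation):

```python
def should_stop(val_losses, patience=20):
    """Return True once the validation loss has not improved
    for `patience` consecutive evaluations."""
    best = float("inf")
    since_best = 0
    for loss in val_losses:
        if loss < best:
            best = loss
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:
                return True
    return False
```

So with `patience=20`, training only halts after 20 evaluations in a row without a new best validation loss, which is why the final epoch count varies between runs.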
I am assuming you are using the last checkpoint the run generated instead of an intermediate checkpoint? If so, try using more epochs. If it still doesn't work, I will provide a finetuned checkpoint so we can see whether the issue is in the evaluation script.
Sure, I can try that. In the meantime, could you share the number of epochs you used for finetuning?
The above is a plot of the validation loss against training steps. The validation loss is increasing consistently across training steps.
Is this expected?
I was unable to achieve the result shown in the UDOP paper.
I used the udop-unimodel-large-224 checkpoint.
My ANLS score is 0.407903. This is nowhere near the 0.461 shown in the table below, taken from the paper.
Since I noticed that the batch size, warmup steps, and weight decay given in https://github.com/microsoft/i-Code/blob/main/i-Code-Doc/scripts/finetune_duebenchmark.sh differ from those reported in the paper, I also tried changing the finetuning script to use the paper's settings.
Lastly, I also tried adding the task prompt prefix, since the existing code does not do so. I followed the approach from https://github.com/microsoft/i-Code/issues/71#issuecomment-1623201208
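For reference, adding the task prompt amounts to prepending a short task-description string to each question before tokenization; a minimal sketch (the prefix string below is illustrative, the actual wording is whatever the linked comment specifies):

```python
def add_task_prefix(question, prefix="question answering."):
    # Prepend the task prompt so the seq2seq model sees, e.g.:
    # "question answering. What is the invoice date?"
    return f"{prefix} {question}"
```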
Results of the 3 different finetuning configurations:
| Task prefix | Hyperparameter settings | ANLS score |
|---|---|---|
| No | Unchanged finetuning script | 0.407903 |
| No | Paper's settings | 0.40174 |
| Yes | Unchanged finetuning script | 0.408355 |

Other changes I made:
- Changed to use PyTorch's AdamW, based on "loss does not have a grad fn" https://github.com/microsoft/i-Code/issues/63#issuecomment-1608019905
- Within the `baselines-master` directory in the due-benchmark repo:
  - Applied the fix from "KeyError: 'common_format'" due-benchmark/baselines#7 (comment)
  - In `baselines-master/benchmarker/data/utils.py`, changed the `dtype` of `label_name` from `U100` to `U1024` to prevent truncation of questions during display.

Please assist.
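To illustrate why the dtype change matters: NumPy's fixed-width unicode dtypes silently truncate strings longer than their width (sketch with made-up data):

```python
import numpy as np

# A fixed-width unicode dtype silently truncates longer strings:
short = np.array(["x" * 200], dtype="U100")   # question text cut to 100 chars
wide = np.array(["x" * 200], dtype="U1024")   # full 200-char text preserved
```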
May I ask how you have implemented the ANLS metric for this task?
It should be in this repo: https://github.com/due-benchmark/evaluator/tree/master
Yes, I used ANLS from https://github.com/due-benchmark/evaluator/tree/master.
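For anyone cross-checking: ANLS averages, over questions, the best thresholded normalized Levenshtein similarity against the reference answers (threshold τ = 0.5). A minimal per-question sketch, not the DUE evaluator's exact code:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (a[i - 1] != b[j - 1]))   # substitution
            prev = cur
    return dp[n]

def anls_score(prediction, ground_truths, tau=0.5):
    """Best thresholded similarity of `prediction` against any reference;
    similarities with normalized distance >= tau count as 0."""
    best = 0.0
    for gt in ground_truths:
        p, g = prediction.strip().lower(), gt.strip().lower()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        if nl < tau:
            best = max(best, 1.0 - nl)
    return best
```

The dataset-level ANLS is then the mean of `anls_score` over all questions; the 0.461 in the paper corresponds to that mean.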
I have tried with 10 epochs and my ANLS is still ~0.41. Am I supposed to finetune with even more epochs?
Could you provide me with your finetuned checkpoint?
Also, I would like to double-check that the 46.1 ANLS score is indeed based on finetuning the udop-unimodel-large-224 checkpoint without additional supervised pretraining. Correct?
@zinengtang Any updates?