Caixin89 opened this issue 9 months ago
May I know if the results shown in Table 8 above are validation set or test set scores?
Table 8 shows validation results. May I know how many epochs you ran the model for, and which checkpoint you used?
4 epochs for the 2 runs that use the unchanged finetuning script, and 5 epochs when I changed the finetuning script to the paper's settings.
The final epoch count is decided automatically by early stopping, with `early_stopping_patience=20`.
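For context, patience-based early stopping of this kind can be sketched as follows (illustrative only, not the repo's actual implementation):

```python
def should_stop(val_losses, patience=20):
    """Return True once the validation loss has not improved
    for `patience` consecutive evaluations."""
    best = float("inf")
    since_best = 0
    for loss in val_losses:
        if loss < best:
            best = loss
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:
                return True
    return False
```

So with `patience=20`, training only halts after 20 evaluations in a row without a new best validation loss, which is why the final epoch count varies between runs.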
I am assuming you are using the last checkpoint the run generated instead of an intermediate checkpoint? If so, try using more epochs. If it still doesn't work, I will provide a finetuned checkpoint so we can see whether the issue is in the evaluation script.
Sure, I can try that. In the meantime, could you share the number of epochs you used for finetuning?
The above is a plot of the validation loss against training steps. The validation loss is increasing consistently across training steps.
Is this expected?
I was unable to achieve the result shown in the UDOP paper.
I used the udop-unimodel-large-224 checkpoint.
My ANLS score is 0.407903. This is nowhere near the 0.461 shown in the table below, taken from the paper.
Since I noticed that the batch size, warmup steps, and weight decay given in https://github.com/microsoft/i-Code/blob/main/i-Code-Doc/scripts/finetune_duebenchmark.sh differ from those reported in the paper, I also tried changing the finetuning script to use the paper's settings.
Lastly, I also tried adding the task prompt prefix, since the existing code does not do so. I followed the approach from https://github.com/microsoft/i-Code/issues/71#issuecomment-1623201208
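For reference, adding the task prompt amounts to prepending a short task-description string to each question before tokenization; a minimal sketch (the prefix string below is illustrative, the actual wording is whatever the linked comment specifies):

```python
def add_task_prefix(question, prefix="question answering."):
    # Prepend the task prompt so the seq2seq model sees, e.g.:
    # "question answering. What is the invoice date?"
    return f"{prefix} {question}"
```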
Results of the 3 different finetuning configurations:
| Task prefix | Hyperparameter settings | ANLS score |
|---|---|---|
| No | Unchanged finetuning script | 0.407903 |
| No | Paper's settings | 0.40174 |
| Yes | Unchanged finetuning script | 0.408355 |

Other changes I made:
- Changed to use PyTorch's AdamW, based on "loss does not have a grad fn" https://github.com/microsoft/i-Code/issues/63#issuecomment-1608019905
- Within the `baselines-master` directory in the due-benchmark repo:
  - Applied the fix from "KeyError: 'common_format'" due-benchmark/baselines#7 (comment)
  - In `baselines-master/benchmarker/data/utils.py`, changed the `dtype` of `label_name` from `U100` to `U1024` to prevent truncation of questions during display.

Please assist.
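To illustrate why the dtype change matters: NumPy's fixed-width unicode dtypes silently truncate strings longer than their width (sketch with made-up data):

```python
import numpy as np

# A fixed-width unicode dtype silently truncates longer strings:
short = np.array(["x" * 200], dtype="U100")   # question text cut to 100 chars
wide = np.array(["x" * 200], dtype="U1024")   # full 200-char text preserved
```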
May I ask how you have implemented the ANLS metric for this task?
It should be in this repo: https://github.com/due-benchmark/evaluator/tree/master
Yes, I used ANLS from https://github.com/due-benchmark/evaluator/tree/master.
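For anyone cross-checking: ANLS averages, over questions, the best thresholded normalized Levenshtein similarity against the reference answers (threshold τ = 0.5). A minimal per-question sketch, not the DUE evaluator's exact code:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (a[i - 1] != b[j - 1]))   # substitution
            prev = cur
    return dp[n]

def anls_score(prediction, ground_truths, tau=0.5):
    """Best thresholded similarity of `prediction` against any reference;
    similarities with normalized distance >= tau count as 0."""
    best = 0.0
    for gt in ground_truths:
        p, g = prediction.strip().lower(), gt.strip().lower()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        if nl < tau:
            best = max(best, 1.0 - nl)
    return best
```

The dataset-level ANLS is then the mean of `anls_score` over all questions; the 0.461 in the paper corresponds to that mean.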
I have tried with 10 epochs and my ANLS is still ~0.41. Am I supposed to finetune with even more epochs?
Could you provide me with your finetuned checkpoint?
Also, I would like to double-check that the 46.1 ANLS score is indeed based on finetuning the udop-unimodel-large-224 checkpoint without additional supervised pretraining. Correct?
@zinengtang Any updates?