Closed CristhianBoujon closed 4 years ago
For this dataset, we used the following settings (summarization off): roberta as the language model, the drop_col DA operator, the general DK, and a batch size of 32. The command should look like this:
CUDA_VISIBLE_DEVICES=0 python train_ditto.py \
--task Structured/Beer \
--batch_size 32 \
--max_len 256 \
--lr 3e-5 \
--n_epochs 40 \
--finetuning \
--lm roberta \
--fp16 \
--da drop_col \
--dk general
I just re-ran the experiment and here are the results:
Baseline (no dk or da)
=========eval at epoch=40=========
Validation:
=============Structured/Beer==================
accuracy=0.967
precision=0.867
recall=0.929
f1=0.897
======================================
Test:
=============Structured/Beer==================
accuracy=0.967
precision=0.824
recall=1.000
f1=0.903
======================================
With DK (general) only. Somehow the test result is the same as the baseline.
=========eval at epoch=40=========
Validation:
=============Structured/Beer==================
accuracy=0.967
precision=0.824
recall=1.000
f1=0.903
======================================
Test:
=============Structured/Beer==================
accuracy=0.967
precision=0.824
recall=1.000
f1=0.903
======================================
Ditto (with both dk and da)
=========eval at epoch=40=========
Validation:
=============Structured/Beer==================
accuracy=0.956
precision=0.857
recall=0.857
f1=0.857
======================================
Test:
=============Structured/Beer==================
accuracy=0.978
precision=0.875
recall=1.000
f1=0.933
======================================
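As a quick sanity check on these logs, each printed F1 is consistent with the printed precision and recall (up to rounding). A minimal sketch, using the Ditto test run above as input:

```python
def f1(precision: float, recall: float) -> float:
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Ditto test result above: precision=0.875, recall=1.000
print(round(f1(0.875, 1.000), 3))  # -> 0.933
```

The small discrepancies you may see elsewhere (e.g. 0.903 vs. a recomputed 0.904) come from the precision/recall values themselves being rounded to three digits in the log.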
In the paper's experiments, we reported the test result at the epoch with the highest validation F1 score (here I am simply taking the last epoch).
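The best-epoch selection described above can be sketched as follows (the per-epoch scores and the tuple layout are hypothetical, for illustration only; Ditto's actual log format may differ):

```python
# Each entry: (epoch, validation F1, test F1) -- hypothetical values
history = [
    (38, 0.889, 0.925),
    (39, 0.903, 0.933),
    (40, 0.857, 0.933),
]

# Report the test F1 at the epoch with the highest validation F1,
# rather than simply taking the last epoch.
best_epoch, best_val_f1, reported_test_f1 = max(history, key=lambda r: r[1])
print(best_epoch, reported_test_f1)  # -> 39 0.933
```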
I did notice something strange: the baseline result seems significantly higher than before. I remember that we updated the MixDA code from snippext, which might have had some effect on the data augmentation results too.
Thank you!
I ran the suggested example in the README to try to reproduce the results shown in the paper:
If I'm right, based on the paper I expected an F1 score of around 94.7.
But I just get:
You can also see the training logs here.
What am I doing wrong?