Closed CristhianBoujon closed 4 years ago
For this dataset, we used the following settings (summarization off): roberta as the language model, the drop_col DA operator, the general DK, and a batch size of 32. The command should look like this:
CUDA_VISIBLE_DEVICES=0 python train_ditto.py \
--task Structured/Beer \
--batch_size 32 \
--max_len 256 \
--lr 3e-5 \
--n_epochs 40 \
--finetuning \
--lm roberta \
--fp16 \
--da drop_col \
--dk general
I just re-ran the experiment and here are the results:
Baseline (no dk or da)
=========eval at epoch=40=========
Validation:
=============Structured/Beer==================
accuracy=0.967
precision=0.867
recall=0.929
f1=0.897
======================================
Test:
=============Structured/Beer==================
accuracy=0.967
precision=0.824
recall=1.000
f1=0.903
======================================
With DK (general) only. Somehow the test result is the same as the baseline.
=========eval at epoch=40=========
Validation:
=============Structured/Beer==================
accuracy=0.967
precision=0.824
recall=1.000
f1=0.903
======================================
Test:
=============Structured/Beer==================
accuracy=0.967
precision=0.824
recall=1.000
f1=0.903
======================================
Ditto (with both dk and da)
=========eval at epoch=40=========
Validation:
=============Structured/Beer==================
accuracy=0.956
precision=0.857
recall=0.857
f1=0.857
======================================
Test:
=============Structured/Beer==================
accuracy=0.978
precision=0.875
recall=1.000
f1=0.933
======================================
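As a quick sanity check on these logs, each printed F1 is consistent with the printed precision and recall (up to rounding). A minimal sketch, using the Ditto test run above as input:

```python
def f1(precision: float, recall: float) -> float:
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Ditto test result above: precision=0.875, recall=1.000
print(round(f1(0.875, 1.000), 3))  # -> 0.933
```

The small discrepancies you may see elsewhere (e.g. 0.903 vs. a recomputed 0.904) come from the precision/recall values themselves being rounded to three digits in the log.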
In the paper's experiments, we reported the test result at the epoch with the highest validation F1 score (here I am simply taking the last epoch).
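The best-epoch selection described above can be sketched as follows (the per-epoch scores and the tuple layout are hypothetical, for illustration only; Ditto's actual log format may differ):

```python
# Each entry: (epoch, validation F1, test F1) -- hypothetical values
history = [
    (38, 0.889, 0.925),
    (39, 0.903, 0.933),
    (40, 0.857, 0.933),
]

# Report the test F1 at the epoch with the highest validation F1,
# rather than simply taking the last epoch.
best_epoch, best_val_f1, reported_test_f1 = max(history, key=lambda r: r[1])
print(best_epoch, reported_test_f1)  # -> 39 0.933
```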
I did notice something strange: the baseline result seems significantly higher than before. I remember that we updated the MixDA code from snippext, which might have had some effect on the data augmentation results too.
Thank you!
I ran the suggested example in the README to try to reproduce the results shown in the paper:
If I'm right, based on the paper I expected an F1 score of around 94.7.
But I just get:
You can also see the training logs here.
What am I doing wrong?