thunlp / FewRel

A Large-Scale Few-Shot Relation Extraction Dataset
https://thunlp.github.io/fewrel.html
MIT License
731 stars 165 forks

Strange evaluation results #37

Closed · xiang-deng closed this issue 4 years ago

xiang-deng commented 4 years ago

Hi, I'm using the built-in evaluation function, but I'm getting some strange results: everything comes out better than what's reported in the paper. Is there anything we should pay attention to, such as the choice of Q and val_step, to get the same behavior as the official evaluation script on CodaLab? Right now, for proto, I get 49.87% accuracy after 3000 steps on 5-way 1-shot, and at step 30000 it reports 67.98% eval accuracy.

PRETRAIN=bert-base-uncased
VAL=val_pubmed
python train_demo.py \
    --train train_wiki \
    --val $VAL \
    --test $VAL \
    --trainN 5 \
    --N 5 \
    --K 1 \
    --Q 1 \
    --model proto \
    --encoder bert \
    --hidden_size 768 \
    --val_iter 1000 \
    --val_step 500  \
    --batch_size 2 \
    --grad_iter 2 \
    --pretrain_ckpt pretrain/$PRETRAIN
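
For reference, my understanding is that the reported eval accuracy is just the mean over --val_iter randomly sampled N-way K-shot episodes from the chosen split, roughly like the sketch below (the sample_episode and model interfaces here are made up for illustration, not the actual repo code):

import torch

def evaluate_episodes(model, sample_episode, val_iter=1000):
    # Average accuracy over val_iter randomly sampled N-way K-shot episodes.
    # sample_episode() is a hypothetical helper returning (support, query, labels)
    # for one episode; model(support, query) returns predicted labels for the queries.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for _ in range(val_iter):
            support, query, labels = sample_episode()
            pred = model(support, query)
            correct += (pred == labels).sum().item()
            total += labels.numel()
    return correct / total  # mean accuracy over all query instances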
gaotianyu1350 commented 4 years ago

The paper reports results on the test set, and it is normal for the val set result to differ from the test set result. The README reports some val set results, which you can use as a reference.

xiang-deng commented 4 years ago

But 67.98% is about 20 points higher than the 5-way 1-shot val result shown in the README (it's not the same model, but proto-adv should be better than proto, right?). Can you confirm the above script is correct? Thanks!

gaotianyu1350 commented 4 years ago

The script is correct. You are using the BERT encoder, so it is natural to beat the CNN encoder by a large margin. Note that the distribution of the validation set is quite different from that of the test set, so results on the two can differ a lot.
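
For what it's worth, in a prototypical network the encoder does almost all the work: classification is just nearest-prototype in the embedding space, so swapping the CNN encoder for BERT alone can move accuracy a lot. A rough sketch of the prototype step (shapes and names are illustrative, not the exact repo code):

import torch

def proto_predict(encoder, support, query, N, K):
    # Nearest-prototype classification for one N-way K-shot episode.
    # encoder maps a batch of sentences to [batch, hidden] embeddings
    # (e.g. a CNN or BERT sentence encoder); shapes here are illustrative.
    s_emb = encoder(support).view(N, K, -1)      # [N, K, hidden]
    prototypes = s_emb.mean(dim=1)               # class prototypes: [N, hidden]
    q_emb = encoder(query)                       # [num_query, hidden]
    dists = torch.cdist(q_emb, prototypes)       # Euclidean distance to each prototype
    return dists.argmin(dim=1)                   # nearest prototype = predicted class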