timoschick / pet

This repository contains the code for "Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference"
https://arxiv.org/abs/2001.07676
Apache License 2.0

reproducing the results on RTE #64

Closed: rabeehk closed this issue 2 years ago

rabeehk commented 2 years ago

Hi, I am following [1] and running the command below, but I am not able to reproduce the RTE results. I would really appreciate any suggestions. Thanks.

python3 cli.py --method pet --pattern_ids 0 1 2 3 --data_dir /idiap/user/rkarimi/dev/fewshot/internship/fewshot/temp/pet/FewGLUE_v2/FewGLUE_v2/ --model_type albert --model_name_or_path albert-xxlarge-v2 --task_name rte --output_dir /idiap/temp/rkarimi/temp/experiments/boolq/roberta/supervised--do_train --do_eval --pet_per_gpu_eval_batch_size 8 --pet_per_gpu_train_batch_size 2 --pet_gradient_accumulation_steps 8 --pet_max_steps 250 --pet_max_seq_length 256 --sc_per_gpu_train_batch_size 2 --sc_per_gpu_unlabeled_batch_size 2 --sc_gradient_accumulation_steps 8 --sc_max_steps 5000 --sc_max_seq_length 256

I am getting the following results:

2021-11-06 12:30:25,520 - INFO - modeling - === OVERALL RESULTS ===
2021-11-06 12:30:25,577 - INFO - modeling - dev_acc-p0: 0.5306859205776173 +- 0.0
2021-11-06 12:30:25,577 - INFO - modeling - dev_acc-p1: 0.5415162454873647 +- 0.0
2021-11-06 12:30:25,577 - INFO - modeling - dev_acc-p2: 0.5415162454873647 +- 0.0
2021-11-06 12:30:25,578 - INFO - modeling - dev_acc-p3: 0.5884476534296029 +- 0.0
2021-11-06 12:30:25,578 - INFO - modeling - dev_acc-all-p: 0.5505415162454874 +- 0.02332009048046016
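For reference, the dev_acc-all-p line above is just the mean of the four per-pattern dev accuracies (an illustrative sanity check in Python, not the repository's actual aggregation code):

# Illustrative only, not pet's aggregation code
per_pattern = [0.5306859205776173, 0.5415162454873647,
               0.5415162454873647, 0.5884476534296029]
print(sum(per_pattern) / len(per_pattern))  # 0.5505415162454874 == dev_acc-all-p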

[1] https://github.com/timoschick/pet/issues/19#issuecomment-747483960

timoschick commented 2 years ago

Hi @rabeehk, did you use the exact command that you've posted here? I'm asking because there's a very important space missing. Your command is:

[...] --output_dir /idiap/temp/rkarimi/temp/experiments/boolq/roberta/supervised--do_train [...]

when it should be

[...] --output_dir /idiap/temp/rkarimi/temp/experiments/boolq/roberta/supervised --do_train [...]

With the former command, no training is performed at all (so you're basically getting zero-shot results) and the outputs are written to a directory called /idiap/temp/rkarimi/temp/experiments/boolq/roberta/supervised--do_train. With the latter command, training is performed and outputs are written to /idiap/temp/rkarimi/temp/experiments/boolq/roberta/supervised.
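To illustrate why the space matters, here is a minimal sketch assuming a standard Python argparse setup (not the actual argument definitions in cli.py):

import argparse

# Minimal sketch, not the real cli.py parser
parser = argparse.ArgumentParser()
parser.add_argument("--output_dir")
parser.add_argument("--do_train", action="store_true")
parser.add_argument("--do_eval", action="store_true")

# Missing space: the shell passes "supervised--do_train" as a single token,
# so it becomes part of the output directory and --do_train stays False.
args = parser.parse_args(["--output_dir", "supervised--do_train", "--do_eval"])
print(args.output_dir, args.do_train)  # supervised--do_train False -> no training (zero-shot)

# With the space, training is enabled as intended.
args = parser.parse_args(["--output_dir", "supervised", "--do_train", "--do_eval"])
print(args.output_dir, args.do_train)  # supervised True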

Let me know if this fixes your problem!

savasy commented 2 years ago

With the following command

!python cli.py \
  --method pet \
  --pattern_ids 0 1 2 3 4 \
  --data_dir ./fewglue/FewGLUE/BoolQ \
  --model_type albert \
  --model_name_or_path albert-base-v2 \
  --task_name boolq \
  --output_dir /tmp/pet \
  --do_train \
  --do_eval \
  --pet_per_gpu_eval_batch_size 8 \
  --pet_per_gpu_train_batch_size 2 \
  --pet_gradient_accumulation_steps 8 \
  --pet_max_steps 250 \
  --pet_max_seq_length 256 \
  --sc_per_gpu_train_batch_size 2 \
  --sc_per_gpu_unlabeled_batch_size 2 \
  --sc_gradient_accumulation_steps 8 \
  --sc_max_steps 5000 \
  --sc_max_seq_length 256

and with the following data point distribution

!wc -l fewglue/FewGLUE/BoolQ/*
    32 fewglue/FewGLUE/BoolQ/train.jsonl
  9427 fewglue/FewGLUE/BoolQ/unlabeled.jsonl
  3270 fewglue/FewGLUE/BoolQ/val.jsonl

I got the following results,

acc-p0: 0.5426095820591234 +- 0.0072968522602133755
acc-p1: 0.5753312945973497 +- 0.008348866556347945
acc-p2: 0.5363914373088685 +- 0.006885828898591881
acc-p3: 0.5649337410805301 +- 0.007555023022910462
acc-p4: 0.5482161060142712 +- 0.02863262975653322
acc-all-p: 0.5534964322120286 +- 0.019335782158363467

I think I'm doing something wrong somewhere, because I should be getting a score of around ~79.0.

timoschick commented 2 years ago

Hi @savasy, if your aim is to reproduce our results, you're using the wrong (much smaller) language model: Our experiments are conducted with albert-xxlarge-v2, whereas you are using albert-base-v2 (see also our paper, @rabeehk's command above or this thread).
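For reference, it should be enough to change the checkpoint in your command while keeping the other flags as they are:

[...] --model_type albert --model_name_or_path albert-xxlarge-v2 --task_name boolq [...]

Note that albert-xxlarge-v2 is a much larger model than albert-base-v2, so it may require more GPU memory or a smaller per-GPU batch size.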

rabeehk commented 2 years ago

Hi Timo, thank you. I realized I was using the 64-sample dataset. In the end, I also reimplemented the whole code base, and everything works fine now. Thanks a lot. Best, Rabeeh