timoschick / pet

This repository contains the code for "Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference"
https://arxiv.org/abs/2001.07676
Apache License 2.0

Replicating Petal results #47

Closed · cylnlp closed this issue 2 years ago

cylnlp commented 2 years ago

Hi, do you have any instructions on how to train a petal model and replicate the petal results?

Thanks, Yulong

timoschick commented 2 years ago

Hi @cylnlp, this requires two steps: first, you need to create a set of verbalizers using petal.py; then, you can run regular training with cli.py, passing the file generated by petal.py via the --verbalizer_file argument.

To reproduce our results for PETAL (sep) (in Table 3 of the PETAL paper), you need to use the following parameters when calling petal.py (you can run petal.py --help to learn more about all available parameters):

--output_dir <YOUR_OUTPUT_DIR> \
--data_dir <YOUR_DATA_DIR> \
--words_file <YOUR_WORDS_FILE> \
--task_name <TASK_NAME> \
--pattern_ids <PATTERN_IDS> \
--normalize \
--model_type roberta \
--model_name_or_path roberta-large

For PETAL (joint), you additionally need to add the --combine_patterns flag.
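
Once petal.py has finished, the second step is a regular PET training run with cli.py. As a rough sketch (the placeholders mirror the ones above, the remaining flags follow the repository README, and the verbalizer file is whatever petal.py wrote to your output directory; run cli.py --help to confirm the exact arguments for your version):

python3 cli.py \
--method pet \
--pattern_ids <PATTERN_IDS> \
--data_dir <YOUR_DATA_DIR> \
--model_type roberta \
--model_name_or_path roberta-large \
--task_name <TASK_NAME> \
--output_dir <YOUR_PET_OUTPUT_DIR> \
--verbalizer_file <FILE_GENERATED_BY_PETAL> \
--do_train \
--do_eval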

cylnlp commented 2 years ago

Hi @timoschick, thank you for the reply.

cylnlp commented 2 years ago

Hi @timoschick, do you have the word file for reproducing the results in your paper? Or could you please tell me how the word file is generated? Thanks.

timoschick commented 2 years ago

Hi @cylnlp, the words_file is used here:

if args.words_file:
    # Counter comes from Python's collections module; this builds a map from
    # each whitespace-separated token in the words_file to its frequency
    with open(args.words_file, 'r', encoding='utf8') as fh:
        word_counts = Counter(fh.read().split())

We need these word counts to get a candidate set of the most frequent tokens - as mentioned in the paper:

We collect the 10,000 tokens that occur most frequently in the task’s unlabeled data and denote this filtered vocabulary by T_f.
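
In terms of the snippet above, that filtering step boils down to something along these lines (a sketch, not the exact petal.py code):

candidates = [token for token, _ in word_counts.most_common(10000)]  # the filtered vocabulary T_f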

You can easily generate the words_file yourself by writing your entire unlabeled data into a plain text file.
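
For example, a minimal Python sketch (the file names and the CSV layout, with the text in the columns after the label, are only assumptions; adapt them to however your task's unlabeled data is stored):

# Dump the raw text of every unlabeled example into a single plain-text file
# that can then be passed to petal.py as --words_file.
import csv

with open('unlabeled.csv', encoding='utf8') as src, \
        open('words.txt', 'w', encoding='utf8') as dst:
    for row in csv.reader(src):
        dst.write(' '.join(row[1:]) + '\n')  # skip the label column, keep the raw text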

cylnlp commented 2 years ago

Hi @timoschick, thanks!