timoschick / pet

This repository contains the code for "Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference"
https://arxiv.org/abs/2001.07676
Apache License 2.0

commands for generation #46

Closed by Harry-hash 2 years ago

Harry-hash commented 3 years ago

How should I run the code for generation tasks such as cnn-dailymail?

timoschick commented 3 years ago

Hi @Harry-hash, first, you'll need to check out the feature/genpet branch for that. There are a couple of new features that GenPET uses which, unfortunately, are not mentioned in the paper on arXiv due to an ongoing anonymity period. To train a model (with all of those features enabled) using the same hyperparameters as in the paper, you can use the following command:

python3 cli.py \
    --method pet \
    --wrapper_type generative \
    --pattern_ids 2 3 4 5 \
    --data_dir . \
    --model_type pegasus \
    --model_name_or_path google/pegasus-large \
    --task_name ${TASK} \
    --output_dir ${OUTPUT_DIR} \
    --train_examples ${NUM_EXAMPLES} \
    --test_examples 10000 \
    --unlabeled_examples 1000 \
    --do_eval \
    --learning_rate 1e-4 \
    --eval_set test \
    --pet_per_gpu_eval_batch_size 32 \
    --pet_per_gpu_train_batch_size 2 \
    --pet_gradient_accumulation_steps 4 \
    --output_max_seq_length ${OUTPUT_MAX_SEQ_LENGTH} \
    --pet_max_steps 250 \
    --pet_max_seq_length 512 \
    --sc_per_gpu_train_batch_size 2 \
    --sc_gradient_accumulation_steps 4 \
    --sc_per_gpu_eval_batch_size 32 \
    --sc_max_steps 250 \
    --sc_max_seq_length 512 \
    --optimizer adafactor \
    --epsilon 0.1 \
    --do_train \
    --pet_repetitions 1 \
    --train_data_seed ${TRAIN_DATA_SEED} \
    --multi_pattern_training \
    --untrained_model_scoring \
    --cutoff_percentage 0.2

Here, ${TASK}, ${OUTPUT_DIR}, ${NUM_EXAMPLES}, ${OUTPUT_MAX_SEQ_LENGTH} and ${TRAIN_DATA_SEED} are placeholders for the task name, the output directory, the number of labeled training examples, the maximum length of the generated output, and the seed used to select the training examples, respectively.
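
For example, the placeholders could be set as follows. Note that these values are purely illustrative (they are not from the paper), and the exact task identifier depends on the task names registered in the feature/genpet branch:

export TASK=cnn_dailymail          # assumed task identifier for CNN/DailyMail
export OUTPUT_DIR=./genpet_output  # where trained models and results are written
export NUM_EXAMPLES=10             # number of labeled training examples (few-shot setting)
export OUTPUT_MAX_SEQ_LENGTH=128   # maximum length of the generated output, in tokens
export TRAIN_DATA_SEED=42          # seed used to sample the training examples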

If you don't want to use the new features mentioned above, simply remove the last three lines of the command, i.e., do not use --multi_pattern_training or --untrained_model_scoring, and do not provide a --cutoff_percentage.
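
For reference, getting onto that branch is just standard git, using the repository shown on this page:

git clone https://github.com/timoschick/pet.git
cd pet
git checkout feature/genpet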

Harry-hash commented 3 years ago

Thank you very much for your detailed instructions! @timoschick

However, when I ran the code, a lot of error messages appeared in the terminal, saying "Token indices sequence length is longer than the specified maximum sequence length for this model (1070 > 1024). Running this sequence through the model will result in indexing errors". Is it because the max_length parameter is not specified somewhere during tokenization? I am using transformers==3.3.1.

timoschick commented 3 years ago

If everything else works as expected, you can ignore this error message. It's because PET has its own truncation logic to ensure that the mask token and the pattern are never truncated. Before applying this logic, the entire sequence is tokenized without any truncation, which is why some resulting sequences are longer than the model's maximum sequence length.
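
To illustrate where the message comes from (a minimal sketch using the transformers library directly, not PET's actual code): encoding a long input without truncation prints exactly this warning, and the resulting token ids are still fine as long as they are truncated before being passed to the model.

from transformers import PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")

# A deliberately long input, well over the model's 1024-token limit.
long_text = "some document text " * 500

# encode() without truncation emits "Token indices sequence length is
# longer than the specified maximum sequence length ..." but still
# returns the full list of token ids.
input_ids = tokenizer.encode(long_text)
print(len(input_ids), ">", tokenizer.model_max_length)

# PET later applies its own truncation so that the pattern and the mask
# token survive; a naive equivalent would simply be:
input_ids = input_ids[: tokenizer.model_max_length]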