timoschick / pet

This repository contains the code for "Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference"
https://arxiv.org/abs/2001.07676
Apache License 2.0

commands for generation #46

Closed by Harry-hash 2 years ago

Harry-hash commented 3 years ago

How should I run the code for generation tasks such as cnn-dailymail?

timoschick commented 3 years ago

Hi @Harry-hash, first, you'll need to check out the feature/genpet branch for that. There are a couple of new features that GenPET uses which, unfortunately, are not mentioned in the paper on arXiv due to an ongoing anonymity period. To train a model (with all of those features enabled) using the same hyperparameters as in the paper, you can use the following command:

python3 cli.py \
    --method pet \
    --wrapper_type generative \
    --pattern_ids 2 3 4 5 \
    --data_dir . \
    --model_type pegasus \
    --model_name_or_path google/pegasus-large \
    --task_name ${TASK} \
    --output_dir ${OUTPUT_DIR} \
    --train_examples ${NUM_EXAMPLES} \
    --test_examples 10000 \
    --unlabeled_examples 1000 \
    --do_eval \
    --learning_rate 1e-4 \
    --eval_set test \
    --pet_per_gpu_eval_batch_size 32 \
    --pet_per_gpu_train_batch_size 2 \
    --pet_gradient_accumulation_steps 4 \
    --output_max_seq_length ${OUTPUT_MAX_SEQ_LENGTH} \
    --pet_max_steps 250 \
    --pet_max_seq_length 512 \
    --sc_per_gpu_train_batch_size 2 \
    --sc_gradient_accumulation_steps 4 \
    --sc_per_gpu_eval_batch_size 32 \
    --sc_max_steps 250 \
    --sc_max_seq_length 512 \
    --optimizer adafactor \
    --epsilon 0.1 \
    --do_train \
    --pet_repetitions 1 \
    --train_data_seed ${TRAIN_DATA_SEED} \
    --multi_pattern_training \
    --untrained_model_scoring \
    --cutoff_percentage 0.2

Here, ${TASK}, ${OUTPUT_DIR}, ${NUM_EXAMPLES}, ${OUTPUT_MAX_SEQ_LENGTH} and ${TRAIN_DATA_SEED} are placeholders for the task name, the output directory, the number of labeled training examples, the maximum length of the generated output, and the seed used to select the training examples, respectively.
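
For example, the placeholders could be set as follows. Note that these values are purely illustrative (they are not from the paper), and the exact task identifier depends on the task names registered in the feature/genpet branch:

export TASK=cnn_dailymail          # assumed task identifier for CNN/DailyMail
export OUTPUT_DIR=./genpet_output  # where trained models and results are written
export NUM_EXAMPLES=10             # number of labeled training examples (few-shot setting)
export OUTPUT_MAX_SEQ_LENGTH=128   # maximum length of the generated output, in tokens
export TRAIN_DATA_SEED=42          # seed used to sample the training examples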

If you don't want to use the new features mentioned above, simply remove the last three lines of the command, i.e., do not use --multi_pattern_training or --untrained_model_scoring, and do not provide a --cutoff_percentage.
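
For reference, getting onto that branch is just standard git, using the repository shown on this page:

git clone https://github.com/timoschick/pet.git
cd pet
git checkout feature/genpet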

Harry-hash commented 3 years ago

Thank you very much for your detailed instructions! @timoschick

However, when I ran the code, a lot of error messages appeared in the terminal, saying "Token indices sequence length is longer than the specified maximum sequence length for this model (1070 > 1024). Running this sequence through the model will result in indexing errors". Is it because the max_length parameter is not specified somewhere during tokenization? I am using transformers==3.3.1.

timoschick commented 3 years ago

If everything else works as expected, you can ignore this error message. It's because PET has its own truncation logic to ensure that the mask token and the pattern are never truncated. Before applying this logic, the entire sequence is tokenized without any truncation, which is why some resulting sequences are longer than the model's maximum sequence length.
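
To illustrate where the message comes from (a minimal sketch using the transformers library directly, not PET's actual code): encoding a long input without truncation prints exactly this warning, and the resulting token ids are still fine as long as they are truncated before being passed to the model.

from transformers import PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")

# A deliberately long input, well over the model's 1024-token limit.
long_text = "some document text " * 500

# encode() without truncation emits "Token indices sequence length is
# longer than the specified maximum sequence length ..." but still
# returns the full list of token ids.
input_ids = tokenizer.encode(long_text)
print(len(input_ids), ">", tokenizer.model_max_length)

# PET later applies its own truncation so that the pattern and the mask
# token survive; a naive equivalent would simply be:
input_ids = input_ids[: tokenizer.model_max_length]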