princeton-nlp / TransformerPrograms

[NeurIPS 2023] Learning Transformer Programs
https://arxiv.org/abs/2306.01128
157 stars 22 forks source link

In-context Learning task #4

Open ldqvinh opened 10 months ago

ldqvinh commented 10 months ago

Hello, thanks for your work. I tried the in-context learning training command from the experiment details, but encountered a 'loss is NaN' error. Could you share the command you used? Appreciate it.

python src/run.py \
     --dataset "induction" \
     --vocab_size 10 \
     --dataset_size 20000 \
     --min_length 1 \
     --max_length 10 \
     --n_epochs 250 \
     --batch_size 512 \
     --lr "5e-2" \
     --n_layers 2 \
     --n_heads_cat 1 \
     --n_heads_num 0 \
     --n_cat_mlps 1 \
     --n_num_mlps 0 \
     --one_hot_embed \
     --count_only \
     --seed 0 \
     --save \
     --save_code \
     --output_dir "output/induction"
danfriedman0 commented 10 months ago

Thanks for taking an interest in the code! I'm not immediately sure what the issue could be, but some things to try are:

  1. Set --min_length 10 and --max_length 10
  2. Set the --autoregressive flag.
  3. Set --n_cat_mlps 0 (no MLPs)
  4. Set --n_epochs 500.

Note that you likely need to try a number of random seeds to get a model that successfully learns the task. To save time, we also used a "patience" of 25 (this is a possible argument to the run_training function, although you would need to modify src/run.py to make it a command-line flag).
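For reference, a minimal sketch of how `patience` could be promoted to a command-line flag. The flag name and the default of 25 come from the discussion above; how it is wired into `src/run.py` and `run_training` is an assumption, not the repo's actual code.

```python
import argparse

# Hypothetical argparse wiring for src/run.py (sketch, not the actual file).
parser = argparse.ArgumentParser()
parser.add_argument(
    "--patience",
    type=int,
    default=25,
    help="epochs to wait without improvement before stopping early",
)
args = parser.parse_args(["--patience", "25"])
# ...then pass it through, e.g.: run_training(patience=args.patience)
print(args.patience)
```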

Could you also share any more details about what you observe? Do you get "loss is NaN" right away, or only after some training?

Wangcheng-Xu commented 10 months ago

Hi @danfriedman0,

I also had trouble replicating the induction experiment. The command was as suggested above and is copied below. I used a modified file "experiment_run_n.py" that takes an additional "patience" argument and moves on to the next seed whenever a training run exhausts its patience. Training returns a constant loss of 5.81e+29 from seed 0 all the way to 100. By the way, some other experiments did work, such as "sort" and "reverse".

CUDA_VISIBLE_DEVICES=0 python experiment_run_n.py \
     --dataset "induction" \
     --vocab_size 10 \
     --dataset_size 20000 \
     --min_length 10 \
     --max_length 10 \
     --n_epochs 500 \
     --batch_size 512 \
     --patience 25 \
     --lr "5e-2" \
     --n_layers 2 \
     --n_heads_cat 1 \
     --n_heads_num 0 \
     --n_cat_mlps 0 \
     --n_num_mlps 0 \
     --one_hot_embed \
     --count_only \
     --autoregressive \
     --seed 0 \
     --save \
     --save_code \
     --output_dir "output/induction"
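The seed-sweep logic described above can be sketched as a small shell loop. This is a hypothetical wrapper, not the actual "experiment_run_n.py"; `train_one_seed` stands in for the full `python src/run.py ... --seed "$seed"` invocation, and its success condition here is a placeholder purely to illustrate the loop.

```shell
#!/bin/sh
# Hypothetical seed sweep: rerun training with successive seeds until one
# run converges (exits 0), as the modified script above does.
train_one_seed() {
  # Placeholder for the real training command; here it "converges" at seed 2.
  [ "$1" -ge 2 ]
}

for seed in 0 1 2 3 4; do
  if train_one_seed "$seed"; then
    echo "converged with seed $seed"
    break
  fi
done
```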

Also, it would be great if you could share the configurations for replicating all the experiments in the paper, like the ones for "sort" and "conll_ner" in the README.md. Thanks!

danfriedman0 commented 10 months ago

Hi all, sorry for the trouble, and thanks for the additional detail.

I think I found the main problem: you need to set --unembed_mask 0. This flag is set to 1 by default, which prevents the model from predicting pad or unk as the output token, but the unk token is a valid prediction for this task. I have uploaded a script with a command that works for me (on around 20% of seeds).
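To illustrate why this causes the NaN/blow-up, here is a minimal plain-Python sketch (an assumption about the mechanism, not the repo's code): masking a token's unembedding effectively sets its output logit to negative infinity, so when that masked token is the correct target the cross-entropy loss is infinite.

```python
import math

def masked_nll(logits, target, masked_ids):
    # Mask disallowed tokens (e.g. pad/unk) by forcing their logits to -inf,
    # mimicking what an unembedding mask effectively does.
    logits = [(-math.inf if i in masked_ids else x) for i, x in enumerate(logits)]
    # Numerically stable log-softmax, then negative log-likelihood.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -(logits[target] - log_z)

# Suppose unk has id 1 and is a valid target for induction, but the mask
# forbids it: the loss diverges.
print(masked_nll([0.5, 2.0, 1.0], target=1, masked_ids={0, 1}))  # inf
# With unk unmasked, the loss is finite.
print(masked_nll([0.5, 2.0, 1.0], target=1, masked_ids={0}))
```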

@Wangcheng-Xu : The scripts directory contains configurations used for the other experiments in the paper. Please let me know if you have any more questions.

Wangcheng-Xu commented 10 months ago

Thank you! I have tested the fixed configuration for the induction task, which works for me.

ldqvinh commented 10 months ago

Thank you to everyone involved for identifying and resolving the issue. The updated configuration for the induction task is now functioning perfectly on my end as well.