huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

During the execution of XPO, a 'tokenizer' KeyError suddenly occurred in callbacks.py #2264

ArcherShirou opened this issue 1 day ago

ArcherShirou commented 1 day ago

System Info

Information

Tasks

Reproduction

I encountered a troubling issue while running the XPO program: the first 500 steps ran smoothly, but then an error suddenly occurred mid-training, as shown below:

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
 21%|█████████████████▎                                                                | 499/2361 [7:31:45<28:06:19, 54.34s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
 21%|█████████████████▎                                                                | 500/2361 [7:32:39<28:04:17, 54.30s/it]Traceback (most recent call last):
  File "/llm-align/trl/xpo.py", line 118, in <module>
    trainer.train()
  File "/llm-align/miniconda3/envs/trl/lib/python3.10/site-packages/transformers/trainer.py", line 2112, in train
    return inner_training_loop(
  File "/llm-align/miniconda3/envs/trl/lib/python3.10/site-packages/transformers/trainer.py", line 2533, in _inner_training_loop
    self.control = self.callback_handler.on_step_end(args, self.state, self.control)
  File "/llm-align/miniconda3/envs/trl/lib/python3.10/site-packages/transformers/trainer_callback.py", line 496, in on_step_end
    return self.call_event("on_step_end", args, state, control)
  File "/llm-align/miniconda3/envs/trl/lib/python3.10/site-packages/transformers/trainer_callback.py", line 518, in call_event
    result = getattr(callback, event)(
  File "/llm-align/trl/trl/trainer/callbacks.py", line 404, in on_step_end
    tokenizer = kwargs["tokenizer"]
KeyError: 'tokenizer'
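For reference, the failing line reads kwargs["tokenizer"] directly. Below is a defensive sketch of that lookup (my own guess, not the upstream patch; it assumes newer transformers versions pass the object to callbacks under "processing_class" rather than "tokenizer", which would explain the KeyError):

from transformers import TrainerCallback

class SafeLogCompletionsCallback(TrainerCallback):
    """Hypothetical sketch only; not the actual callback shipped in trl."""

    def on_step_end(self, args, state, control, **kwargs):
        # Accept either the newer "processing_class" key or the old "tokenizer" key,
        # and skip logging if neither was passed by the CallbackHandler.
        tokenizer = kwargs.get("processing_class") or kwargs.get("tokenizer")
        if tokenizer is None:
            return control
        # ... decode and log sample completions here, as LogCompletionsCallback does ...
        return control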

Prior to this, the LogCompletionsCallback was running normally and produced records like the following:

{'loss': 0.6948, 'grad_norm': 0.6043311953544617, 'learning_rate': 4.826991329231957e-06, 'loss/dpo': 0.6947265625, 'loss/xpo': -0.000594329833984375, 'objective/kl': -0.00389404296875, 'objective/entropy': 56.7625, 'objective/model_scores': -3.3757302939891813, 'objective/ref_scores': -3.0598522454500197, 'objective/scores_margin': -0.3158780336380005, 'rewards/chosen': -0.00250091552734375, 'rewards/rejected': 0.000923919677734375, 'rewards/accuracies': 0.36875, 'rewards/margins': -0.0034198760986328125, 'logps/chosen': -109.475, 'logps/rejected': -117.65, 'val/model_contain_eos_token': 0.0, 'val/ref_contain_eos_token': 0.0, 'alpha': 1e-05, 'beta': 0.10000000000000002, 'epoch': 0.21}

I use the [trl-lib/ultrafeedback-prompt](https://huggingface.co/datasets/trl-lib/ultrafeedback-prompt) prompt-only dataset, formatted like this:

[
    {
        "prompt": "create a table with 5 meals per day for 2 days, this is prepared for a 20 year old female. \nit sould be vegan, i should not contain nuts.\nshow a table with the meal, description, calorie count \nshow it in this style:\nDay n\nMeal n: meal name\n\nn ingredient\nn ingredient\nn ingredient\nn calories"
    },
    {
        "prompt": "In this task you will be given a list of integers. You should find the maximum absolute difference between 2 integers in the list. The absolute difference is the absolute value of one integer subtracted by another. The output should be a single integer which is the largest possible absolute distance.\nQ: [31, 28, -27]\nA:"
    },
...
]
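For completeness, here is a sketch of how local JSON files in the array-of-records layout above can be exported from the Hub dataset (an assumption on my part, relying on the dataset's train/test splits and on datasets.Dataset.to_json with lines=False; the paths are the ones used in the script further down):

# Sketch (assumption, not from the original report): dump the Hub dataset to the
# local JSON files referenced by the training script.
from datasets import load_dataset

ds = load_dataset("trl-lib/ultrafeedback-prompt")
ds["train"].to_json("/llm-align/ultrafeedback-prompt-train.json", lines=False)
ds["test"].to_json("/llm-align/ultrafeedback-prompt-test.json", lines=False)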

Could you please advise on how to resolve this bug? Thanks

More Info

My launch script is:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
deepspeed  --num_gpus 8 --master_port=29501  xpo.py \
    --deepspeed ds_config.json \
    --do_train \
    --model_name_or_path  /llm-align/qwen2.5-14B-update2 \
    --reward_model_path /llm-align/qwen2-0.5B-reward \
    --dataset_name ultrafeedback \
    --learning_rate 5.0e-6 \
    --beta 0.1 \
    --torch_dtype bfloat16 \
    --output_dir /llm-align/qwen2.5-14B-xpo-lora \
    --num_train_epochs 1 \
    --max_new_tokens 64 \
    --warmup_ratio 0.1 \
    --missing_eos_penalty 1.0 \
    --overwrite_output_dir \
    --logging_steps 10 \
    --optim paged_adamw_32bit \
    --save_steps 100 \
    --save_total_limit 5 \
    --lr_scheduler_type 'cosine' \
    --load_in_4bit \
    --use_bnb_nested_quant \
    --use_peft  \
    --lora_r 16 \
    --lora_alpha 16 \
    --lora_target_modules all-linear \
    --attn_implementation flash_attention_2 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --ddp_timeout 180000000

and I revised the official xpo.py as follows:

dataset = load_dataset('json', data_files={'train': '/llm-align/ultrafeedback-prompt-train.json',
                                           'test': '/llm-align/ultrafeedback-prompt-test.json'})   # use local dataset

trainer = XPOTrainer(
    model=model,
    ref_model=ref_model,
    reward_model=reward_model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    processing_class=tokenizer,
    peft_config=get_peft_config(model_config),  # add this line
)

model.save_pretrained(training_args.output_dir)  # save the LoRA model
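Assuming the output directory ends up holding a LoRA adapter, as the comment above suggests, it can later be re-attached to the base model for inference; a sketch (paths copied from the launch script; the merge step is optional and assumes peft is installed):

# Sketch, not part of the report: reload the saved LoRA adapter onto the base model.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "/llm-align/qwen2.5-14B-update2", torch_dtype="bfloat16"
)
model = PeftModel.from_pretrained(base, "/llm-align/qwen2.5-14B-xpo-lora")
model = model.merge_and_unload()  # optionally merge the adapter into the base weights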

Expected behavior

No error: training should run to completion without the KeyError.

qgallouedec commented 1 day ago

Thanks for reporting, it should have been fixed by #2261. Can you confirm?

ArcherShirou commented 11 hours ago

Thank you for your response. After updating the code and testing it, everything is running smoothly now. For the 14B and 72B models, quantization is necessary when using the 0.5B reward model. However, if I switch to the 70B or 72B reward model, I still encounter out-of-memory (OOM) issues midway, even with quantization and LoRA applied. Do you have any good solutions for this?