huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

[bug in the document] dataset format for RewardTrainer #2164

Closed · yananchen1989 closed this 1 week ago

yananchen1989 commented 2 weeks ago

System Info

trl version > v0.11

Reproduction

In the official document https://huggingface.co/docs/trl/main/en/reward_trainer, the [RewardTrainer] requires an [implicit prompt preference dataset].

However, in the example script https://github.com/huggingface/trl/blob/main/examples/scripts/reward_modeling.py#L18, I see the example is using trl-lib/ultrafeedback_binarized, which is not a so-called "implicit prompt preference dataset", as the prompt is explicitly provided in the dataset.

Could you look into this conflict? Thanks.

Expected behavior

Code and documentation alignment.

yananchen1989 commented 2 weeks ago

Here I tested Anthropic/hh-rlhf and trl-lib/ultrafeedback_binarized as the dataset_name, but neither works.

(I did not change anything in reward_modeling.py, which is directly cloned from the trl repo.)

CUDA_VISIBLE_DEVICES=0 python ~/trl/examples/scripts/reward_modeling.py \
    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
    --dataset_name  ${ds} \
    --output_dir Qwen2-0.5B-Reward-LoRA \
    --per_device_train_batch_size 8 \
    --num_train_epochs 1 \
    --gradient_checkpointing True \
    --learning_rate 1.0e-4 \
    --logging_steps 25 \
    --eval_strategy steps \
    --eval_steps 50 \
    --max_length 2048 \
    --use_peft \
    --lora_r 32 \
    --lora_alpha 16

Traceback (most recent call last):
  File "/workspace/trl/examples/scripts/reward_modeling.py", line 120, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2052, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2345, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 550, in __iter__
    current_batch = next(dataloader_iter)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/utils.py", line 362, in __call__
    raise ValueError(
ValueError: The features should include input_ids_chosen, attention_mask_chosen, input_ids_rejected and attention_mask_rejected
0%| | 0/20100 [00:00<?, ?it/s]
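For reference, the ValueError above is about pre-tokenized preference features. A minimal sketch (illustrative only, not the official example script; the column names are taken from the error message) of how such features can be produced with datasets and transformers:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

def tokenize_pair(example):
    # Render the conversational chosen/rejected message lists to strings,
    # then tokenize them into the four columns the v0.11 reward collator checks for.
    chosen_text = tokenizer.apply_chat_template(example["chosen"], tokenize=False)
    rejected_text = tokenizer.apply_chat_template(example["rejected"], tokenize=False)
    chosen = tokenizer(chosen_text)
    rejected = tokenizer(rejected_text)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

dataset = dataset.map(tokenize_pair)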

yananchen1989 commented 2 weeks ago

On this page: https://huggingface.co/docs/trl/v0.11.1/en/reward_trainer#reward-modeling

I see a conflict (see the two attached screenshots).

qgallouedec commented 1 week ago

In the official document https://huggingface.co/docs/trl/main/en/reward_trainer, the [RewardTrainer] requires an [implicit prompt preference dataset]. I see the example is using trl-lib/ultrafeedback_binarized, which is not a so-called "implicit prompt preference dataset", as the prompt is explicitly provided in the dataset.

trl-lib/ultrafeedback_binarized is an implicit prompt dataset since it doesn't have a prompt column. You can see that there is a common start ({'content': 'Use the pygame library to write a version of the classic game Snake, with a unique twist', 'role': 'user'}); this is the so-called implicit prompt:

>>> from datasets import load_dataset
>>> dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
>>> dataset.column_names
['chosen', 'rejected', 'score_chosen', 'score_rejected']
>>> dataset[0]
{'chosen': [{'content': 'Use the pygame library to write a version of the classic game Snake, with a unique twist', 'role': 'user'}, {'content': "Sure, I'd be happy to help you write a version of the classic game Snake using the pygame library! ...", 'role': 'assistant'}],
'rejected': [{'content': 'Use the pygame library to write a version of the classic game Snake, with a unique twist', 'role': 'user'}, {'content': 'Sure, here\'s an example of how to write a version of Snake game with a unique twist using the Pygame library:...', 'role': 'assistant'}], 'score_chosen': 6.0, 'score_rejected': 4.0}
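To make the distinction concrete, here is a minimal sketch (illustrative only, not relying on any TRL helper) that recovers the implicit prompt by splitting off the messages shared by chosen and rejected:

from datasets import load_dataset

def split_prompt(example):
    # The implicit prompt is the common prefix of messages shared by both sides.
    chosen, rejected = example["chosen"], example["rejected"]
    i = 0
    while i < min(len(chosen), len(rejected)) and chosen[i] == rejected[i]:
        i += 1
    return {"prompt": chosen[:i], "chosen": chosen[i:], "rejected": rejected[i:]}

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
dataset = dataset.map(split_prompt)
print(dataset[0]["prompt"])  # the shared user turn, i.e. the implicit prompt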

qgallouedec commented 1 week ago

Here I tested Anthropic/hh-rlhf and trl-lib/ultrafeedback_binarized as the dataset_name, but neither works.

The provided code works fine on my side:

python ../examples/scripts/reward_modeling.py \
     --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
     --dataset_name  trl-lib/ultrafeedback_binarized \
     --output_dir Qwen2-0.5B-Reward-LoRA \
     --per_device_train_batch_size 8 \
     --num_train_epochs 1 \
     --gradient_checkpointing True \
     --learning_rate 1.0e-4 \
     --logging_steps 25 \
     --eval_strategy steps \
     --eval_steps 50 \
     --max_length 2048 \
     --use_peft \
     --lora_r 32 \
     --lora_alpha 16

If the error persists, please provide your full system info (see bug issue template)

qgallouedec commented 1 week ago

The reward trainer data support has recently been updated (#2102). See the latest version of the docs for more info: https://huggingface.co/docs/trl/main/en/reward_trainer
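With the updated data support, the main-branch docs describe passing the raw preference dataset directly to the trainer. Roughly, a sketch (dev-version behavior assumed; argument names such as tokenizer vs. processing_class may differ between releases):

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

# Implicit prompt preference dataset: only "chosen"/"rejected" message lists.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = RewardConfig(output_dir="Qwen2-0.5B-Reward", per_device_train_batch_size=8)
trainer = RewardTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,  # newer releases may call this processing_class
    train_dataset=dataset,
)
trainer.train()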

yananchen1989 commented 1 week ago

In the official document https://huggingface.co/docs/trl/main/en/reward_trainer, the [RewardTrainer] requires an [implicit prompt preference dataset]. I see the example is using trl-lib/ultrafeedback_binarized, which is not a so-called "implicit prompt preference dataset", as the prompt is explicitly provided in the dataset.

trl-lib/ultrafeedback_binarized is an implicit prompt dataset since it doesn't have a prompt column. You can see that there is a common start ({'content': 'Use the pygame library to write a version of the classic game Snake, with a unique twist', 'role': 'user'}); this is the so-called implicit prompt.

I see, got it. trl-lib/ultrafeedback_binarized and Anthropic/hh-rlhf are in the same boat.

yananchen1989 commented 1 week ago

CUDA_VISIBLE_DEVICES=0 python /home/ubuntu/trl/examples/scripts/reward_modeling.py \
     --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
     --dataset_name  trl-lib/ultrafeedback_binarized \
     --output_dir Qwen2-0.5B-Reward-LoRA \
     --per_device_train_batch_size 8 \
     --num_train_epochs 1 \
     --gradient_checkpointing True \
     --learning_rate 1.0e-4 \
     --logging_steps 25 \
     --eval_strategy steps \
     --eval_steps 50 \
     --max_length 2048 \
     --use_peft \
     --lora_r 16 \
     --lora_alpha 16

error:

Traceback (most recent call last):
  File "/home/ubuntu/trl/examples/scripts/reward_modeling.py", line 120, in <module>
    trainer.train()
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/transformers/trainer.py", line 2052, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/transformers/trainer.py", line 2345, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/accelerate/data_loader.py", line 550, in __iter__
    current_batch = next(dataloader_iter)
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 673, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/trl/trainer/utils.py", line 362, in __call__
    raise ValueError(
ValueError: The features should include input_ids_chosen, attention_mask_chosen, input_ids_rejected and attention_mask_rejected
0%| | 0/7767 [00:00<?, ?it/s]

trl version: 0.11.1

By the way, trl env does not work:

Traceback (most recent call last):
  File "/opt/conda/envs/trl11/bin/trl", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/trl/commands/cli.py", line 38, in main
    raise ValueError(
ValueError: Please use one of the supported commands, got env - supported commands are ['sft', 'dpo', 'chat', 'kto']

yananchen1989 commented 1 week ago

python version: 3.11.10

qgallouedec commented 1 week ago

I've downgraded to v0.11.1 and I still can't reproduce the error.

by the way, trl env does not work:

trl env requires trl>=0.12. Can you run transformers-cli env instead?

Can you also confirm that you have not modified the codebase?
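(For reference, a minimal way to report the relevant package versions without the trl CLI, using only the standard library; the package list below is just a suggestion:)

from importlib.metadata import version, PackageNotFoundError

for pkg in ("trl", "transformers", "datasets", "accelerate", "torch", "peft"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")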

yananchen1989 commented 1 week ago
  • transformers version: 4.45.1
  • Platform: Linux-5.15.0-1061-aws-x86_64-with-glibc2.31
  • Python version: 3.11.10
  • Huggingface_hub version: 0.25.1
  • Safetensors version: 0.4.5
  • Accelerate version: 0.34.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A10G

@qgallouedec

yananchen1989 commented 1 week ago

I ran git pull in /home/ubuntu/trl/.

Therefore everything is up to date, including examples/scripts/reward_modeling.py.

yananchen1989 commented 1 week ago

I installed trl via pip install -U trl.

qgallouedec commented 1 week ago

I still can't reproduce; I tried reinstalling everything, and it still works. Can you try the same? Also, try clearing your cache.

python3.11 -m venv env
source env/bin/activate
pip install trl[peft]==0.11.1
curl -O https://raw.githubusercontent.com/huggingface/trl/86ad7a7e85dc65c79bd9759097709a27ad1a58dd/examples/scripts/reward_modeling.py
python reward_modeling.py \
     --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
     --dataset_name  trl-lib/ultrafeedback_binarized \
     --output_dir Qwen2-0.5B-Reward-LoRA \
     --per_device_train_batch_size 8 \
     --num_train_epochs 1 \
     --gradient_checkpointing True \
     --learning_rate 1.0e-4 \
     --logging_steps 25 \
     --eval_strategy steps \
     --eval_steps 50 \
     --max_length 2048 \
     --use_peft \
     --lora_r 32 \
     --lora_alpha 16
Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen2-0.5B-Instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/fsx/qgallouedec/trl/tmp/reward_modeling.py:108: UserWarning: You are using a `task_type` that is different than `SEQ_CLS` for PEFT. This will lead to silent bugs Make sure to pass --lora_task_type SEQ_CLS when using this script with PEFT.
  warnings.warn(
Filter: 100%|█████████████████████████████████████████████████████████████████| 62135/62135 [00:29<00:00, 2121.63 examples/s]
Filter: 100%|███████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 1926.69 examples/s]
/fsx/qgallouedec/trl/tmp/env/lib/python3.11/site-packages/trl/trainer/reward_trainer.py:199: UserWarning: When using RewardDataCollatorWithPadding, you should set `remove_unused_columns=False` in your RewardConfig we have set it for you, but you should do it yourself in the future.
  warnings.warn(
  0%|                                                                                               | 0/7750 [00:00<?, ?it/s]You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/fsx/qgallouedec/trl/tmp/env/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2855: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
  warnings.warn(
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/fsx/qgallouedec/trl/tmp/env/lib/python3.11/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.7755, 'grad_norm': 3.030179262161255, 'learning_rate': 9.967741935483872e-05, 'epoch': 0.0}                       
{'loss': 0.71, 'grad_norm': 4.013882160186768, 'learning_rate': 9.935483870967742e-05, 'epoch': 0.01}                        
  1%|▌                                                                                   | 50/7750 [00:49<2:23:07,  1.12s/it]┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ chosen_text                                       ┃ rejected_text                                      ┃ logits           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ <|im_start|>system                                │ <|im_start|>system                                 │ [0.5081, 0.4919] │
│ You are a helpful assistant.<|im_end|>            │ You are a helpful assistant.<|im_end|>             │                  │
│ <|im_start|>user                                  │ <|im_start|>user                                   │                  │
│ As an HR manager, you want to test a potential    │ As an HR manager, you want to test a potential     │                  │
│ employee's ability to solve puzzles to determine  │ employee's ability to solve puzzles to determine   │                  │
│ their suitability for a job. Write a Python       │ their suitability for a job. Write a Python script │                  │
│ script that generates a list of questions that    │ that generates a list of questions that require    │                  │
│ require logical reasoning to answer. Your list    │ logical reasoning to answer. Your list should      │                  │
│ should include questions related to mathematical  │ include questions related to mathematical puzzles, │                  │
│ puzzles, language puzzles, logic puzzles, lateral │ language puzzles, logic puzzles, lateral thinking  │                  │
│ thinking puzzles, and pattern recognition         │ puzzles, and pattern recognition puzzles. Use the  │                  │
│ puzzles. Use the following code as a starting     │ following code as a starting point:                │                  │
│ point:                                            │ questions = {                                      │                  │
│ questions = {                                     │     "Mathematical puzzles": ["If the value of x+y  │                  │
│     "Mathematical puzzles": ["If the value of x+y │ = 20 and x-y = 10, what is the value of x and y?", │                  │
│ = 20 and x-y = 10, what is the value of x and     │ "If a pizza has a radius of 8 inches and is cut    │                  │
│ y?", "If a pizza has a radius of 8 inches and is  │ into 6 equal slices, what is the area of each      │ 
...

yananchen1989 commented 1 week ago

@qgallouedec

reward_modeling.py from your https://raw.githubusercontent.com/huggingface/trl/86ad7a7e85dc65c79bd9759097709a27ad1a58dd/examples/scripts/reward_modeling.py does work fine.

But the script from https://github.com/huggingface/trl/blob/main/examples/scripts/reward_modeling.py does not work.

I do see there are a lot of differences between them.

qgallouedec commented 1 week ago

The latter is the script for the dev version; you can't use trl 0.11 with it.
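One way to avoid the mismatch is to fetch the example script that matches the installed release. A sketch (assuming the release is tagged on GitHub as v<version>, e.g. v0.11.1):

from importlib.metadata import version
from urllib.request import urlretrieve

tag = f"v{version('trl')}"  # e.g. "v0.11.1"; assumes the release tag follows this pattern
url = f"https://raw.githubusercontent.com/huggingface/trl/{tag}/examples/scripts/reward_modeling.py"
urlretrieve(url, "reward_modeling.py")
print(f"downloaded reward_modeling.py from tag {tag}")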

yananchen1989 commented 1 week ago

OK, then I will wait for the dev version to be released. Thanks. @qgallouedec

yananchen1989 commented 1 week ago

Hi, just reopening this ticket.

Although trl-lib/ultrafeedback_binarized works fine with reward_modeling.py in trl version 0.11.2, I also see that something goes wrong when using the dataset Anthropic/hh-rlhf.

This dataset is used as an example in https://huggingface.co/docs/trl/v0.11.2/reward_trainer

error message:

Traceback (most recent call last):
  File "/workspace/trl/examples/scripts/reward_modeling.py", line 140, in <module>
    dataset = dataset.map(
  File "/usr/local/lib/python3.10/dist-packages/datasets/dataset_dict.py", line 866, in map
    {
  File "/usr/local/lib/python3.10/dist-packages/datasets/dataset_dict.py", line 867, in <dictcomp>
    k: dataset.map(
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 560, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3035, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3408, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3300, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/workspace/trl/examples/scripts/reward_modeling.py", line 141, in <lambda>
    lambda x: {"chosen": chosen_fn(x), "rejected": rejected_fn(x)}, num_proc=config.dataset_num_proc
  File "/usr/local/lib/python3.10/dist-packages/trl/extras/dataset_formatting.py", line 43, in format_dataset
    return tokenizer.apply_chat_template(examples[messages_field], tokenize=False)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 1875, in apply_chat_template
    rendered_chat = compiled_template.render(
  File "/usr/local/lib/python3.10/dist-packages/jinja2/environment.py", line 1301, in render
    self.environment.handle_exception()
  File "/usr/local/lib/python3.10/dist-packages/jinja2/environment.py", line 936, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "