Closed yananchen1989 closed 1 week ago
Here I test Anthropic/hh-rlhf and trl-lib/ultrafeedback_binarized as the dataset_name, but neither works. (I do not change anything in reward_modeling.py, which is cloned directly from the trl repo.)
CUDA_VISIBLE_DEVICES=0 python ~/trl/examples/scripts/reward_modeling.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name ${ds} \
--output_dir Qwen2-0.5B-Reward-LoRA \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--gradient_checkpointing True \
--learning_rate 1.0e-4 \
--logging_steps 25 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048 \
--use_peft \
--lora_r 32 \
--lora_alpha 16
Traceback (most recent call last):
  File "/workspace/trl/examples/scripts/reward_modeling.py", line 120, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2052, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2345, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 550, in __iter__
    current_batch = next(dataloader_iter)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/utils.py", line 362, in __call__
    raise ValueError(
ValueError: The features should include input_ids_chosen, attention_mask_chosen, input_ids_rejected and attention_mask_rejected
  0%|          | 0/20100 [00:00<?, ?it/s]
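For context, the collator in trl/trainer/utils.py expects each example to already contain those four tokenized columns; reward_modeling.py is supposed to create them with a dataset.map step before training starts. A rough sketch of that kind of preprocessing (an illustration under that assumption, not the exact code in the script):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
raw = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

def tokenize_pair(example):
    # Render each conversational preference pair to text, then tokenize
    # chosen and rejected separately into the columns the collator checks for.
    chosen_text = tokenizer.apply_chat_template(example["chosen"], tokenize=False)
    rejected_text = tokenizer.apply_chat_template(example["rejected"], tokenize=False)
    chosen = tokenizer(chosen_text)
    rejected = tokenizer(rejected_text)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

raw = raw.map(tokenize_pair)

If that mapping step never runs, or runs on a dataset in a format it does not expect, the collator sees the raw chosen/rejected columns and raises exactly this ValueError.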
On this page https://huggingface.co/docs/trl/v0.11.1/en/reward_trainer#reward-modeling I see a conflict:
In the official document https://huggingface.co/docs/trl/main/en/reward_trainer, it says the [RewardTrainer] requires an [implicit prompt preference dataset]. However, I see the example is using trl-lib/ultrafeedback_binarized, which is not a so-called "implicit prompt preference dataset", as the prompt is explicitly provided in the dataset.
trl-lib/ultrafeedback_binarized is implicit-prompt, since you don't have a prompt column. You can see that there is a common start ({'content': 'Use the pygame library to write a version of the classic game Snake, with a unique twist', 'role': 'user'}); this is the so-called implicit prompt:
>>> from datasets import load_dataset
>>> dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
>>> dataset.column_names
['chosen', 'rejected', 'score_chosen', 'score_rejected']
>>> dataset[0]
{'chosen': [{'content': 'Use the pygame library to write a version of the classic game Snake, with a unique twist', 'role': 'user'}, {'content': "Sure, I'd be happy to help you write a version of the classic game Snake using the pygame library! ...", 'role': 'assistant'}],
'rejected': [{'content': 'Use the pygame library to write a version of the classic game Snake, with a unique twist', 'role': 'user'}, {'content': 'Sure, here\'s an example of how to write a version of Snake game with a unique twist using the Pygame library:...', 'role': 'assistant'}], 'score_chosen': 6.0, 'score_rejected': 4.0}
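So the prompt is not a separate column; it is the shared leading user turn inside both chosen and rejected. If you ever need an explicit prompt column, a small sketch like the following (my own illustration, continuing from the dataset loaded above and assuming the two conversations share their leading messages) can split it out:

def split_implicit_prompt(example):
    # The chosen and rejected conversations begin with the same message(s);
    # everything before they diverge is the implicit prompt.
    chosen, rejected = example["chosen"], example["rejected"]
    shared = 0
    for c_msg, r_msg in zip(chosen, rejected):
        if c_msg != r_msg:
            break
        shared += 1
    return {
        "prompt": chosen[:shared],
        "chosen": chosen[shared:],
        "rejected": rejected[shared:],
    }

explicit = dataset.map(split_implicit_prompt)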
here i test Anthropic/hh-rlhf and trl-lib/ultrafeedback_binarized in the dataset_name. but neither works.
The provided code works fine on my side:
python ../examples/scripts/reward_modeling.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--output_dir Qwen2-0.5B-Reward-LoRA \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--gradient_checkpointing True \
--learning_rate 1.0e-4 \
--logging_steps 25 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048 \
--use_peft \
--lora_r 32 \
--lora_alpha 16
If the error persists, please provide your full system info (see bug issue template)
The reward trainer data support has been recently updated (#2102). See the latest version of the doc for more info: https://huggingface.co/docs/trl/main/en/reward_trainer
I see, got it. trl-lib/ultrafeedback_binarized and Anthropic/hh-rlhf are in the same boat.
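For reference, a quick check shows that Anthropic/hh-rlhf likewise has only chosen/rejected columns and no prompt column, although its values are plain strings rather than lists of messages:

>>> from datasets import load_dataset
>>> hh = load_dataset("Anthropic/hh-rlhf", split="train")
>>> hh.column_names
['chosen', 'rejected']
>>> type(hh[0]["chosen"])
<class 'str'>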
CUDA_VISIBLE_DEVICES=0 python /home/ubuntu/trl/examples/scripts/reward_modeling.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--output_dir Qwen2-0.5B-Reward-LoRA \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--gradient_checkpointing True \
--learning_rate 1.0e-4 \
--logging_steps 25 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048 \
--use_peft \
--lora_r 16 \
--lora_alpha 16
error:
Traceback (most recent call last):
  File "/home/ubuntu/trl/examples/scripts/reward_modeling.py", line 120, in <module>
    trainer.train()
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/transformers/trainer.py", line 2052, in train
    return inner_training_loop(
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/transformers/trainer.py", line 2345, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/accelerate/data_loader.py", line 550, in __iter__
    current_batch = next(dataloader_iter)
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 673, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/trl/trainer/utils.py", line 362, in __call__
    raise ValueError(
ValueError: The features should include input_ids_chosen, attention_mask_chosen, input_ids_rejected and attention_mask_rejected
  0%|          | 0/7767 [00:00<?, ?it/s]
trl version: 0.11.1
By the way, trl env does not work:
Traceback (most recent call last):
  File "/opt/conda/envs/trl11/bin/trl", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/trl/commands/cli.py", line 38, in main
    raise ValueError(
ValueError: Please use one of the supported commands, got env - supported commands are ['sft', 'dpo', 'chat', 'kto']
python version: 3.11.10
I've downgraded to v0.11.1 and I still can't reproduce the error.
by the way, trl env does not work:
trl env requires trl>=0.12. Can you run transformers-cli env instead?
Can you also confirm that you have not modified the codebase?
- transformers version: 4.45.1
- Platform: Linux-5.15.0-1061-aws-x86_64-with-glibc2.31
- Python version: 3.11.10
- Huggingface_hub version: 0.25.1
- Safetensors version: 0.4.5
- Accelerate version: 0.34.2
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: NVIDIA A10G
@qgallouedec I ran git pull in /home/ubuntu/trl/, so everything is up to date, including examples/scripts/reward_modeling.py. I installed trl via pip install -U trl.
I still can't reproduce. I tried reinstalling everything, but it still works. Can you try the same? Also, try clearing your cache.
python3.11 -m venv env
source env/bin/activate
pip install trl[peft]==0.11.1
curl -O https://raw.githubusercontent.com/huggingface/trl/86ad7a7e85dc65c79bd9759097709a27ad1a58dd/examples/scripts/reward_modeling.py
python reward_modeling.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--output_dir Qwen2-0.5B-Reward-LoRA \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--gradient_checkpointing True \
--learning_rate 1.0e-4 \
--logging_steps 25 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048 \
--use_peft \
--lora_r 32 \
--lora_alpha 16
Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen2-0.5B-Instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/fsx/qgallouedec/trl/tmp/reward_modeling.py:108: UserWarning: You are using a `task_type` that is different than `SEQ_CLS` for PEFT. This will lead to silent bugs Make sure to pass --lora_task_type SEQ_CLS when using this script with PEFT.
warnings.warn(
Filter: 100%|█████████████████████████████████████████████████████████████████| 62135/62135 [00:29<00:00, 2121.63 examples/s]
Filter: 100%|███████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 1926.69 examples/s]
/fsx/qgallouedec/trl/tmp/env/lib/python3.11/site-packages/trl/trainer/reward_trainer.py:199: UserWarning: When using RewardDataCollatorWithPadding, you should set `remove_unused_columns=False` in your RewardConfig we have set it for you, but you should do it yourself in the future.
warnings.warn(
0%| | 0/7750 [00:00<?, ?it/s]You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/fsx/qgallouedec/trl/tmp/env/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2855: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
warnings.warn(
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/fsx/qgallouedec/trl/tmp/env/lib/python3.11/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.7755, 'grad_norm': 3.030179262161255, 'learning_rate': 9.967741935483872e-05, 'epoch': 0.0}
{'loss': 0.71, 'grad_norm': 4.013882160186768, 'learning_rate': 9.935483870967742e-05, 'epoch': 0.01}
1%|▌ | 50/7750 [00:49<2:23:07, 1.12s/it]┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ chosen_text ┃ rejected_text ┃ logits ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ <|im_start|>system │ <|im_start|>system │ [0.5081, 0.4919] │
│ You are a helpful assistant.<|im_end|> │ You are a helpful assistant.<|im_end|> │ │
│ <|im_start|>user │ <|im_start|>user │ │
│ As an HR manager, you want to test a potential │ As an HR manager, you want to test a potential │ │
│ employee's ability to solve puzzles to determine │ employee's ability to solve puzzles to determine │ │
│ their suitability for a job. Write a Python │ their suitability for a job. Write a Python script │ │
│ script that generates a list of questions that │ that generates a list of questions that require │ │
│ require logical reasoning to answer. Your list │ logical reasoning to answer. Your list should │ │
│ should include questions related to mathematical │ include questions related to mathematical puzzles, │ │
│ puzzles, language puzzles, logic puzzles, lateral │ language puzzles, logic puzzles, lateral thinking │ │
│ thinking puzzles, and pattern recognition │ puzzles, and pattern recognition puzzles. Use the │ │
│ puzzles. Use the following code as a starting │ following code as a starting point: │ │
│ point: │ questions = { │ │
│ questions = { │ "Mathematical puzzles": ["If the value of x+y │ │
│ "Mathematical puzzles": ["If the value of x+y │ = 20 and x-y = 10, what is the value of x and y?", │ │
│ = 20 and x-y = 10, what is the value of x and │ "If a pizza has a radius of 8 inches and is cut │ │
│ y?", "If a pizza has a radius of 8 inches and is │ into 6 equal slices, what is the area of each │
...
@qgallouedec reward_modeling.py from your link https://raw.githubusercontent.com/huggingface/trl/86ad7a7e85dc65c79bd9759097709a27ad1a58dd/examples/scripts/reward_modeling.py does work fine, but the script from https://github.com/huggingface/trl/blob/main/examples/scripts/reward_modeling.py does not. I do see there are a lot of differences between them.
The latter is the script for the dev version. You can't use trl 0.11 with it.
OK, then I will wait for the dev version to be released. Thanks. @qgallouedec
Hi, just reopening this ticket. Although trl-lib/ultrafeedback_binarized works fine with reward_modeling.py in trl version 0.11.2, I see that something goes wrong when using the dataset Anthropic/hh-rlhf. This dataset is used as an example in https://huggingface.co/docs/trl/v0.11.2/reward_trainer
error message:
Traceback (most recent call last):
  File "/workspace/trl/examples/scripts/reward_modeling.py", line 140, in <module>
    dataset = dataset.map(
  File "/usr/local/lib/python3.10/dist-packages/datasets/dataset_dict.py", line 866, in map
    {
  File "/usr/local/lib/python3.10/dist-packages/datasets/dataset_dict.py", line 867, in <dictcomp>
    k: dataset.map(
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 560, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3035, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3408, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3300, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/workspace/trl/examples/scripts/reward_modeling.py", line 141, in <lambda>
    lambda x: {"chosen": chosen_fn(x), "rejected": rejected_fn(x)}, num_proc=config.dataset_num_proc
  File "/usr/local/lib/python3.10/dist-packages/trl/extras/dataset_formatting.py", line 43, in format_dataset
    return tokenizer.apply_chat_template(examples[messages_field], tokenize=False)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 1875, in apply_chat_template
    rendered_chat = compiled_template.render(
  File "/usr/local/lib/python3.10/dist-packages/jinja2/environment.py", line 1301, in render
    self.environment.handle_exception()
  File "/usr/local/lib/python3.10/dist-packages/jinja2/environment.py", line 936, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "<template>", line 4, in top-level template code
jinja2.exceptions.UndefinedError: 'str object' has no attribute 'role'
So only chat-format preference datasets like trl-lib/ultrafeedback_binarized are supported in the following versions?
No, the following works fine:
python reward_modeling.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name Anthropic/hh-rlhf \
--output_dir Qwen2-0.5B-Reward-LoRA \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--gradient_checkpointing True \
--learning_rate 1.0e-4 \
--logging_steps 25 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048 \
--use_peft \
--lora_r 32 \
--lora_alpha 16
@qgallouedec did you check out the v0.11-release branch? Did you check out the branch and pip install -e . from the source of that branch, and it works fine?
Indeed in v0.11.2, the example assumes that the dataset is in conversational format.
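If you want to keep using Anthropic/hh-rlhf with the v0.11.2 example, one possible workaround is to convert its plain-text transcripts into the conversational format the script assumes. A rough sketch, assuming the transcripts follow the usual "\n\nHuman: ... \n\nAssistant: ..." layout (this is an illustration, not TRL code):

import re
from datasets import load_dataset

def to_messages(text):
    # Split a hh-rlhf transcript on its "Human:"/"Assistant:" markers and
    # rebuild it as a list of {"role", "content"} messages.
    parts = re.split(r"\n\n(Human|Assistant): ", text)
    messages = []
    for speaker, content in zip(parts[1::2], parts[2::2]):
        role = "user" if speaker == "Human" else "assistant"
        messages.append({"role": role, "content": content.strip()})
    return messages

hh = load_dataset("Anthropic/hh-rlhf", split="train")
hh = hh.map(lambda x: {"chosen": to_messages(x["chosen"]),
                       "rejected": to_messages(x["rejected"])})

After this, apply_chat_template sees message dicts instead of a bare string, which is what the "'str object' has no attribute 'role'" error was complaining about.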
OK, so a plain-text format such as Anthropic/hh-rlhf is not supported anymore.
False. Previously it was not supported; now it is. dev is ahead of v0.11.2.
OK, I will wait for the new release and test it in the near future.
System Info
trl version > v0.11
Information
Tasks
examples folder
Reproduction
In the official document https://huggingface.co/docs/trl/main/en/reward_trainer, it says the [RewardTrainer] requires an [implicit prompt preference dataset]. However, in the code script example https://github.com/huggingface/trl/blob/main/examples/scripts/reward_modeling.py#L18, I see the example is using trl-lib/ultrafeedback_binarized, which is not a so-called "implicit prompt preference dataset", as the prompt is explicitly provided in the dataset. Could you look into this conflict? Thanks.
Expected behavior
Code and documentation alignment.