huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

Doubts about reward #173

Closed Mryangkaitong closed 1 year ago

Mryangkaitong commented 1 year ago

Thank you very much for such great work. When I run gpt2-sentiment.py (https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt2-sentiment.py#L151), I have a question about this line:

rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]

What is the value range of the reward score here? (Printing it, I see values greater than 1, greater than 2, etc.) Is it taken directly from the logits, and can the rewards be in any range? For each input sentence, how is its reward obtained? Is it the sum of the logits of each token in the sentence?

lvwerra commented 1 year ago

Yes, we use the raw logits; we found that this usually works better than e.g. softmax-normalized outputs. And you get one reward per sequence from the classifier model, so there is no need to aggregate anything. In theory you could pass a reward per token, but this is not implemented at the moment.
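
For reference, a minimal sketch of how one raw-logit reward per sequence is obtained with the transformers text-classification pipeline, following the setup of gpt2-sentiment.py (the example texts are placeholders; function_to_apply="none" keeps the unnormalized logits, which is why values above 1 or 2 appear):

import torch
from transformers import pipeline

# function_to_apply="none" skips the softmax, so scores are raw, unbounded logits
sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb")
sent_kwargs = {"return_all_scores": True, "function_to_apply": "none"}

texts = ["this movie was really good!!", "this movie was terrible"]
pipe_outputs = sentiment_pipe(texts, **sent_kwargs)

# one scalar reward per sequence: the raw logit of the POSITIVE class (index 1)
rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]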

Mryangkaitong commented 1 year ago

Thank you very much. I have another question: the model I want to train with PPO is an AutoModelForSeq2SeqLM. Referring to here (https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt2-sentiment.py#L104), I changed it to

from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(config.model_name, trust_remote_code=True)
ref_model = AutoModelForSeq2SeqLM.from_pretrained(config.model_name, trust_remote_code=True)

The model loads successfully.

But when running to line 113 (https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt2-sentiment.py#L113), the following error is reported:

ValueError: model must be a PreTrainedModelWrapper, got <class 'transformers_modules.local.modeling_glm.GLMForConditionalGeneration'> - supported architectures are: (<class 'trl.models.modeling_value_head.AutoModelForCausalLMWithValueHead'>, <class 'trl.models.modeling_value_head.AutoModelForSeq2SeqLMWithValueHead'>)

When I instead use

from trl import AutoModelForSeq2SeqLMWithValueHead
model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(config.model_name, trust_remote_code=True)

I get the following error:

File "/root/anaconda3/envs/RLHF/lib/python3.8/site-packages/trl/models/modeling_base.py", line 81, in from_pretrained pretrained_model = cls.transformers_parent_class.from_pretrained( File "/root/anaconda3/envs/RLHF/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 434, in from_pretrained config, kwargs = AutoConfig.from_pretrained( File "/root/anaconda3/envs/RLHF/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 855, in from_pretrained raise ValueError( ValueError: Loading /search/PPO/glm_0.3 requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option trust_remote_code=True to remove this error.

How can I adapt this within the current trl framework so that I can use PPO to train my AutoModelForSeq2SeqLM model? Thank you very much, looking forward to your answer.

lvwerra commented 1 year ago

@younesbelkada this might be an issue of a kwarg not being passed along to the original model class, wdyt?

younesbelkada commented 1 year ago

Hello @Mryangkaitong, thanks for the issue! Yes, this is possible; the script below works on the main branch of trl:

from trl import AutoModelForSeq2SeqLMWithValueHead
model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained("t5-small", trust_remote_code=True)

I believe you need the latest changes of the library, which you can get by installing trl from source:

pip install git+https://github.com/lvwerra/trl.git

I believe this was fixed in https://github.com/lvwerra/trl/pull/147.

Let us know if the problem still persists!
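
As a follow-up for anyone training a seq2seq model with PPO, here is a minimal sketch of wiring the value-head model into PPOTrainer, modelled on the sentiment example; t5-small, batch_size=1, the query string, and the constant reward of 1.0 are placeholders rather than anything from this thread, and the exact PPOConfig/PPOTrainer arguments may differ across trl versions:

import torch
from transformers import AutoTokenizer
from trl import AutoModelForSeq2SeqLMWithValueHead, PPOConfig, PPOTrainer

model_name = "t5-small"  # placeholder; swap in your own seq2seq checkpoint

model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# batch_size=1 keeps the example tiny; real runs use larger batches
config = PPOConfig(model_name=model_name, batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# one query/response pair and a scalar reward, mirroring the sentiment script
query = tokenizer("translate English to German: Hello", return_tensors="pt").input_ids.squeeze(0)
response = ppo_trainer.generate(query, max_new_tokens=16).squeeze(0)
reward = torch.tensor(1.0)  # placeholder; in practice e.g. a classifier's raw logit

stats = ppo_trainer.step([query], [response], [reward])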

Mryangkaitong commented 1 year ago

OK, I have solved it.