Yes, we use the raw logits; we found that this usually works better than e.g. softmax-normalized outputs. And you get one reward per sequence from the classifier model, so there is no need to aggregate them. In theory you could pass a reward per token, but this is not implemented at the moment.
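For reference, here is a minimal sketch (not taken verbatim from the example script) of how one raw-logit reward per sequence can be obtained from a sentiment classification pipeline; the sample texts are placeholders:

```python
import torch
from transformers import pipeline

# Classifier used in the sentiment example; any text-classification model works here.
sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb")

texts = ["this movie was really good!!", "this movie was really bad!!"]

# function_to_apply="none" keeps the raw logits (no softmax normalization);
# return_all_scores=True returns one score entry per class for each input text.
pipe_outputs = sentiment_pipe(texts, return_all_scores=True, function_to_apply="none")

# One scalar reward per sequence: the raw logit of the positive class (index 1).
rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]
print(rewards)  # unbounded in principle, typically a few units in magnitude
```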
Thank you very much. I have another question: the model I want to train with PPO is an AutoModelForSeq2SeqLM. Referring to here (https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt2-sentiment.py#L104), I changed it to:
```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(config.model_name, trust_remote_code=True)
ref_model = AutoModelForSeq2SeqLM.from_pretrained(config.model_name, trust_remote_code=True)
```
The model can be loaded successfully, but when running to line 113 (https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt2-sentiment.py#L113), the following error is reported:
```
ValueError: model must be a PreTrainedModelWrapper, got <class 'transformers_modules.local.modeling_glm.GLMForConditionalGeneration'> - supported architectures are: (<class 'trl.models.modeling_value_head.AutoModelForCausalLMWithValueHead'>, <class 'trl.models.modeling_value_head.AutoModelForSeq2SeqLMWithValueHead'>)
```
When I use:
```python
from trl import AutoModelForSeq2SeqLMWithValueHead

model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(config.model_name, trust_remote_code=True)
```
I get the following error:
File "/root/anaconda3/envs/RLHF/lib/python3.8/site-packages/trl/models/modeling_base.py", line 81, in from_pretrained pretrained_model = cls.transformers_parent_class.from_pretrained( File "/root/anaconda3/envs/RLHF/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 434, in from_pretrained config, kwargs = AutoConfig.from_pretrained( File "/root/anaconda3/envs/RLHF/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 855, in from_pretrained raise ValueError( ValueError: Loading /search/PPO/glm_0.3 requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option trust_remote_code=True to remove this error.
How can I adapt the current trl framework so that I can use PPO to train my AutoModelForSeq2SeqLM model? Thank you very much, looking forward to your answer.
@younesbelkada this might be an issue of a kwarg not being passed along to the original model class, wdyt?
Hello @Mryangkaitong,
Thanks for the issue!
Yes, this is possible; the script below works on the `main` branch of `trl`:

```python
from trl import AutoModelForSeq2SeqLMWithValueHead

model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained("t5-small", trust_remote_code=True)
```
I believe you need to use the latest changes of the library by installing `trl` from source:

```bash
pip install git+https://github.com/lvwerra/trl.git
```
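After installing from source, a rough sketch of how the value-head model could then be plugged into the PPO trainer (assuming a setup similar to the sentiment example script; `"t5-small"` is a placeholder for your own checkpoint, and the dataset/collator arguments are omitted):

```python
from transformers import AutoTokenizer
from trl import AutoModelForSeq2SeqLMWithValueHead, PPOConfig, PPOTrainer

# Placeholder model name; replace with your own seq2seq checkpoint.
config = PPOConfig(model_name="t5-small", learning_rate=1.41e-5)

model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(config.model_name, trust_remote_code=True)
ref_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(config.model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)

# Dataset and data collator omitted for brevity; see the sentiment example script.
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)
```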
I believe this was fixed in https://github.com/lvwerra/trl/pull/147.
Let us know if the problem still persists!
OK, I have solved it.
Thank you very much for such great work. When I run gpt2-sentiment.py (https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt2-sentiment.py#L151), I have a question I would like to ask:
```python
rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]
```
What is the value range of the reward score here? (Printing it, I see values greater than 1, greater than 2, etc.) Is it taken directly from the logits, and can rewards be in any range? For each input sentence, how is its reward obtained? Is it the sum of the logits of each token in the sentence?