huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

Why is num_labels=1 in the reward_modeling.py script? #1993

Open TolearnMo opened 2 weeks ago

TolearnMo commented 2 weeks ago

System Info

..

Reproduction

model = AutoModelForSequenceClassification.from_pretrained(
    model_config.model_name_or_path,
    num_labels=1,
    trust_remote_code=model_config.trust_remote_code,
    **model_kwargs,
)

I trained a reward model with this script, but the output logits contain only one element per input, and I don't see how to use that for subsequent PPO training.

RylanSchaeffer commented 2 weeks ago

num_labels sets the dimensionality of the model's output. A reward model only needs a 1-dimensional output: a single scalar score per input sequence. Or am I misunderstanding your question?
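
To make the point about the 1-dimensional output concrete, here is a minimal sketch in plain Python. It assumes the reward model is trained with the usual pairwise preference loss, -log(sigmoid(r_chosen - r_rejected)), on chosen/rejected pairs. The logit values and the helper names (scalar_rewards, pairwise_reward_loss) are hypothetical and are not taken from the script. With num_labels=1 the logits have shape (batch_size, 1), so each row yields exactly one scalar reward.

```python
import math

def scalar_rewards(logits):
    """Turn logits of shape (batch_size, 1) into one scalar reward per sequence."""
    return [row[0] for row in logits]

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).

    Uses the identity -log(sigmoid(x)) = log(1 + exp(-x)), written in a
    numerically stable form for both signs of x.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical logits, shaped like the output of a model built with num_labels=1.
chosen_logits = [[1.2], [0.3]]
rejected_logits = [[-0.4], [0.9]]

chosen = scalar_rewards(chosen_logits)
rejected = scalar_rewards(rejected_logits)

# Mean loss over the batch: small when chosen responses score higher than rejected ones.
losses = [pairwise_reward_loss(c, r) for c, r in zip(chosen, rejected)]
mean_loss = sum(losses) / len(losses)
print(mean_loss)
```

The loss only ever compares two scalars, so a single output unit is sufficient. Under the same assumption, a PPO loop would read that one value per generated response as its reward signal.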