Open TolearnMo opened 2 weeks ago
model = AutoModelForSequenceClassification.from_pretrained(
    model_config.model_name_or_path,
    num_labels=1,
    trust_remote_code=model_config.trust_remote_code,
    **model_kwargs,
)
I trained a reward model based on this script, but the output logits have only one element, which does not seem directly usable for subsequent PPO training.
num_labels is the dimensionality of the output. Here you only need a one-dimensional output, i.e. a single scalar score per input sequence. Unless I am misunderstanding your question?
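To make the point about num_labels concrete, here is a minimal sketch. It assumes a tiny, randomly initialized BERT config with made-up sizes, built locally so nothing is downloaded; it is not the model or checkpoint from the script in question. It only shows that num_labels=1 gives one logit per sequence, which can be squeezed into a scalar reward per sample.

```python
import torch
from transformers import AutoModelForSequenceClassification, BertConfig

# Hypothetical tiny config for illustration only (not the model from the issue).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=1,  # one output per sequence, as in the reward model script
)

# Build from the config (random weights), so no pretrained checkpoint is needed.
model = AutoModelForSequenceClassification.from_config(config)
model.eval()

# A fake batch: 3 sequences of 8 random token ids each.
input_ids = torch.randint(0, config.vocab_size, (3, 8))

with torch.no_grad():
    logits = model(input_ids=input_ids).logits  # shape: (batch_size, num_labels)

# Drop the trailing dimension to get one scalar score per sequence.
rewards = logits.squeeze(-1)  # shape: (batch_size,)

print(logits.shape, rewards.shape)
```

So the single-element logits are the expected shape for a reward model: each sequence gets exactly one score. How a particular PPO implementation consumes these scores is outside this sketch.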