huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

Feature Request: String-Based Comparison Reward Model for RLOOTrainer #2280

Open HiroshigeAoki opened 1 week ago

HiroshigeAoki commented 1 week ago

Feature request

Add an option to RLOOTrainer that allows string-based reward functions, such as BLEU or Levenshtein distance, to be used for scoring model outputs.

Motivation

Currently, the reward_model in RLOOTrainer accepts only tensor inputs, which makes it hard to use string-based metrics as the reward. Supporting string comparison metrics would let users compute rewards from decoded text with a broader range of string similarity measures.
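To make the request concrete, here is a minimal sketch of what a string-based reward could look like, using normalized Levenshtein similarity. The names `levenshtein` and `string_rewards` are hypothetical and not part of TRL; the sketch only assumes the trainer could hand decoded completions and reference strings to a plain Python callable.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def string_rewards(completions: List[str], references: List[str]) -> List[float]:
    """Hypothetical reward: 1.0 for an exact match, lower as more edits are needed."""
    if len(completions) != len(references):
        raise ValueError("completions and references must have the same length")
    rewards = []
    for completion, reference in zip(completions, references):
        longest = max(len(completion), len(reference))
        if longest == 0:
            rewards.append(1.0)  # two empty strings are treated as identical
        else:
            rewards.append(1.0 - levenshtein(completion, reference) / longest)
    return rewards
```

Inside a trainer, the generated token IDs would presumably need to be decoded to text first (for example with the tokenizer's `batch_decode`), and the returned list converted to a tensor before the advantage computation. How that would fit into RLOOTrainer is an open design question, not something this sketch settles.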

Your contribution

I am open to collaborating with the community to implement this feature!

qgallouedec commented 1 day ago

Is the use of this type of procedure common in the community/literature? Do you have any reference results?