Feature Request: String-Based Comparison Reward model for RLOOTrainer

Feature request

Add an option to RLOOTrainer for string-based rewards, computed with metrics such as BLEU or Levenshtein distance, when evaluating model outputs.
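To make the request concrete, here is a minimal sketch of what a string-based reward could look like. It uses a normalized Levenshtein distance. The names `levenshtein` and `string_reward` are hypothetical and are not part of TRL; this only illustrates the kind of function the trainer would need to accept.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def string_reward(completions: List[str], references: List[str]) -> List[float]:
    """Hypothetical reward: 1.0 for an exact match, falling toward 0.0 as edits grow."""
    rewards = []
    for completion, reference in zip(completions, references):
        longest = max(len(completion), len(reference))
        if longest == 0:
            rewards.append(1.0)  # two empty strings are identical
            continue
        rewards.append(1.0 - levenshtein(completion, reference) / longest)
    return rewards
```

Dividing by the longer string's length keeps the reward in [0, 1], which I assume is easier to combine with other reward terms than a raw edit count.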

Motivation

Currently, the reward_model in RLOOTrainer accepts only tensor inputs, which prevents using string-based metrics as the reward. Supporting string comparison metrics would let users apply a broader range of string similarity measures.
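One possible way to bridge the gap is to decode the generated token ids back to text before scoring. The sketch below assumes that approach. `score_token_batches`, `decode`, and `metric` are hypothetical names, not existing TRL parameters; in practice `decode` might be a tokenizer's `decode` method.

```python
from typing import Callable, List, Sequence


def score_token_batches(
    completion_ids: Sequence[Sequence[int]],
    reference_texts: Sequence[str],
    decode: Callable[[Sequence[int]], str],
    metric: Callable[[str, str], float],
) -> List[float]:
    """Decode each completion to text, then score it against its reference string."""
    if len(completion_ids) != len(reference_texts):
        raise ValueError("completion_ids and reference_texts must have the same length")
    return [
        metric(decode(ids), reference)
        for ids, reference in zip(completion_ids, reference_texts)
    ]
```

The returned list of floats could then be converted to a tensor by the trainer, so the rest of the training loop would not need to know whether the reward came from a model or from a string metric.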

Your contribution

I am open to collaborating with the community to implement this feature!

huggingface / trl

Feature Request: String-Based Comparison Reward model for RLOOTrainer #2280
