nebuly-ai / optimate

A collection of libraries to optimise AI model performances
https://www.nebuly.com/
Apache License 2.0

[Chatllama] Use upvotes in Stanford dataset as a measure for reward #224

Open diegofiori opened 1 year ago

diegofiori commented 1 year ago

Description

Currently we support the following datasets:

However, we are not using all the information contained in these datasets:

The number of upvotes, for instance, could be used as a label for the reward model (after some normalisation) to judge the quality of an answer without asking a model or a human to provide feedback.

Moreover, these datasets for training the reward models must be artificially augmented with high-quality negative examples, so that the reward model learns not only what is good but also what is not.
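
One simple way to build such negatives, sketched below under the assumption that each post comes with a prompt and a list of upvoted answers (the data layout and function names are illustrative, not the project's actual code), is to pair each prompt with a highly upvoted answer taken from a *different* post and label it with a low reward:

```python
import random

def augment_with_negatives(posts, seed=0):
    """Create synthetic negative examples by mismatching prompts and answers.

    `posts` is assumed to be a list of dicts like
    {"prompt": str, "answers": [{"text": str, "upvotes": int}, ...]}.
    Each prompt gets the best answer of a *different* post, labelled with
    reward 0, so the reward model also sees what a bad answer looks like.
    """
    rng = random.Random(seed)
    augmented = []
    if len(posts) < 2:
        return augmented
    for i, post in enumerate(posts):
        # pick a random other post and reuse its best answer as a negative
        j = rng.choice([k for k in range(len(posts)) if k != i])
        wrong_answer = max(posts[j]["answers"], key=lambda a: a["upvotes"])
        augmented.append(
            {"prompt": post["prompt"], "answer": wrong_answer["text"], "reward": 0.0}
        )
    return augmented
```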

Eventually, a more thorough search for potentially useful datasets should be carried out, to make sure we support all the open-source datasets that can be relevant for the project.

TODO

MattiaSangermano commented 1 year ago

Hey, I would like to contribute to this task. Do you already have any ideas on how to convert upvotes into rewards, or are you completely open to suggestions? I looked at the dataset a bit, and so far I have only come up with some simple ideas, but nothing too satisfying.

diegofiori commented 1 year ago

Hi @MattiaSangermano, thank you very much for reaching out! Feel free to propose any idea on the issue. I haven't thought about it yet, but I'm happy to contribute to the brainstorming 😄

MattiaSangermano commented 1 year ago

Thank you @diegofiori, the simplest idea that comes to mind is to scale the upvote values between 0 and 5. I would perform the scaling by normalizing the upvotes of a response, taking into account the "activity" of the individual post to which the response belongs. In this case, I would take the response with the most upvotes as an indicator of the post's activity (max_upvote). Moreover, to ensure that other responses do not receive an unfairly low reward due to an excess of upvotes received by the winning response, I would clip max_upvote. One way to perform the clipping would be the IQR technique; in this case too, I would compute the quantiles with respect to the post_id (i.e., per post). A sketch of the reward function would be:

$$reward_p^i = \frac{score_p^i}{\min(max\_upvote_p, IQR_p)} \times 5$$

where $p$ is the post index, $i$ is the response index within a post, and $IQR_p$ is the upper whisker computed from the upvote scores of the answers to post $p$.
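
A minimal sketch of this scaling in Python (illustrative only, with the upper whisker Q3 + 1.5 * (Q3 - Q1) used as the clipping value; nothing here is the project's actual API):

```python
import numpy as np

def post_reward(upvotes):
    """Rewards in [0, 5] for the answers of a single post.

    `upvotes` holds the upvote counts of all answers to one post.
    The denominator is the post's max upvote, clipped at the upper
    whisker Q3 + 1.5 * (Q3 - Q1), so that a single outlier answer does
    not crush the rewards of every other answer.
    """
    upvotes = np.asarray(upvotes, dtype=float)
    q1, q3 = np.percentile(upvotes, [25, 75])
    upper_whisker = q3 + 1.5 * (q3 - q1)
    # guard against posts where every answer has zero upvotes
    denom = max(min(upvotes.max(), upper_whisker), 1.0)
    return upvotes / denom * 5.0

# example: one very popular answer plus a few ordinary ones
print(post_reward([120, 10, 8, 5, 1]))
```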

Please let me know if I wasn't clear in my explanation.

diegofiori commented 1 year ago

I actually like the idea of using upvotes quantiles to compute the reward. I just have a couple of questions about your reward function.

  1. Upvote scores are all positive, so IQR_p < max_upvote_p should always be true, shouldn't it? But in this way we would also have many rewards > 5 (>25% of the rewards).
  2. I'd also take into account the relative difference between score A and score B when computing the reward.
  3. I'd probably propose as reward function a sum of boolean values, e.g. score^i_p > Q3_p or score B > score A (if we are computing the reward for B), capped at 5. WDYT?

MattiaSangermano commented 1 year ago

> Upvote scores are all positive, so IQR_p < max_upvote_p should always be true, shouldn't it?

Not necessarily, but I think using IQR to refer to the upper whisker was misleading. The upper whisker is Q3_p + 1.5 * (Q3_p - Q1_p), therefore if the value of max_upvote_p is far above Q3_p, the inequality is false. It is similar to what happens when you draw a boxplot, where some points can fall outside the whiskers.

> But in this way we would also have many rewards > 5 (>25% of the rewards)

Yes, you are right, we should also threshold the score: min(score^i_p, IQR_p).
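
In the sketch above this would mean also clipping the numerator at the upper whisker before scaling (again just an illustrative snippet, not final code):

```python
import numpy as np

def post_reward_clipped(upvotes):
    """Same as the earlier sketch, but each score is also clipped at the
    upper whisker, so no reward can exceed 5."""
    upvotes = np.asarray(upvotes, dtype=float)
    q1, q3 = np.percentile(upvotes, [25, 75])
    upper_whisker = q3 + 1.5 * (q3 - q1)
    denom = max(min(upvotes.max(), upper_whisker), 1.0)
    return np.minimum(upvotes, upper_whisker) / denom * 5.0
```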

> I'd also take into account the relative difference between score A and score B when computing the reward

I am afraid that in this way we would create an inconsistent dataset. That is, we might have different answer pairs where the reward of one answer is lower than another's just because, in the original dataset, it was paired with an answer with many upvotes. The Stanford dataset was constructed in a way that pairs the same response multiple times, leading to multiple rewards for that response. How can we combine these rewards?

> I'd probably propose as reward function a sum of boolean values, e.g. score^i_p > Q3_p or score B > score A (if we are computing the reward for B), capped at 5. WDYT?

I don't know if I understood correctly: you would like to create 5 or more rules, where the reward of an answer basically becomes the number of rules it passes, right? If so, it sounds very interesting.
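
If that reading is right, a minimal sketch of such a rule-count reward could look like the following (the individual rules are placeholders, not ones we have agreed on):

```python
import numpy as np

def rule_based_reward(score, scores_of_post, paired_score=None):
    """Reward = number of boolean rules the answer passes, capped at 5.

    `score` is the answer's upvotes, `scores_of_post` the upvotes of all
    answers to the same post, and `paired_score` the upvotes of the answer
    it is compared against in the preference pair (if any).
    """
    scores_of_post = np.asarray(scores_of_post, dtype=float)
    q2, q3 = np.percentile(scores_of_post, [50, 75])
    rules = [
        score > 0,                               # got at least one upvote
        score > q2,                              # above the post's median
        score > q3,                              # in the post's top quartile
        score == scores_of_post.max(),           # best answer of the post
        paired_score is not None and score > paired_score,  # beats its pair
    ]
    return min(sum(bool(r) for r in rules), 5)
```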

MattiaSangermano commented 1 year ago

@diegofiori any update?

diegofiori commented 1 year ago

Hi @MattiaSangermano, I see your point. I'm actually pretty curious to take a look at an implementation of the metric you proposed. Theoretically speaking, it makes sense to me. I'm curious to see some examples from the dataset with the related computed score; I think this is the only way to effectively validate the metric.

MattiaSangermano commented 1 year ago

Perfect, I will work on it over the next few days and open a PR as soon as possible.