microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI
MIT License

The update method in the UCB algorithm is inconsistent between the paper and the code #180

Open kerala21 opened 6 months ago

kerala21 commented 6 months ago

In the paper's UCB algorithm, Q(p) for each prompt is updated to Q(p) + r/N(p):

![Uploading 2024331203750.jpg…]()

The update code in the project is as follows:

def update(self, chosen, scores):
    for i, score in zip(chosen, scores):
        self.counts[i] += self.num_samples
        self.scores[i] += score * self.num_samples

The two do not match.

donglixp commented 4 months ago

The jpg file is unavailable.

hideaki-j commented 1 month ago

I was also a bit confused by that part. As I understand it, Q + r/N in the paper seems to be a typo; it should be Q + (r - Q)/N. To keep Q as the estimated (average) score, each update has to move Q toward the observed reward r by 1/N of the difference between r and the current estimate Q.

If so, Q + (r - Q)/N can be rewritten as:

((N - 1)Q + r)/N

Since Q is the average of the previous N - 1 rewards, (N - 1)Q is their sum, so this expression is the average of all N rewards obtained so far.

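In the code, self.scores[i] stores the running sum of all scores (rewards) so far, with each score weighted by num_samples, and self.counts[i] stores the matching sample count. The sum is then divided by counts (to calculate the average) in get_scores() when calculating ucb_scores.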
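To check the argument numerically, here is a minimal sketch, not the repository's code. The function names and the sample rewards are hypothetical, and it assumes a constant num_samples per update. It compares the sum-and-count bookkeeping used in update() with the incremental rule Q + (r - Q)/N, and with the literal Q + r/N from the paper.

```python
import math


def sum_count_mean(rewards, num_samples=1):
    """Mimic the repo-style bookkeeping: keep a weighted sum and a count,
    and divide only when the average is needed (hypothetical helper)."""
    total = 0.0
    count = 0
    for r in rewards:
        count += num_samples
        total += r * num_samples
    return total / count


def incremental_mean(rewards):
    """Incremental update Q <- Q + (r - Q) / N."""
    q = 0.0
    n = 0
    for r in rewards:
        n += 1
        q += (r - q) / n
    return q


def literal_paper_update(rewards):
    """Update exactly as printed in the paper: Q <- Q + r / N."""
    q = 0.0
    n = 0
    for r in rewards:
        n += 1
        q += r / n
    return q


rewards = [0.2, 0.8, 0.5, 0.9]  # made-up rewards for illustration
print(sum_count_mean(rewards, num_samples=4))
print(incremental_mean(rewards))
print(literal_paper_update(rewards))
```

With a constant num_samples, the weight cancels in the division, so the first two functions agree on the plain average of the rewards, while the literal Q + r/N rule produces a different value.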