glandfried / TrueSkillThroughTime.py

The TrueSkill Through Time Python Package
https://trueskillthroughtime.readthedocs.io/en/latest/

Convergence rescales the distribution off of a baseline #5

Open jake-smart opened 1 year ago

jake-smart commented 1 year ago

When initializing with priors, convergence rescales the average rating far away from the baseline priors.

import trueskillthroughtime as ttt

# composition: a list of games, each game a list of teams, each team a list of player names
priors = {player: ttt.Player(ttt.Gaussian(mu=25.0, sigma=8.3333))
          for matchup in composition for team in matchup for player in team}

h = ttt.History(
    composition=composition,
    priors=priors,
)

# Pre-convergence values: last posterior mean (mu) of each player's learning curve
latest_rating_values = [rating_array[-1][1].mu for rating_array in h.learning_curves().values()]
avg_rating_value = sum(latest_rating_values) / len(latest_rating_values)

h.convergence()

# Same summary after the forward-backward passes run by convergence()
converged_latest_rating_values = [rating_array[-1][1].mu for rating_array in h.learning_curves().values()]
avg_converged_value = sum(converged_latest_rating_values) / len(converged_latest_rating_values)
print(f"{avg_rating_value} -> {avg_converged_value}")

This results in: 23.776132491952705 -> 5.261152040525114

Is there a good reason for this?

glandfried commented 1 year ago

Hi!,

First, it is good to remember that the absolute value of the estimates has no meaning by itself. What really matters is the skill difference between individuals: this difference determines the probability of winning, regardless of the absolute values.
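
To illustrate, here is a minimal sketch (not this package's API; just the standard Gaussian win-probability formula with an illustrative beta) showing that shifting both means by the same constant leaves the predicted probability unchanged:

import math

def win_probability(mu1, sigma1, mu2, sigma2, beta=1.0):
    # Standard TrueSkill-style formula: Phi of the scaled mean difference.
    # It depends only on mu1 - mu2, not on the absolute values.
    denom = math.sqrt(2 * beta**2 + sigma1**2 + sigma2**2)
    return 0.5 * (1 + math.erf((mu1 - mu2) / (denom * math.sqrt(2))))

print(win_probability(25.0, 1.0, 23.0, 1.0))  # ~0.84
print(win_probability(5.0, 1.0, 3.0, 1.0))    # same ~0.84: both means shifted by -20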

Second, you are using the mean value to report the summary, which is sensitive to extreme values. Many reasonable things may be going on there; we cannot tell just by looking at the mean. Perhaps your dataset has a large pool of low-skill players who, once converged, fall far below the prior mean. We don't know without more information.
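
As a quick check (just a sketch reusing the h object from your snippet), you could compare the mean with a more robust summary such as the median of the converged estimates:

import statistics

mus = [rating_array[-1][1].mu for rating_array in h.learning_curves().values()]
print("mean:", statistics.mean(mus), "median:", statistics.median(mus))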

Third, look at the estimates we get using the history of the ATP (download paper). Our prior mean is mu=0, our prior standard deviation is sigma=1.6, and our prior dynamic uncertainty is gamma=0.036. If you look at the estimated skills of the most famous players (Figure 6b), you will see that the absolute values are around 5, about 3.5 standard deviations away from the prior mean. In your case, where you are using a prior standard deviation of 8.33, the difference between avg_rating_value and avg_converged_value is a little more than 2 standard deviations.
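
For reference, those ATP-style priors would be set roughly like this (a sketch; it assumes the History constructor's mu, sigma and gamma keyword arguments for the default prior, as in the package documentation):

# Default prior N(0, 1.6) and dynamic uncertainty gamma=0.036, as in the ATP analysis
h_atp = ttt.History(composition=composition, mu=0.0, sigma=1.6, gamma=0.036)
h_atp.convergence()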

In summary, maybe there is a good reason for this, but I would need more details to give you an accurate answer.

You can perform a sanity check by computing the geometric mean of the leave-one-out predictions: math.exp(h.log_evidence()/h.size). This value should be above 0.5 (see Table 1).
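
In code, using the h object from your snippet, that check is just:

import math

# Geometric mean of the prior (leave-one-out) predictions over all observed games
geo_mean = math.exp(h.log_evidence() / h.size)
print(geo_mean)  # for win/lose games, values above ~0.5 beat a coin-flip baseline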

The paper again: (download paper).

jake-smart commented 1 year ago

Thank you for the prompt reply!

I agree that the absolute value of the estimates is arbitrary and that internal comparisons are what matter. What is notable to me is that all values are rescaled downward from 25; there are no outliers that remain above the original distribution's initial centroid.

For every actor in the model:


Before:28.74575140707625 After:14.245250129905385
Before:18.25477928241498 After:9.230781524773896
Before:27.96226703024733 After:13.740740281474809
Before:21.552413998418448 After:11.563445279595124
Before:23.65400948219726 After:20.93170542530581
Before:29.685434392344057 After:15.325338774240171
Before:13.710841481290371 After:8.242311080821315
Before:18.214588787231257 After:9.448924341899158
Before:27.446405774082844 After:12.300559042439474
Before:15.976578186284092 After:10.36199698339829
Before:21.234840631019868 After:14.643414727968421
Before:13.678672044819482 After:5.485871740216972
Before:38.84394719246628 After:20.423970768707527
Before:4.655008329577511 After:4.409999534504543
Before:19.170669299548813 After:14.570578981859658
Before:28.938249450217494 After:15.67572371341015
Before:14.361425556252467 After:4.106513540126579
Before:26.706706440425815 After:13.681077665208782
Before:29.985600427125284 After:15.527002537607865
Before:14.667462560830481 After:9.11469516114774
Before:36.08642905283827 After:16.4778038340357
Before:28.187244775180275 After:15.692183401008398
Before:28.416637679525195 After:16.9733604628402

Convergence pulls them all downward.

Also, regarding evidence: based on the paper, only certain datasets resulted in a geometric mean of evidence above 0.5, and it looked to me like football was below 0.4. Is there more to using that as a baseline for a sufficiently tuned model, or does it just reflect the results from the paper's datasets?

glandfried commented 1 year ago

Hi!,

That's really interesting. I would like to see the data set. One question I would ask myself is what would happen if we performed an online convergence, that is, adding one data point (or a whole batch) at a time and running convergence (or some iterations) after each addition. This function is not yet implemented in the Python package; it is only implemented in the Julia package. I think it is relevant enough to add it.
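
In the meantime, a rough hand-rolled sketch of that idea (not a built-in online mode; it simply rebuilds the History with one extra game at a time, which is expensive but illustrates the procedure) could look like:

partial = []
for game in composition:
    partial.append(game)
    h_online = ttt.History(composition=partial, priors=priors)
    h_online.convergence()  # or only a few iterations per step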

Regarding the evidence, yes! When there are more than 2 possible outcomes (the possibility of a tie), the evidence will be lower than 0.5, and that's OK. Excuse me, I didn't explain that point well. Evidence behaves differently from most ad-hoc cost functions that are commonly used to evaluate models. Because the evidence is the product of prior predictions (rather than a sum or an arithmetic average), a single zero in the sequence makes the evidence zero forever (as in Popper's intuition). In the paper, we use the geometric mean because it is more intuitive: the overall evidence would be the same if we replaced each prediction in the product by the geometric mean. It is natural that, if we are predicting the outcomes of games between individuals with the same skill, the typical prediction is close to 0.5.
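
A toy illustration with hypothetical per-game prior predictions (not real output):

import math

predictions = [0.55, 0.48, 0.62, 0.51]  # hypothetical prior predictions, one per game
evidence = math.prod(predictions)       # a single 0.0 here would zero the whole product
geometric_mean = evidence ** (1 / len(predictions))
print(evidence, geometric_mean)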

I remain attentive. Kind regards.