jerkeeler / aoestats-redux-community

aoestats project planning and issue reporting
https://aoestats.io

Feature Request - Modelling #9

Open gowerc opened 1 year ago

gowerc commented 1 year ago

Heya,

I'm the author/developer of ageofstatistics.com. I no longer have the time to maintain the site (especially after the changes / lack of stability with aoe2.net). I was wondering if you would be open to porting some of its features across to your site, now that you are back developing again? :)

Happy to discuss more, but the main one I would want to stress is the use of logistic regression modelling to account for Elo + other covariates.

An unfortunate fact about win rates is that if you don't include explanatory variables (i.e. you just calculate naive win rates), they are biased towards 50%. Since Elo is such an influential part of match outcomes, this means most of the win rates you present will be pulled towards 50% and thus underestimated (albeit the relative ordering / ranking should be preserved).
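To make that concrete, here is a tiny self-contained simulation sketch (all numbers are invented for illustration, not taken from real match data): a civ with a genuine edge looks closer to 50% if you ignore the Elo difference, while a logistic regression that includes it recovers the effect.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import glm

rng = np.random.default_rng(0)
n = 200_000
d = rng.normal(0, 150, n)                        # Elo difference vs. opponent (invented spread)
p = 1 / (1 + np.exp(-(0.12 + 0.008 * d)))        # true win prob: +0.12 log-odds civ edge (~53% at d=0)
df = pd.DataFrame({"w": rng.binomial(1, p), "d": d})

print("naive win rate:   ", df["w"].mean())       # pulled towards 0.50
fit = glm("w ~ d", df, family=sm.families.Binomial()).fit()
print("adjusted at d = 0:", fit.predict(pd.DataFrame({"d": [0.0]}))[0])
# Note: the size of the gap depends on the spread of d; with very tight matchmaking
# the naive and adjusted numbers end up close together.
```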

But yeah, if this is of interest to you (plus any other features from my site that you would like to incorporate) I'd be more than happy to chat. If not, no hard feelings; please feel free to ignore and close this issue :)

jerkeeler commented 1 year ago

Thanks for reaching out @gowerc! First, I LOVE your site. Thank you so much for your work on it.

I would love to incorporate some of your ideas into my site. I agree that your logistic regression model is a much better model than my naive win rates. I really appreciate your methods section. I'm going to look to incorporate your model into my win rate calculations, if that's ok with you? Are there any other confounding variables that you think would be appropriate to incorporate?

I'm also curious if you think the "Averaged Win Rates" as you put it on your website would be more appropriate? I guess it's a bit up to interpretation. My gut says the logistic regression model suffices.

I would also love to add some of those graphs that include all of the civs on one plot. I've been planning something similar for a bit so I will definitely use your site as inspiration for some more graphs.

Want me to give you credit somewhere on the site? Want a social link posted in the FAQ or footer or the like?

jerkeeler commented 1 year ago

Thinking about it a bit more, I could do this same thing and add in map as a parameter to get more accurate map win rate results. 🤔 That seems pretty cool.

jerkeeler commented 1 year ago

Also, I'm assuming that once you fit your model you asked it to predict the civ's win rate given an Elo difference of 0 to get the "overall" win rate? Or am I misunderstanding?

gowerc commented 1 year ago

> I would love to incorporate some of your ideas into my site. I agree that your logistic regression model is a much better model than my naive win rates. I really appreciate your methods section.

If you need help with any of the methods do feel free to ask!

> I'm going to look to incorporate your model into my win rate calculations, if that's ok with you?

I don't own logistic regression :laughing: so definitely fine with me :smile:

> Are there any other confounding variables that you think would be appropriate to incorporate?

(apologies in advance for the wall of text here)

A simple list of things that would realistically affect the outcome of a match

But yeah, it's basically impossible to model all of the above because there just isn't enough data for each player (plus there are way too many civs). Below are some ideas I had to try and capture the above (albeit far from perfect):

> I'm also curious if you think the "Averaged Win Rates" as you put it on your website would be more appropriate? I guess it's a bit up to interpretation. My gut says the logistic regression model suffices.

Just to be clear, both win rate types that I show were created by regression models; the difference is basically how you weight them. In terms of which one is better, there is no correct answer, they just show different things. The "Averaged Win Rate" basically shows your civ's expected win rate assuming your opponent is selecting "Random", whilst the normal win rate shows your civ's win rate assuming your opponent is selecting civs based upon the observed pick rates (e.g. they are more likely to be Franks :smile: )
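As a toy sketch of the difference in weighting (the matchup and pick-rate numbers below are completely made up):

```python
import numpy as np

# rows = your civ, columns = opponent civ, values = P(row civ beats column civ)
matchups = np.array([
    [0.50, 0.55, 0.48],   # e.g. Franks vs (Franks, Cumans, Britons)
    [0.45, 0.50, 0.52],   # Cumans
    [0.52, 0.48, 0.50],   # Britons
])
pick_rates = np.array([0.5, 0.3, 0.2])   # observed opponent pick rates (sum to 1)

averaged_wr = matchups.mean(axis=1)      # "Averaged": opponent is effectively picking Random
weighted_wr = matchups @ pick_rates      # "Normal": opponent picks per observed pick rates
print(averaged_wr, weighted_wr)
```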

> I would also love to add some of those graphs that include all of the civs on one plot. I've been planning something similar for a bit so I will definitely use your site as inspiration for some more graphs.

That would be awesome if you could!! One of the original inspirations for creating my site was I wanted a more visual representation of the data. Your site was amazing for the raw information but (at least personally) I always found plots better for quick visual comparisons.

> Want me to give you credit somewhere on the site? Want a social link posted in the FAQ or footer or the like?

Oh, no need for this at all. I mean, if you want to, feel free, but I don't need / want any credit. I am just happy to see you are back developing, as the community really benefits from a resource like yours :smile:

> Thinking about it a bit more, I could do this same thing and add in map as a parameter to get more accurate map win rate results.

The thing is you would have to add it as a civ * map interaction term. That's perfectly doable, but the problem I found is that the model becomes very hard to fit computationally with so many parameters. I was running it on a 32GB RAM machine and still running out of memory, so I ended up reducing the model and downsampling the data in a few cases. I started looking into more memory-efficient implementations but didn't really get anywhere :cry:
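Roughly what I mean, sketched in the same formula style; the column names (w, c, m, d) are just placeholders, not a real schema:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import glm

def fit_with_map(df: pd.DataFrame):
    # df: one row per player per match with columns
    #   w = won (0/1), c = civ id, m = map id, d = Elo difference.
    # "C(c):C(m)" adds one column per civ/map pair to the design matrix, which with
    # ~45 civs and dozens of maps is exactly what makes the fit memory hungry.
    return glm("w ~ 0 + C(c):C(m) + d", df, family=sm.families.Binomial()).fit()
```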

gowerc commented 1 year ago

> Also I'm assuming once you fit your model you asked it to predict the civ's win rate given an Elo difference of 0 to get the "overall" win rate

Ya, pretty much this. Looking back over my code, I essentially structured the data as 1 row per player per match with columns civ, diff in elo, and won, e.g.

| match ID | civ     | diff in elo | won |
|----------|---------|-------------|-----|
| 1        | Franks  | +100        | 1   |
| 1        | Cumans  | -100        | 0   |
| 2        | Britons | +64         | 0   |
| 2        | Spanish | -64         | 1   |
| 3        | Huns    | +45         | 1   |
| 3        | Spanish | -45         | 0   |

For team games I just used the difference in mean team Elo (though, yeah, note my bullet point above about how this could be modelled better).

The model is then won ~ 0 + civ + diff_in_elo (using R's modelling notation; the "0" just means don't fit an intercept term). R automatically expands civ out into 1 column per civ. Not sure what language you use on the back end, but Python's patsy module provides similar functionality for creating proper design matrices from categorical data using the above formula notation.
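For example, a toy sketch of that expansion with patsy (rows copied from the made-up table above):

```python
import pandas as pd
from patsy import dmatrices

toy = pd.DataFrame({
    "won":         [1, 0, 0, 1, 1, 0],
    "civ":         ["Franks", "Cumans", "Britons", "Spanish", "Huns", "Spanish"],
    "diff_in_elo": [100, -100, 64, -64, 45, -45],
})

# "0 +" drops the intercept, so C(civ) expands to one indicator column per civ.
y, X = dmatrices("won ~ 0 + C(civ) + diff_in_elo", toy, return_type="dataframe")
print(X.columns.tolist())
```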

A couple of additional points that came to mind:

jerkeeler commented 1 year ago

So I've been playing around with this model and while I think my code is correct, I don't see a large difference in the predicted win rates versus the mean win rate. But perhaps I'm doing something incorrectly and not setting up my model right?

Here's the output, where data_df is a dataframe in which column w is whether the player won or lost, c is the civ number, and d is the difference in rating. C(c) indicates I'm treating the integer as a categorical variable. data_df has 809,582 observations and contains all players on the latest patch across all ratings.

In [74]: data_df.head()
Out[74]:
   w   c     d
0  0  28  16.0
1  1  40   4.0
2  0  16  -8.0
3  1  36   8.0
4  0  27 -22.0

In [75]: glm_mod = glm("w ~ 0 + C(c) + d", data_df, family=sm.families.Binomial()).fit()

In [76]: print(glm_mod.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      w   No. Observations:               809582
Model:                            GLM   Df Residuals:                   809539
Model Family:                Binomial   Df Model:                           42
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -5.5389e+05
Date:                Mon, 10 Apr 2023   Deviance:                   1.1078e+06
Time:                        15:06:29   Pearson chi2:                 8.10e+05
No. Iterations:                     5   Pseudo R-squ. (CS):            0.01780
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
C(c)[1]       -0.0458      0.014     -3.257      0.001      -0.073      -0.018
C(c)[2]       -0.1219      0.029     -4.140      0.000      -0.180      -0.064
C(c)[3]        0.1114      0.015      7.504      0.000       0.082       0.140
C(c)[4]       -0.0389      0.021     -1.826      0.068      -0.081       0.003
C(c)[5]       -0.1105      0.011    -10.401      0.000      -0.131      -0.090
C(c)[6]        0.0625      0.015      4.218      0.000       0.033       0.092
C(c)[7]       -0.0416      0.016     -2.652      0.008      -0.072      -0.011
C(c)[8]       -0.0679      0.022     -3.105      0.002      -0.111      -0.025
C(c)[9]       -0.0576      0.013     -4.393      0.000      -0.083      -0.032
C(c)[10]       0.0029      0.017      0.177      0.859      -0.030       0.035
C(c)[11]      -0.1814      0.016    -11.283      0.000      -0.213      -0.150
C(c)[12]      -0.0037      0.015     -0.252      0.801      -0.032       0.025
C(c)[13]       0.0125      0.026      0.479      0.632      -0.039       0.064
C(c)[14]       0.0123      0.012      1.001      0.317      -0.012       0.036
C(c)[15]       0.1236      0.008     14.894      0.000       0.107       0.140
C(c)[16]      -0.0115      0.014     -0.803      0.422      -0.040       0.017
C(c)[17]       0.1025      0.020      5.142      0.000       0.063       0.142
C(c)[18]      -0.0511      0.014     -3.651      0.000      -0.078      -0.024
C(c)[19]       0.0289      0.014      2.054      0.040       0.001       0.056
C(c)[20]      -0.0340      0.020     -1.669      0.095      -0.074       0.006
C(c)[21]      -0.0556      0.016     -3.436      0.001      -0.087      -0.024
C(c)[22]      -0.0048      0.015     -0.321      0.748      -0.034       0.025
C(c)[23]      -0.0633      0.013     -4.713      0.000      -0.090      -0.037
C(c)[24]      -0.1296      0.020     -6.639      0.000      -0.168      -0.091
C(c)[25]       0.0714      0.011      6.458      0.000       0.050       0.093
C(c)[26]       0.0141      0.012      1.210      0.226      -0.009       0.037
C(c)[27]      -0.1454      0.020     -7.354      0.000      -0.184      -0.107
C(c)[28]       0.0313      0.017      1.792      0.073      -0.003       0.066
C(c)[29]      -0.0691      0.012     -5.555      0.000      -0.094      -0.045
C(c)[30]       0.0483      0.010      4.708      0.000       0.028       0.068
C(c)[31]      -0.0273      0.016     -1.750      0.080      -0.058       0.003
C(c)[32]      -0.0240      0.015     -1.628      0.104      -0.053       0.005
C(c)[33]       0.1264      0.012     10.484      0.000       0.103       0.150
C(c)[34]      -0.1130      0.018     -6.356      0.000      -0.148      -0.078
C(c)[35]      -0.0341      0.022     -1.577      0.115      -0.076       0.008
C(c)[36]      -0.0424      0.020     -2.145      0.032      -0.081      -0.004
C(c)[37]       0.0486      0.012      3.951      0.000       0.024       0.073
C(c)[38]      -0.1010      0.018     -5.616      0.000      -0.136      -0.066
C(c)[39]       0.0300      0.013      2.235      0.025       0.004       0.056
C(c)[40]       0.1146      0.013      9.067      0.000       0.090       0.139
C(c)[41]      -0.2210      0.018    -12.349      0.000      -0.256      -0.186
C(c)[42]       0.0698      0.016      4.418      0.000       0.039       0.101
d              0.0079   7.67e-05    103.119      0.000       0.008       0.008
==============================================================================

In [77]: glm_mod.predict(pd.DataFrame({"c": [Civ.franks.value, Civ.huns.value, Civ.chinese.value], "d": [0, 0, 0]}))
Out[77]:
0    0.530856
1    0.507214
2    0.454768
dtype: float64

The very naive win rates right now respectively are:

53.11
50.63
45.45
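
For reference, here's roughly how the two sets of numbers can be put side by side (a sketch assuming the same data_df and fitted glm_mod as above, not my exact script):

```python
import pandas as pd

# Naive per-civ win rate straight from the data...
naive = data_df.groupby("c")["w"].mean()

# ...versus the model's prediction for every civ at an Elo difference of 0.
civs = sorted(data_df["c"].unique())
adjusted = glm_mod.predict(pd.DataFrame({"c": civs, "d": 0}))

comparison = pd.DataFrame({"naive": naive.loc[civs].values,
                           "adjusted": adjusted.values}, index=civs)
print(comparison.head())
```
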
gowerc commented 1 year ago

Hard to say for sure without access to the code & data, though nothing you've shown above looks obviously wrong. Must admit I am a bit surprised; I will double check what I was seeing with my historic cuts of the data. Some general thoughts:

1) Originally I was fitting on much smaller cuts of data than what you have used (e.g. >1200 Elo on open maps only), so I would assume you would see greater swings in the results on a comparable cut. Likewise, if I remember correctly, when I looked at this in the past I was seeing swings on the order of 0.2-0.4 percentage points (e.g. 50.4 -> 50.8). I would be curious what you see if you filter to >=1200 Elo (rough sketch of this below).
2) Generally civ choice matters less at lower Elos (and they make up the bulk of the data). Part of me is wondering if the difference in Elo matters less at lower levels as well? I never looked into a variable Elo coefficient (also sketched below).
3) You can see from the pseudo R-squared value that the overall predictive power is quite poor; this means there is still a lot of opportunity to refine the model based upon some of the other factors I mentioned (this was the main thing I was excited to push into but never got around to).
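
A rough sketch of what points 1 and 2 could look like in the same statsmodels style (note: this assumes a "rating" column for the match's mean Elo, which isn't in the data_df shown above, and the band edges are arbitrary):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import glm

# 1) Refit on the >= 1200 Elo slice only ("rating" is an assumed extra column).
high_df = data_df[data_df["rating"] >= 1200]
glm_high = glm("w ~ 0 + C(c) + d", high_df, family=sm.families.Binomial()).fit()

# 2) Variable Elo coefficient: let the slope on d differ by rating band.
data_df["band"] = pd.cut(data_df["rating"], bins=[0, 1000, 1200, 1600, 4000])
glm_band = glm("w ~ 0 + C(c) + d:C(band)", data_df, family=sm.families.Binomial()).fit()
```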

gowerc commented 1 year ago

Doing some quick sanity checks: the theoretical win % of someone with a 25 Elo advantage is 53.59%, and according to your model it's coming out as 54.92%, which is very much in the same ballpark, so I would be very surprised if there was a mistake in your code.
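Writing that arithmetic out explicitly (standard Elo expected-score formula vs. the fitted d coefficient of 0.0079 from the summary above):

```python
import math

d = 25
elo_expected = 1 / (1 + 10 ** (-d / 400))          # standard Elo formula        -> ~0.5359
model_expected = 1 / (1 + math.exp(-0.0079 * d))   # logistic with fitted slope,
                                                   # civ term ignored            -> ~0.5492
print(round(elo_expected, 4), round(model_expected, 4))
```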