Ranking System Initial Investigation

amosborne commented 3 years ago

Develop some methodology for quantifying the effectiveness of the ranking system algorithm (in the MLP specification, currently proposed as Glicko2, potentially with some TBD modifications).

With Hiku's World Ranking as a benchmark, run the initial proposed algorithm against the test database and compare the results to Hiku's rating system. I propose doing this in a Jupyter Notebook, with plots and metrics to visualize the rating distribution and how well the relative order and uncertainty correlate between the two systems.

Ultimately our own rating system goals will need to be defined with corresponding metrics to quantify algorithm effectiveness, but creating a simple visualization to compare systems and get a feel for the data will be a great first step.

amosborne commented 3 years ago

d659b8aa2adb6b3524a88c3199c8c1896cd01870 adds a new poetry run hikuwr --extract option to create two new databases for player rating experimentation/testing.

The datasets are entirely distinct. They are both from the surge of new players/matches when puyolobby.com was released. The dataset is split into two approximately equal-sized portions, divided along community lines. Each dataset covers a two-month timespan with about 1000 matches total across 250 players.

Note that the community algorithm used to split the datasets is non-deterministic and therefore the resulting databases will change each time the command is invoked.

amosborne commented 3 years ago

In 9252d5c69f5f00f62503ecfcc33c55115c9532cb I propose the first metric to validate any proposed rating algorithm: ranking order correlation. With Hiku's World Ranking as the benchmark, the two extracted puyolobby datasets are run through the Donguri Gaeru rating algorithm, the players' ranking orders under the two systems are compared, and the Pearson correlation coefficient is computed. A successful rating algorithm will yield a correlation close to 1.
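As a rough illustration of the metric described above (not the code from the commit), ranking order correlation could be sketched as follows. The function names and the dict-of-ratings input shape are assumptions made for this example. Computing Pearson's coefficient on rank positions is equivalent to Spearman's rank correlation when there are no ties.

```python
from math import sqrt


def ranking_positions(ratings):
    """Map each player to a 1-based rank; the highest rating gets rank 1.

    Ties are broken by dict insertion order (an arbitrary choice for this sketch).
    """
    ordered = sorted(ratings, key=ratings.get, reverse=True)
    return {player: position for position, player in enumerate(ordered, start=1)}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranking_order_correlation(benchmark, candidate):
    """Correlate the rank order of players present in both rating tables."""
    players = sorted(set(benchmark) & set(candidate))
    bench_ranks = ranking_positions({p: benchmark[p] for p in players})
    cand_ranks = ranking_positions({p: candidate[p] for p in players})
    return pearson(
        [bench_ranks[p] for p in players],
        [cand_ranks[p] for p in players],
    )
```

Two systems that order the players identically give a value of 1, a fully reversed order gives -1, and an unrelated (for example random) ordering tends toward 0.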

At present, the Donguri Gaeru rating algorithm is entirely random and the resulting correlation is near zero. A plot is also provided to visualize the correlation.

mazziechai commented 3 years ago

I'm not sure what use a random algorithm will have for testing. I was under the impression that we were using Glicko-2, as it's a popular, easy, and well-tested option; from the looks of this, it seems like another approach is being taken, as in, we are developing our own system? I may have misunderstood the intentions, so it'd be great if you could clarify.

amosborne commented 3 years ago

The random algorithm was literally just so I could generate a plot and test the code I've written so far.

mazziechai commented 3 years ago

Thanks, I was unsure. If it's alright, I can go ahead and write a Glicko-2 implementation to use.

mazziechai commented 3 years ago

I did some research into ways we could represent a ranking for a leaderboard, and I found GLIXARE.

GLIXARE is a formula used by TETR.IO and some other games with rating systems to approximate player skill as a single number, which is used to get a percentage of how likely a player is to win a match against an opponent. It's outlined here: https://www.smogon.com/forums/threads/gxe-glixare-a-much-better-way-of-estimating-a-players-overall-rating-than-shoddys-cre.51169/

The formula in question (in Python syntax):

round(10000 / (1 + 10**(((1500 - rating) * pi / sqrt(3 * log(10)**2 * rd**2 + 2500 * (64 * pi**2 + 147 * log(10)**2)))))) / 100
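For experimentation, the one-liner above can be wrapped in a small function. This is only a sketch: the name `glixare` and the parameter names are chosen for this example, and `log` is the natural logarithm from Python's `math` module.

```python
from math import log, pi, sqrt


def glixare(rating, rd):
    """GLIXARE percentage for a Glicko-style rating and rating deviation (RD)."""
    # Same expression as the one-liner above, split up for readability.
    exponent = (1500 - rating) * pi / sqrt(
        3 * log(10) ** 2 * rd ** 2
        + 2500 * (64 * pi ** 2 + 147 * log(10) ** 2)
    )
    return round(10000 / (1 + 10 ** exponent)) / 100
```

A rating of 1500 always maps to 50.0 regardless of RD, and a larger RD pulls the value for any other rating back toward 50.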

If players only receive a rank through this formula once their RD is below 100, then it would be pretty suitable for producing a more concrete number for rankings. Try applying it to Hiku's World Ranking to see if it works better.

Thoughts?

amosborne commented 3 years ago

This is a neat idea. I’m not sure how many people in my test data actually get below 100 RD, but I will try this and see.

amosborne commented 3 years ago

I propose to close this issue upon merge of the linked pull request. Hiku's World Ranking rating algorithm has been implemented and validated according to the following:

The original Java implementation and the Python adaptation are executed side-by-side and demonstrated to produce the same result. The result that was posted on the wiki is also provided for comparison.
Algorithm convergence to a stable rating is plotted over iterations; the algorithm has been adapted to terminate on a convergence criterion as well as an iteration limit.
The ability of the rating algorithm to predict match results is demonstrated in a few cases, and an automated script is provided to generate additional charts if necessary. Common matchups from the test database are listed in a file for convenience.
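
For readers unfamiliar with the termination scheme mentioned above, a generic sketch of stopping on either a convergence criterion or an iteration limit might look like this. It is not Hiku's update step or the code in the pull request: the `update` callable, the `tolerance` and `max_iterations` defaults, and the function name are all hypothetical.

```python
def iterate_until_stable(update, initial, tolerance=1e-6, max_iterations=1000):
    """Apply `update` repeatedly until ratings stop changing or the limit is hit.

    `update` is a hypothetical callable that takes a dict of ratings and
    returns a new dict with the same keys.
    Returns (ratings, iterations_used, converged).
    """
    ratings = dict(initial)
    for iteration in range(1, max_iterations + 1):
        updated = update(ratings)
        # Convergence criterion: the largest per-player change is below tolerance.
        largest_change = max(
            (abs(updated[p] - ratings[p]) for p in ratings), default=0.0
        )
        ratings = updated
        if largest_change < tolerance:
            return ratings, iteration, True
    # The iteration limit was reached without meeting the criterion.
    return ratings, max_iterations, False


def toy_update(ratings):
    """Stand-in update used only for illustration: move halfway toward 1500."""
    return {p: (r + 1500.0) / 2 for p, r in ratings.items()}


final, steps, converged = iterate_until_stable(toy_update, {"a": 1600.0, "b": 1400.0})
```

Recording `largest_change` at each iteration is one way to produce the kind of convergence plot described above.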

Moving forward, the most important decision is how to present player ratings/rankings on the website so that they best communicate a player's progression. For the initial release it may be best to simply show both ratings and rankings on the website, along with a short write-up of how to interpret those numbers.

mazziechai / DonguriGaeru

Ranking System Initial Investigation #6