jcw024 / lichess_database_ETL

pipeline for migrating lichess data into postgresql

How do you define "player's first stable rating"? #2

Open aramcb opened 1 year ago

aramcb commented 1 year ago

I couldn't tell from the code how this was defined. Have I missed something?

From what I can tell from lichess' open database files, they do not indicate presence or absence of provisional, i.e., "?" rating.

jcw024 commented 1 year ago

You are right about the provisional ratings: the dataset doesn't mark a provisional rating with a question mark. This was also something I wondered about, since there wasn't a foolproof way to identify a provisional rating from the individual data points. During the analysis, I ended up treating a game as a provisional-rating game when the ratingdiff (the rating gain/loss after the game) was greater than 30, since the rating gain should get smaller as more games are played. For example: https://github.com/jcw024/lichess_database_ETL/blob/main/analytics/psycopg2_query.py#L185 You could set the cutoff higher or lower if you prefer. Thinking back, I think it would have been better to set it lower, maybe to 10 or 15. I think I didn't want to exclude rated games between players with a large rating gap where the lower-rated player beats the higher-rated player, but those situations are probably extremely rare.
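To make the cutoff idea concrete, here is a minimal sketch of the heuristic described above. It is not the code from the linked query. The function names, the `(rating_before, rating_diff)` tuple layout, and the use of the absolute value of the diff are all assumptions made for illustration.

```python
def is_provisional(rating_diff, cutoff=30):
    """Treat a game as 'provisional' when the rating change is large.

    Assumption: we compare the magnitude of the change, so big losses
    count the same way as big gains.
    """
    return abs(rating_diff) > cutoff


def first_stable_rating(games, cutoff=30):
    """Return the rating before the first game whose change is within the cutoff.

    `games` is a hypothetical chronological list of
    (rating_before, rating_diff) tuples for one player.
    Returns None if every game looks provisional.
    """
    for rating_before, rating_diff in games:
        if not is_provisional(rating_diff, cutoff):
            return rating_before
    return None


# Made-up example history: large swings early on, small ones later.
history = [(1500, 170), (1670, -95), (1575, 42), (1617, 9), (1626, -8)]
stable = first_stable_rating(history)
```

Lowering `cutoff` to 10 or 15 makes the filter stricter, at the cost of possibly discarding genuine upsets between stable players with a large rating gap.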

aramcb commented 1 year ago

The rating gain [RatingDiff] should get smaller as more games are played.

  1. Can you elaborate on why this is the case? I'm looking at some of my first games on Lichess and this seems accurate: i.e., a game between a 1500? and a 1567 results in a much larger RatingDiff (gain/loss) for the provisional (?) player than for the stable player.
  2. If this is true (that RatingDiffs are larger for players with provisional ratings), then this is a good method to prevent players who are accurately rated at 1500 (the starting rating) from gaining/losing too much Elo simply because their accurate rating is the default and they are winning/losing against players with provisional ratings.

Also: According to Lichess FAQ:

[Provisional]...it means that the Glicko-2 deviation is greater than 110

PS: Thanks for this very cool analysis! I'm awed by the work you put into assembling the relational database. (I'm limiting my own analysis to one year, as the setup overhead looks very daunting!)

jcw024 commented 1 year ago

I think there's a pretty good explanation of how the rating diff changes over time/games in another Lichess FAQ entry here.
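As a rough illustration of why the diff shrinks, here is a sketch of a single-game update based on the published Glicko-2 formulas. It is simplified: volatility is held fixed instead of being re-estimated, each game is treated as its own rating period, and the function name, the default volatility of 0.06, and the example deviations are assumptions. Lichess's actual implementation may differ in its details.

```python
import math

SCALE = 173.7178  # Glicko-2 conversion factor between rating points and the internal scale


def glicko2_rating_diff(rating, rd, opp_rating, opp_rd, score, volatility=0.06):
    """Approximate rating change for one game (score: 1 win, 0.5 draw, 0 loss).

    Simplification: volatility is kept constant rather than updated.
    """
    mu = (rating - 1500) / SCALE
    phi = rd / SCALE
    mu_j = (opp_rating - 1500) / SCALE
    phi_j = opp_rd / SCALE

    # g() discounts the result according to the opponent's own uncertainty.
    g = 1 / math.sqrt(1 + 3 * phi_j**2 / math.pi**2)
    expected = 1 / (1 + math.exp(-g * (mu - mu_j)))
    v = 1 / (g**2 * expected * (1 - expected))

    # Deviation after the game; the rating change scales with its square.
    phi_star = math.sqrt(phi**2 + volatility**2)
    phi_new = 1 / math.sqrt(1 / phi_star**2 + 1 / v)
    mu_new = mu + phi_new**2 * g * (score - expected)
    return (mu_new - mu) * SCALE


# Same win against the same opponent, with decreasing deviation for the player.
for rd in (200, 110, 50):
    print(rd, round(glicko2_rating_diff(1500, rd, 1500, 60, 1), 1))
```

The rating change is multiplied by the square of the player's updated deviation, so a high deviation (a provisional player) produces a large swing, and the swing shrinks as more games bring the deviation down.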

I'm glad you enjoyed the analysis. It's been over a year since I worked on this, so I'm happy people are still finding it useful/interesting. I think the main hurdle is the setup, but once it's set up, the process is the same whether you download 1 year or multiple years. It just depends on how much storage you have on your machine and how long you want to wait for everything to download and process. More recent years have a lot more data than earlier years and will take more memory/time to process.