Closed tissatussa closed 1 year ago
Hello! Thank you for this thorough test and for the rating estimate!
Your script looks nice, getting the ratings straight from the PGN file is very practical :) From what I've seen, you are reading and parsing the file by hand. As a suggestion, you may possibly simplify that part with python-chess and its PGN parsing component.
I have very little experience with multi-player ratings, but I have used the ordo tool in the past that does something similar to these calculations. This post from Lc0 folks introduces it and describes how to use it. Maybe these estimates can be cross-referenced with this tool.
On a final note, can you describe the testing conditions for the games (CPU, number of threads, hash size, books, adjudication, etc)?
My script is simple and this is a first version. It has just a few parameters, which could be exposed as command-line options, but not in this version. Also, my (invented?) recursive process "new-rating-is-start-rating", for determining the rating of an engine with an unknown rating, could be automated -- but by hand this process took me only a few steps: after about 6 iterations the final rating no longer changes. This method seems natural to me; I think it gives the real rating of the one engine with an unknown rating. I guess the number of games should be rather high, though, to give an accurate result, so maybe 105 games is insufficient ..
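For readers who want to see the idea in code: a minimal, self-contained sketch of that fixed-point process (the game list and ratings below are made-up examples, not the attached data; the real script reads them from the PGN headers).

```python
def expected(r_a, r_b):
    """Expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def final_rating(start, games, k=32):
    """Replay all games once, updating only the unknown engine's rating."""
    r = start
    for opp_rating, score in games:  # score: 1.0 win, 0.5 draw, 0.0 loss
        r += k * (score - expected(r, opp_rating))
    return r

def fixed_point_rating(games, start=1000.0, tolerance=0.5, max_rounds=100):
    """Feed the final rating back in as the start rating until it settles."""
    for _ in range(max_rounds):
        final = final_rating(start, games)
        if abs(final - start) < tolerance:
            return round(final)
        start = final
    return round(start)

# Hypothetical results: (opponent rating, score from the unknown engine's side).
games = [(2300, 1.0), (2450, 0.5), (2500, 0.0), (2250, 1.0), (2400, 0.5)]
print(fixed_point_rating(games))
```

Because each pass moves the rating toward the opponents' strength by less than the full gap, feeding the output back in shrinks the difference every round, which is why the by-hand version settled after about 6 steps.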
I know the python-chess module; I have used it in other scripts. It's very nice and has a lot of possibilities, especially for parsing chess moves. However, I only needed to extract the player names, their ratings and the game result from the PGN headers, and that's fairly simple -- no full module like python-chess is needed. I was not aware of the existence of Ordo, nor of programs with similar goals like BayesElo and EloStat, which are mentioned on the Ordo info page https://sites.google.com/site/gaviotachessengine/ordo . Ordo seems rather extensive, and from what I have read so far it's not clear how to set the rating of each engine in the tournament, especially when using CuteChess (which I did, but the GUI version), because CuteChess lacks a rating field for an engine, as I stated. My script aims to calculate the rating of one engine with an unknown ("?") rating by letting it play against many other engines which DO have a rating. This method seems practical to me; I imagined it while I was curious about Pawn's rating, and it was easy to implement: I wrote this script in a few hours -- adding all the ratings to the PGN file was more time-consuming (well, not really) :-)
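To illustrate why the header-only case is simple: a stdlib-only sketch of that kind of extraction (this is not the attached script, just the general idea; the tag names follow the PGN standard).

```python
import re

# A PGN tag pair looks like: [White "Pawn"]
TAG_RE = re.compile(r'\[(\w+)\s+"([^"]*)"\]')

def read_headers(pgn_text):
    """Yield one dict of header tags per game in the PGN text."""
    tags = {}
    for line in pgn_text.splitlines():
        m = TAG_RE.match(line.strip())
        if m:
            tags[m.group(1)] = m.group(2)
        elif tags and line.strip() and not line.startswith("["):
            # movetext reached: the header section of this game is complete
            yield tags
            tags = {}

sample = '''[White "Pawn"]
[Black "SomeEngine"]
[WhiteElo "?"]
[BlackElo "2500"]
[Result "1-0"]

1. e4 e5 2. Nf3 1-0
'''
for h in read_headers(sample):
    print(h["White"], h["WhiteElo"], h["Result"])
```

Only the tags White, Black, WhiteElo, BlackElo and Result are needed for the rating calculation; the movetext can be ignored entirely.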
Anyone can use my 105 PGN games to recalculate a rating for Pawn, using Ordo or another program, to verify the result .. I'm not planning to do so; I'm happy with the outcome.
About the testing conditions: the Linux command lscpu gives this output for my HP Elite X2 notebook:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 142
Model name: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
Stepping: 9
CPU MHz: 3100.001
CPU max MHz: 3100.0000
CPU min MHz: 400.0000
BogoMIPS: 5399.81
L1d cache: 64 KiB
L1i cache: 64 KiB
L2 cache: 512 KiB
L3 cache: 3 MiB
NUMA node0 CPU(s): 0-3
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds: Mitigation; Microcode
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xt opology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
Threads: I used 1 for all engines. Hash size: I used 64 MB for all engines. No opening books were involved; if an engine has such an option, I turned it OFF. Regarding adjudication: I removed all games with an 'irregular' ending, like 'loss on time', 'disconnected', 'illegal move', etc.
Pawn has no published rating, so I developed a method to calculate it: 2374. I've attached some files to reproduce this.
Recently I played a lot of test games in CuteChess, after installing many (Linux-based) engines, from (very) weak to (very) strong. To see whether an engine runs properly in CuteChess, I let it play against Pawn from the start position, with (mostly) 8 minutes per player plus a 3-second increment. CuteChess saved all played games in a PGN file, including the thinking time per move and its evaluation value, but not the ratings: CuteChess has no data field for them .. so I added the tags 'WhiteElo' and 'BlackElo' to the PGN of each game (using the CCRL rating lists), marking Pawn's unknown rating with "?".
Then I created a simple Python terminal script which reads all results and calculates the rating of Pawn, as if the game list were a tournament with many (105) rounds in which the rating is updated after each game. I used the official Elo calculation, which I found at https://herculeschess.com/how-chess-rating-is-calculated/ - it's all about the "expected score" implied by the rating difference of the players.
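The expected-score formula from that page, and the per-game update built on it, can be written in a few lines (a sketch of the standard Elo rule, not the attached script itself):

```python
# Expected score: E = 1 / (1 + 10^((R_opp - R) / 400))
# Per-game update: R' = R + K * (S - E), with S = 1 win, 0.5 draw, 0 loss.
def expected_score(rating, opp_rating):
    return 1.0 / (1.0 + 10.0 ** ((opp_rating - rating) / 400.0))

def update(rating, opp_rating, score, k=32):
    return rating + k * (score - expected_score(rating, opp_rating))

# Equal ratings give an expected score of exactly 0.5, so a draw changes nothing.
print(expected_score(1500, 1500))   # 0.5
print(update(1500, 1500, 0.5))      # 1500.0
```

A win against an equally rated opponent then gains exactly K/2 points, and beating a much stronger opponent gains nearly the full K.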
The game result list is the output of the script when Pawn has a start rating of 2374: (only then) its final rating is the same. You can change this start rating at the top of the script .. e.g. 1000 gives a final rating of 2087 and 2000 gives 2312 .. so, by doing this recursively (next give start rating 2312, etc.) I calculated that Pawn has rating 2375. My first guess was 2300 .. not too bad!
I'm not sure I did the calculation the right way. Can anybody confirm, or give a link to more / better info about this?
Note 1: the K-factor depends on a 'master rating': above it K=16, below it K=32 .. I set this threshold to 2400 (you can change this in the script) because I found no info about its real value. Note 2: by changing the order of the games, Pawn's final rating will also change (a bit).
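Note 2 is a consequence of updating the rating sequentially: with the standard rule R' = R + K * (S - E), each update depends on the rating reached so far, so the same set of results in a different order ends at a (slightly) different value. A quick self-contained check with made-up ratings and K=32:

```python
def expected(r, opp):
    """Expected score against an opponent rated opp."""
    return 1.0 / (1.0 + 10.0 ** ((opp - r) / 400.0))

def run(start, games, k=32):
    """Apply Elo updates game by game; games are (opponent_rating, score)."""
    r = start
    for opp, score in games:
        r += k * (score - expected(r, opp))
    return r

games = [(1600, 1.0), (1400, 0.0)]      # one win, one loss
a = run(1500, games)                    # win first
b = run(1500, list(reversed(games)))    # loss first
print(round(a, 1), round(b, 1))         # slightly different final ratings
```

The difference stays small (a point or two here), which matches the observation that reordering the 105 games only shifts Pawn's final rating "a bit".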
Pawn_games2.pgn.zip
ratingcalc.py.zip
games_105_scores.txt.zip
How Chess Rating Is Calculated_ Crunching The Numbers - herculeschess.com.pdf