a "number of games" parameter (needs looking into: are we randomly pitting "players" against each other, or are we going through all possible games?)
And returns a dictionary whose keys are model names and whose values are Elo ratings.
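A minimal sketch of what such a function could look like, assuming the random-pairing option and the standard Elo update. The names `compute_elo` and `play_game`, the K-factor of 32, and the starting rating of 1000 are all assumptions made for illustration; `play_game` is a hypothetical callback that decides a single game and is not specified anywhere in these notes.

```python
import random
from typing import Callable, Dict, List


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def compute_elo(
    model_names: List[str],
    num_games: int,
    play_game: Callable[[str, str], float],  # hypothetical: returns A's score
    k: float = 32.0,                         # assumed K-factor
    initial_rating: float = 1000.0,          # assumed starting rating
    seed: int = 0,
) -> Dict[str, float]:
    """Return {model name: Elo rating} after num_games randomly paired games.

    play_game(a, b) must return 1.0 if a wins, 0.5 for a draw, 0.0 if a loses.
    """
    rng = random.Random(seed)
    ratings = {name: initial_rating for name in model_names}
    for _ in range(num_games):
        # Random pairing: draw two distinct list positions.
        a, b = rng.sample(model_names, 2)
        score_a = play_game(a, b)
        delta = k * (score_a - expected_score(ratings[a], ratings[b]))
        # B's change mirrors A's, because expected and actual scores sum to 1.
        ratings[a] += delta
        ratings[b] -= delta
    return ratings


if __name__ == "__main__":
    # Stand-in for a real comparison: the "larger" model always wins.
    size = {"distilgpt2": 0, "gpt2": 1, "gpt2-medium": 2}
    print(compute_elo(list(size), 100, lambda a, b: float(size[a] > size[b])))
```

Because the result is keyed by model name, passing the same name several times would collapse into a single dictionary entry, so testing with identical models would probably need distinct labels per entry.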
This part on the wiki page also seems relevant for implementation:
An example may help to clarify: Suppose player A has a rating of 1613...
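The kind of calculation this passage walks through can be sketched as below: a player's new rating is the old rating plus K times the difference between the actual and the expected total score over a set of games. The function names, the K-factor of 32, and the opponent ratings in the usage lines are assumptions chosen for illustration; they are not taken from the truncated passage.

```python
from typing import Sequence


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def updated_rating(
    rating: float,
    opponent_ratings: Sequence[float],
    scores: Sequence[float],  # 1.0 win, 0.5 draw, 0.0 loss, one per opponent
    k: float = 32.0,          # assumed K-factor
) -> float:
    """Apply one Elo update covering all games of an event."""
    expected_total = sum(expected_score(rating, r) for r in opponent_ratings)
    actual_total = sum(scores)
    return rating + k * (actual_total - expected_total)


# Hypothetical opponents, not the ones from the wiki example:
# against an equally rated opponent the expected score is 0.5,
# so a win moves a 1613-rated player up by 32 * 0.5 = 16 points.
print(updated_rating(1613, [1613], [1.0]))  # 1629.0
```

Updating once per event, as here, or once per game are both common; which one to use depends on how the games are scheduled.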
Suggestion: Test using multiple small models: distilgpt2, gpt2, gpt2-medium, for example. Actually, it should also be possible to simply pass in a list of, e.g., three identical model names, right?
We should have a function which receives as arguments: