Closed by johncs999 3 months ago
We randomly sample 10M games from the February 2023 split available at https://database.lichess.org. For testing, we use 1k games randomly sampled from the March 2023 split (see section 2.1 of our paper).
Thanks. BTW, how can I obtain a subset of 10^4 games from the current data? For behavioral cloning, can I simply use the first 589,130 records?
We use Apache Beam to process the file in parallel. To select a fixed number of games, we use beam.combiners.Sample.FixedSizeGlobally(num_games).
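For anyone who wants a uniform fixed-size sample without setting up Beam, here is a minimal single-machine sketch using reservoir sampling from the Python standard library. It is not the authors' pipeline, only an illustration of the same idea (draw num_games games uniformly at random instead of taking the first ones). The helper names iter_pgn_games and sample_fixed_size are hypothetical, and the splitter assumes every game in the PGN file starts with an [Event tag line.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def iter_pgn_games(lines: Iterable[str]) -> Iterator[str]:
    """Yield one game at a time from PGN lines.

    Assumption: each game begins with a line starting with '[Event '.
    """
    current: List[str] = []
    for line in lines:
        # A new '[Event ' tag marks the start of the next game.
        if line.startswith("[Event ") and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)


def sample_fixed_size(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random in one pass (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

With a decompressed monthly file opened as f, something like sample_fixed_size(iter_pgn_games(f), 10_000, seed=0) would return 10^4 games while holding only the sample in memory. Taking the first N records instead would bias the sample toward games played at the start of the month.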
Hi there, thanks for the interesting work.
I'm curious about the training and test games, as I noticed there are 98M games in the 2023-February split on lichess. Did you use the first 10M as training games and the following 1k games for testing?