Running `curl -fLO https://lichess.org/api/games/user/Lichess/imports` exports the games the "user" Lichess imported (in particular, this seems to be the only way to get the masters database that lichess uses?). Note that this takes 22 hours to download, so probably don't reproduce this by doing what I did.
Some of the games exported like this are missing the newline between them: the Event tag of the second game starts on the same line as the moves/outcome of the first game. This occurs in 892960 of 1627601 games, if my regex is correct.
Here's a portion of the file where I first noticed it (scroll right to see the Event tags).
I used the following regex to count instances of this (matching the tail end of the outcome line, followed by the Event tag): `'[012*]\[Event '`
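For anyone who wants to double-check the numbers, here is a rough sketch of how such a count could be done with grep. It is not necessarily the exact command used for the figures above; the helper names `count_fused` and `count_games` and the three-game sample are made up for illustration, and it assumes one `[Event ` tag per game.

```shell
# count_fused FILE: print how many game boundaries are missing their newline.
# grep -o emits one output line per match, so wc -l counts occurrences
# rather than lines (grep -c would count matching lines only).
count_fused() {
    grep -oE '[012*]\[Event ' "$1" | wc -l | tr -d ' '
}

# count_games FILE: print the total number of games,
# assuming every game carries exactly one [Event tag.
count_games() {
    grep -oF '[Event ' "$1" | wc -l | tr -d ' '
}

# Tiny made-up sample: three games, the first two fused on one line.
sample=$(mktemp)
printf '%s\n' \
    '[Event "A"]' '' '1. e4 e5 1-0[Event "B"]' '' '1. d4 d5 1/2-1/2' '' \
    '[Event "C"]' '' '1. c4 *' > "$sample"

count_fused "$sample"   # prints 1
count_games "$sample"   # prints 3
```

On the real file this would be `count_fused imports.pgn` and `count_games imports.pgn`.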
I've uploaded the complete pgn file (gzipped to 374M, uncompressed 1.1G) here for anyone who wants to see it, or just wants the data.
PS. I hope downloading this didn't impose too much load on the servers. I did get a bit worried when I realized how slow it was going.
PPS. `sed -E 's/([012*])(\[Event )/\1\n\2/g' imports.pgn > imports-newlines.pgn` should fix the file, if you're interested in using it for the data and not here for the actual bug.
Edit: Updated regex for games that end with result `*` (there are 4, all of which also suffer from this bug). PPPS. What does `*` mean anyways?