lichess-org / lila

♞ the forever free, adless and open source chess server ♞
GNU Affero General Public License v3.0

Missing newline between games in export of imported games #11708

Closed: gmorenz closed this issue 1 year ago

gmorenz commented 1 year ago

Running curl -fLO exports the games that the "user" Lichess imported (in particular, this seems to be the only way to get the masters database that Lichess uses?). Note that the download takes 22 hours, so you probably shouldn't reproduce this the way I did.

Some of the games exported like this are missing the newline between them: the Event tag of the second game starts on the same line as the moves and result of the first game. This occurs in 892960 games out of 1627601, if my regex is correct.

Here's a portion of the file where I first noticed it (scroll right to see the Event tags).

1. d4 Nf6 2. c4 e6 3. Nf3 d5 4. Nc3 c6 5. Bg5 h6 6. Bh4 dxc4 7. e4 g5 8. Bg3 b5 9. Be2 Bb7 10. h4 Rg8 11. hxg5 hxg5 12. a3 Nbd7 13. Qd2 g4 14. Ne5 Nxe5 15. Bxe5 Nd7 16. Bf4 a5 17. d5 cxd5 18. exd5 Qb6 19. Rh7 b4 20. dxe6 Qxe6 21. Nb5 Rc8 22. Rd1 Bxg2 23. Nc7+ Rxc7 24. Bxc7 bxa3 25. bxa3 Bxa3 26. Bxa5 Be7 27. Rh6 f6 28. Rh7 Rg5 29. Bb4 Nc5 30. Bxc5 Rxc5 31. Rh8+ Kf7 32. Rh7+ Ke8 33. Rh8+ Kf7 1/2-1/2[Event "TCh-CZE Extraliga 2018-19"]
[Site "Czech Republic CZE"]
[Date "2018.11.04"]
[Round "2.6"]
[White "Babula, V."]
[Black "Caletka, R."]
[Result "1/2-1/2"]
[WhiteElo "2522"]
[BlackElo "2305"]
[WhiteTeam "1. Novoborsky SK"]
[BlackTeam "SK Slavoj Poruba"]

1. d4 Nf6 2. c4 e6 3. Nc3 Bb4 4. e3 O-O 5. Bd3 d5 6. Nf3 c5 7. O-O cxd4 8. exd4 dxc4 9. Bxc4 b6 10. Bg5 Bb7 11. Re1 Nbd7 12. Rc1 Qb8 13. Bb3 Rc8 14. Qd3 a6 15. Ne5 Nxe5 16. dxe5 Ne8 17. a3 Bf8 18. h4 Bc6 19. h5 Qb7 20. Bc2 g6 21. hxg6 hxg6 22. Qh3 Bg7 23. Re3 b5 24. Rce1 Rab8 25. Bb1 a5 26. Ne2 b4 27. axb4 Qxb4 28. Nf4 Qb7 29. Qh4 Qxb2 30. Bxg6 fxg6 31. Nxg6 Rc7 32. Rh3 Qb4 33. Qh7+ Kf7 34. Nh8+ Kf8 35. Ng6+ Kf7 36. Rc1 Be4 37. Rf3+ Bxf3 38. Nh8+ Kf8 39. Ng6+ Kf7 40. Nh8+ Kf8 41. Ng6+ Kf7 42. Nh8+ 1/2-1/2[Event "Basingstoke e2e4 op"]
[Site "Basingstoke"]
[Date "2012.10.31"]
[Round "9"]
[White "Roberson, Peter T"]
[Black "Holland, James P"]
[Result "1-0"]
[WhiteElo "2357"]
[BlackElo "2251"]

1. e4 c5 2. c3 b6 3. d4 Bb7 4. Nd2 e6 5. Ngf3 d6 6. Bd3 Ne7 7. h4 Nbc6 8. a3 h5 9. b4 Ng6 10. Nb3 Be7 11. Bg5 Qc7 12. Rc1 a6 13. Be3 Nb8 14. Nbd2 Nd7 15. Nf1 Nf6 16. Ng3 Bc6 17. Qe2 Qb7 18. Bd2 Nf8 19. Ng5 g6 20. O-O Ng8 21. a4 b5 22. bxc5 dxc5 23. Rb1 f6 24. Nf3 Rb8 25. axb5 axb5 26. Bf4 e5 27. Nxe5 fxe5 28. Bxe5 Nf6 29. Bxb8 Qxb8 30. Bxb5 Bxb5 31. Rxb5 Qf4 32. e5 Nd5 33. Rb8+ Bd8 34. Qb5+ Nd7 35. e6 O-O 36. exd7 Qxh4 37. Qc4 Qg5 38. Ne4 Qf5 39. Rxd8 1-0[Event "Bundesliga 8990"]

I used the following regex to count instances of this (it matches the last character of the result on the move line, followed by the Event tag): '[012*]\[Event '
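For anyone who would rather count from a script, here is a minimal Python sketch of the same check. The function name `count_fused_games` and the sample lines are made up for illustration; only the pattern itself comes from the regex above.

```python
import re

# Same pattern as the regex above: the last character of a result token
# ("1-0", "0-1", "1/2-1/2" or "*") immediately followed by the next Event tag.
FUSED_BOUNDARY = re.compile(r'[012*]\[Event ')


def count_fused_games(lines):
    """Count places where an Event tag starts on the previous game's result line."""
    return sum(len(FUSED_BOUNDARY.findall(line)) for line in lines)


# Hypothetical sample, not taken from the real export.
sample = [
    '1. e4 e5 2. Nf3 1/2-1/2[Event "Second game"]\n',
    '[Site "Somewhere"]\n',
    '\n',
    '1. d4 d5 *[Event "Third game"]\n',
    '[Site "Elsewhere"]\n',
    '\n',
    '1. c4 1-0\n',
]
print(count_fused_games(sample))  # 2
```

Passing an open file object instead of the list works the same way and keeps memory use flat, since the file is read one line at a time.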

I've uploaded the complete PGN file (gzipped to 374M, uncompressed 1.1G) here for anyone who wants to see it, or just wants the data.

PS. I hope downloading this didn't impose too much load on the servers. I did get a bit worried when I realized how slow it was going.

PPS. sed -E 's/([012*])(\[Event )/\1\n\2/g' imports.pgn > imports-newlines.pgn should fix the file, if you're interested in using it for the data and not here for the actual bug.
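If sed isn't available, the same repair can be sketched in Python. This is only a rough equivalent of the one-liner above, not something Lichess provides: the names `split_fused_games` and `repair_file` are made up, and the file is handled as raw bytes so that no assumption about its text encoding is needed.

```python
import re

# Bytes version of the sed pattern above: group 1 is the last character of
# the result token, group 2 is the start of the next game's Event tag.
FUSED_BOUNDARY = re.compile(rb'([012*])(\[Event )')


def split_fused_games(line):
    """Insert a newline between a result and an Event tag fused onto the same line."""
    return FUSED_BOUNDARY.sub(rb'\1\n\2', line)


def repair_file(src_path, dst_path):
    """Stream src_path to dst_path line by line, splitting fused games."""
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        for line in src:
            dst.write(split_fused_games(line))


print(split_fused_games(b'39. Rxd8 1-0[Event "Bundesliga 8990"]\n'))
# b'39. Rxd8 1-0\n[Event "Bundesliga 8990"]\n'
```

A call such as `repair_file('imports.pgn', 'imports-newlines.pgn')` would mirror the sed invocation. Like sed, it inserts a single newline rather than a blank line, so strict PGN parsers may still want an empty line added between games.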

Edit: Updated the regex for games that end with the result "*" (there are 4, all of which also suffer from this bug).

PPPS. What does `*` mean anyway?

ornicar commented 1 year ago

* means the game is not finished