niklasf / python-chess

A chess library for Python, with move generation and validation, PGN parsing and writing, Polyglot opening book reading, Gaviota tablebase probing, Syzygy tablebase probing, and UCI/XBoard engine communication
https://python-chess.readthedocs.io/en/latest/
GNU General Public License v3.0

`chess.pgn.read_headers` inserts empty header entries related to newlines and empty movetext #1087

Open MatijaSi opened 1 month ago

MatijaSi commented 1 month ago

I am trying to parse a largish PGN file (7,000,000 games) using `chess.pgn.read_headers`. However, it only scanned 84,039 games before stopping as if it had reached the end of the file (no error message).

I managed to narrow it down to this testcase:

testcase = """
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Li Chao2"]
[Black "Gashimov,V"]
[Result "0-1"]
[WhiteElo "2596"]
[BlackElo "2758"]
[EventDate "2009.11.21"]
[HashCode "00000000"]
[TotalPlyCount "0"]

{ Both Chinese players were late to the board for game two and were
defaulted }

0-1

[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Naiditsch,A"]
[Black "Svidler,P"]
[Result "1/2-1/2"]
[WhiteElo "2689"]
[BlackElo "2754"]
[ECO "C97"]
[Opening "Ruy Lopez"]
[Variation "closed, Chigorin defence"]
[EventDate "2009.11.21"]
[HashCode "312a25cf"]
[TotalPlyCount "121"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8.
c3 d6 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12. d5 c4 13. Nbd2 Nb7 14. Nf1 Nc5
15. Kh1 Bd7 16. Ng3 Ne8 17. Nh2 Bh4 18. Rf1 Qd8 19. Nf3 Bxg3 20. fxg3 f5
21. Ng5 h6 22. Ne6 Bxe6 23. dxe6 fxe4 24. Rxf8+ Kxf8 25. Qh5 Qf6 26. Be3
Nd3 27. Qg4 Nc7 28. Qxe4 Qxe6 29. Qh7 Qg8 30. Bxd3 cxd3 31. Qxd3 Qe6 32.
Qh7 Qg8 33. Rf1+ Ke7 34. Qg6 Rf8 35. Rd1 Rf6 36. Qg4 g5 37. Kh2 Qxa2 38.
Qc8 Ne8 39. Bb6 Kf8 40. Bd8 Rf7 41. h4 gxh4 42. Bxh4 Qc4 43. Qxa6 Kg7 44.
Qa8 Qg4 45. Ra1 Qd7 46. Qb8 Qc6 47. b4 Kg6 48. Ra8 Ng7 49. Ra2 Qd5 50. Ra6
Rd7 51. Rb6 Nf5 52. Rxb5 Qf7 53. Rb6 Kh7 54. Qa8 Nxh4 55. gxh4 Qf4+ 56. Kh1
Qxh4+ 57. Kg1 Rg7 58. Qf3 Qe1+ 59. Kh2 Qh4+ 60. Kg1 Qe1+ 61. Kh2 1/2-1/2
"""

import io

import chess.pgn

f = io.StringIO(testcase)

while True:
    headers = chess.pgn.read_headers(f)
    print(headers)

    if not headers:
        break

Which prints:

Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Li Chao2', Black='Gashimov,V', Result='0-1', WhiteElo='2596', BlackElo='2758', EventDate='2009.11.21', HashCode='00000000', TotalPlyCount='0')
Headers()

MatijaSi commented 1 month ago

Investigating a bit further, there seems to be some issue related to newlines between games.

For example:

testcase = """
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Li Chao2"]
[Black "Gashimov,V"]
[Result "0-1"]
[WhiteElo "2596"]
[BlackElo "2758"]
[EventDate "2009.11.21"]
[HashCode "00000000"]
[TotalPlyCount "0"]

{ Both Chinese players were late to the board for game two and were
defaulted }

0-1

[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Naiditsch,A"]
[Black "Svidler,P"]
[Result "1/2-1/2"]
[WhiteElo "2689"]
[BlackElo "2754"]
[ECO "C97"]
[Opening "Ruy Lopez"]
[Variation "closed, Chigorin defence"]
[EventDate "2009.11.21"]
[HashCode "312a25cf"]
[TotalPlyCount "121"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8.
c3 d6 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12. d5 c4 13. Nbd2 Nb7 14. Nf1 Nc5
15. Kh1 Bd7 16. Ng3 Ne8 17. Nh2 Bh4 18. Rf1 Qd8 19. Nf3 Bxg3 20. fxg3 f5
21. Ng5 h6 22. Ne6 Bxe6 23. dxe6 fxe4 24. Rxf8+ Kxf8 25. Qh5 Qf6 26. Be3
Nd3 27. Qg4 Nc7 28. Qxe4 Qxe6 29. Qh7 Qg8 30. Bxd3 cxd3 31. Qxd3 Qe6 32.
Qh7 Qg8 33. Rf1+ Ke7 34. Qg6 Rf8 35. Rd1 Rf6 36. Qg4 g5 37. Kh2 Qxa2 38.
Qc8 Ne8 39. Bb6 Kf8 40. Bd8 Rf7 41. h4 gxh4 42. Bxh4 Qc4 43. Qxa6 Kg7 44.
Qa8 Qg4 45. Ra1 Qd7 46. Qb8 Qc6 47. b4 Kg6 48. Ra8 Ng7 49. Ra2 Qd5 50. Ra6
Rd7 51. Rb6 Nf5 52. Rxb5 Qf7 53. Rb6 Kh7 54. Qa8 Nxh4 55. gxh4 Qf4+ 56. Kh1
Qxh4+ 57. Kg1 Rg7 58. Qf3 Qe1+ 59. Kh2 Qh4+ 60. Kg1 Qe1+ 61. Kh2 1/2-1/2
"""

import io

import chess.pgn

f = io.StringIO(testcase)

games = []
while True:
    headers = chess.pgn.read_headers(f)
    games.append(headers)

    # read_headers returns None once the input is exhausted
    if headers is None:
        break

This leads to `games` being (note the empty `Headers()` between the two "real" games):

[Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Li Chao2', Black='Gashimov,V', Result='0-1', WhiteElo='2596', BlackElo='2758', EventDate='2009.11.21', HashCode='00000000', TotalPlyCount='0'),
 Headers(),
 Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Naiditsch,A', Black='Svidler,P', Result='1/2-1/2', WhiteElo='2689', BlackElo='2754', ECO='C97', Opening='Ruy Lopez', Variation='closed, Chigorin defence', EventDate='2009.11.21', HashCode='312a25cf', TotalPlyCount='121'),
 None]

With the test case from the original issue:

testcase = """
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Li Chao2"]
[Black "Gashimov,V"]
[Result "0-1"]
[WhiteElo "2596"]
[BlackElo "2758"]
[EventDate "2009.11.21"]
[HashCode "00000000"]
[TotalPlyCount "0"]

{ Both Chinese players were late to the board for game two and were
defaulted }

0-1

[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Naiditsch,A"]
[Black "Svidler,P"]
[Result "1/2-1/2"]
[WhiteElo "2689"]
[BlackElo "2754"]
[ECO "C97"]
[Opening "Ruy Lopez"]
[Variation "closed, Chigorin defence"]
[EventDate "2009.11.21"]
[HashCode "312a25cf"]
[TotalPlyCount "121"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8.
c3 d6 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12. d5 c4 13. Nbd2 Nb7 14. Nf1 Nc5
15. Kh1 Bd7 16. Ng3 Ne8 17. Nh2 Bh4 18. Rf1 Qd8 19. Nf3 Bxg3 20. fxg3 f5
21. Ng5 h6 22. Ne6 Bxe6 23. dxe6 fxe4 24. Rxf8+ Kxf8 25. Qh5 Qf6 26. Be3
Nd3 27. Qg4 Nc7 28. Qxe4 Qxe6 29. Qh7 Qg8 30. Bxd3 cxd3 31. Qxd3 Qe6 32.
Qh7 Qg8 33. Rf1+ Ke7 34. Qg6 Rf8 35. Rd1 Rf6 36. Qg4 g5 37. Kh2 Qxa2 38.
Qc8 Ne8 39. Bb6 Kf8 40. Bd8 Rf7 41. h4 gxh4 42. Bxh4 Qc4 43. Qxa6 Kg7 44.
Qa8 Qg4 45. Ra1 Qd7 46. Qb8 Qc6 47. b4 Kg6 48. Ra8 Ng7 49. Ra2 Qd5 50. Ra6
Rd7 51. Rb6 Nf5 52. Rxb5 Qf7 53. Rb6 Kh7 54. Qa8 Nxh4 55. gxh4 Qf4+ 56. Kh1
Qxh4+ 57. Kg1 Rg7 58. Qf3 Qe1+ 59. Kh2 Qh4+ 60. Kg1 Qe1+ 61. Kh2 1/2-1/2
"""

import io

import chess.pgn

f = io.StringIO(testcase)

games = []
while True:
    headers = chess.pgn.read_headers(f)
    games.append(headers)

    # read_headers returns None once the input is exhausted
    if headers is None:
        break

This leads to `games` being (again with several empty entries):

[Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Li Chao2', Black='Gashimov,V', Result='0-1', WhiteElo='2596', BlackElo='2758', EventDate='2009.11.21', HashCode='00000000', TotalPlyCount='0'),
 Headers(),
 Headers(),
 Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Naiditsch,A', Black='Svidler,P', Result='1/2-1/2', WhiteElo='2689', BlackElo='2754', ECO='C97', Opening='Ruy Lopez', Variation='closed, Chigorin defence', EventDate='2009.11.21', HashCode='312a25cf', TotalPlyCount='121'),
 Headers(),
 None]

So the code in my original report is slightly wrong: it checks whether `headers` is falsy:

if not headers:
    break

instead of comparing it to `None`:

if headers is None:
    break
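
The difference matters because an empty mapping is falsy without being `None`. A minimal sketch of that distinction, using a plain `dict` as a stand-in for an empty `Headers()` (the stand-in is an assumption for illustration; the early exit in the first loop suggests `Headers` follows the same truthiness rule):

```python
# A plain dict stands in for an empty chess.pgn.Headers() (assumption for illustration).
empty_headers = {}
end_of_input = None

# A truthiness test cannot tell the two apart ...
falsy_check = [not empty_headers, not end_of_input]

# ... but an identity check against None can.
none_check = [empty_headers is None, end_of_input is None]

print(falsy_check)  # [True, True]
print(none_check)   # [False, True]
```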

However, this is probably still a bug in the library, since an empty line probably shouldn't be treated as an empty game. Additionally, it seems to be related to the movetext being empty: if we provide movetext, we get a different result:

testcase = """
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Li Chao2"]
[Black "Gashimov,V"]
[Result "0-1"]
[WhiteElo "2596"]
[BlackElo "2758"]
[EventDate "2009.11.21"]
[HashCode "00000000"]
[TotalPlyCount "0"]

1. e4 e5 0-1

[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Naiditsch,A"]
[Black "Svidler,P"]
[Result "1/2-1/2"]
[WhiteElo "2689"]
[BlackElo "2754"]
[ECO "C97"]
[Opening "Ruy Lopez"]
[Variation "closed, Chigorin defence"]
[EventDate "2009.11.21"]
[HashCode "312a25cf"]
[TotalPlyCount "121"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8.
c3 d6 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12. d5 c4 13. Nbd2 Nb7 14. Nf1 Nc5
15. Kh1 Bd7 16. Ng3 Ne8 17. Nh2 Bh4 18. Rf1 Qd8 19. Nf3 Bxg3 20. fxg3 f5
21. Ng5 h6 22. Ne6 Bxe6 23. dxe6 fxe4 24. Rxf8+ Kxf8 25. Qh5 Qf6 26. Be3
Nd3 27. Qg4 Nc7 28. Qxe4 Qxe6 29. Qh7 Qg8 30. Bxd3 cxd3 31. Qxd3 Qe6 32.
Qh7 Qg8 33. Rf1+ Ke7 34. Qg6 Rf8 35. Rd1 Rf6 36. Qg4 g5 37. Kh2 Qxa2 38.
Qc8 Ne8 39. Bb6 Kf8 40. Bd8 Rf7 41. h4 gxh4 42. Bxh4 Qc4 43. Qxa6 Kg7 44.
Qa8 Qg4 45. Ra1 Qd7 46. Qb8 Qc6 47. b4 Kg6 48. Ra8 Ng7 49. Ra2 Qd5 50. Ra6
Rd7 51. Rb6 Nf5 52. Rxb5 Qf7 53. Rb6 Kh7 54. Qa8 Nxh4 55. gxh4 Qf4+ 56. Kh1
Qxh4+ 57. Kg1 Rg7 58. Qf3 Qe1+ 59. Kh2 Qh4+ 60. Kg1 Qe1+ 61. Kh2 1/2-1/2
"""

import io

import chess.pgn

f = io.StringIO(testcase)

games = []
while True:
    headers = chess.pgn.read_headers(f)
    games.append(headers)

    # read_headers returns None once the input is exhausted
    if headers is None:
        break

This leads to (note that there is now no `Headers()` between the games, but an extra one is still appended at the end):

[Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Li Chao2', Black='Gashimov,V', Result='0-1', WhiteElo='2596', BlackElo='2758', EventDate='2009.11.21', HashCode='00000000', TotalPlyCount='0'),
 Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Naiditsch,A', Black='Svidler,P', Result='1/2-1/2', WhiteElo='2689', BlackElo='2754', ECO='C97', Opening='Ruy Lopez', Variation='closed, Chigorin defence', EventDate='2009.11.21', HashCode='312a25cf', TotalPlyCount='121'),
 Headers(),
 None]
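
Until the library behavior is clarified, one possible workaround is to stop only on `None` and skip any empty header sets in between. The sketch below is not part of python-chess: `iter_nonempty_headers` is a hypothetical helper name, and the reader is passed in as a callable so that `chess.pgn.read_headers` can be supplied by the caller.

```python
from typing import Any, Callable, Iterator, Optional, TypeVar

H = TypeVar("H")


def iter_nonempty_headers(
    handle: Any,
    read_headers: Callable[[Any], Optional[H]],
) -> Iterator[H]:
    """Yield header sets from handle, skipping empty ones (hypothetical helper)."""
    while True:
        headers = read_headers(handle)
        if headers is None:
            # None marks the real end of input.
            return
        if not headers:
            # Empty header set (the spurious Headers() entries): skip it.
            continue
        yield headers


# Intended usage (requires python-chess):
#   import chess.pgn
#   for headers in iter_nonempty_headers(f, chess.pgn.read_headers):
#       print(headers)
```

One caveat of this design: a game that really has movetext but no tags at all would also be skipped, so it only suits inputs where every game carries headers.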