nguyenpham / ocgdb

Open Chess Game Database Standard (OCGDB)
MIT License

Request option to detect doubles when games are included in other games. #26

Closed Jonathan003 closed 2 years ago

Jonathan003 commented 2 years ago

Screenshot found doubles with Scid

This is an example where Scid finds a duplicated game because one game is included in another game. At the moment ocgdb doesn't detect these doubles. Scid only detects this kind of double when the player names are identical. I would also like an option to detect these doubles when the player names are different or other information, such as the event, is different. Then I want ocgdb to keep the better game (longer game, more recent, higher Elo, longer time control, etc.). Technically they may not be exact doubles, but I don't see how it can be useful to keep games with exactly the same moves and exactly the same results together in a database. Doubles where one game is included in another happen quite often in human tournaments that use DGT boards: sometimes the feed gets an extra move, or a database feed like TWIC publishes a game one week and provides a correction the next week.

nguyenpham commented 2 years ago

That is doable.

However, it may be painfully slow. Without extra conditions (such as identical players), the app has to compare each game with all other games, so the complexity becomes O(n²), much larger than the O(n) of our current duplicate-check function.
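To illustrate the cost difference described above, here is a minimal Python sketch (not OCGDB's code). It assumes a hypothetical representation in which each game is a list of SAN move strings. Exact duplicates can be found in one pass with a hash set, while a naive embedded-game check compares every ordered pair of games.

```python
from typing import List, Sequence, Set, Tuple

Game = Sequence[str]  # hypothetical representation: a list of SAN moves


def exact_duplicates(games: List[Game]) -> List[int]:
    """Indices of games whose move list was already seen (one pass, hash set)."""
    seen: Set[Tuple[str, ...]] = set()
    dups: List[int] = []
    for i, g in enumerate(games):
        key = tuple(g)
        if key in seen:
            dups.append(i)
        else:
            seen.add(key)
    return dups


def is_embedded(short: Game, long: Game) -> bool:
    """True if `short` is a proper prefix of `long`."""
    return len(short) < len(long) and list(long[:len(short)]) == list(short)


def embedded_naive(games: List[Game]) -> List[Tuple[int, int]]:
    """All (embedded, container) index pairs, comparing every ordered pair."""
    pairs = []
    for i, a in enumerate(games):
        for j, b in enumerate(games):
            if i != j and is_embedded(a, b):
                pairs.append((i, j))
    return pairs


games = [
    ["e4", "e5", "Nf3"],
    ["e4", "e5", "Nf3", "Nc6"],
    ["d4", "d5"],
    ["d4", "d5"],
]
print(exact_duplicates(games))  # [3]
print(embedded_naive(games))    # [(0, 1)]
```

The nested loop performs n·(n−1) comparisons for n games, which is why an unrestricted pairwise check becomes slow on large databases.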

Jonathan003 commented 2 years ago

With Scid, finding these doubles, where games are included in other games, is very fast, even directly on a PGN file. Maybe it is fast because Scid requires the first 4 letters of the player names to be the same? If that is the problem, then maybe it would be useful to first search for doubles with ocgdb using the standard settings, and then run a second ocgdb search for games included in other games with the same player names, or with the same first four letters of the player names.

Screenshot scid find double

These are the settings when searching for doubles in Scid. I like these settings, except that there is no option to find doubles when the first 4 letters of the player names are different.

Jonathan003 commented 2 years ago

I did some tests with Scid. I removed all the player names for White and Black in a huge PGN database with Chess Assistant 20. Then I searched this PGN database for duplicates. It was still very fast, and doubles where games are included in other games were also detected.

nguyenpham commented 2 years ago

I have implemented the request. It can run very fast, almost as fast as the normal duplicate-check function.

Run it as: ocgdb -db big.ocgdb.db3 -dup -o embededgames -r report.txt

Committed: https://github.com/nguyenpham/ocgdb/commit/8726834462c0037d7b8fe97bc6228e28feca31db
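The thread does not say how the fast version works. One well-known way to avoid pairwise comparison, sketched here in Python as an assumption and not as OCGDB's actual algorithm, is to sort the move lists: if a game is a prefix of any other game, it is also a prefix of its immediate successor in sorted order, so a single pass over neighbours is enough. Games are again hypothetical lists of SAN move strings, and identical move lists are reported as embedded too.

```python
from typing import List, Sequence, Tuple


def embedded_by_sorting(games: List[Sequence[str]]) -> List[Tuple[int, int]]:
    """Return (embedded, container) index pairs, one per embedded game."""
    # Sort game indices by their move lists (lexicographic tuple order).
    order = sorted(range(len(games)), key=lambda i: tuple(games[i]))
    pairs = []
    for pos in range(len(order) - 1):
        a, b = order[pos], order[pos + 1]
        ga, gb = tuple(games[a]), tuple(games[b])
        # A game is embedded if its sorted successor starts with it.
        if len(ga) <= len(gb) and gb[:len(ga)] == ga:
            pairs.append((a, b))
    return pairs


games = [
    ["e4"],
    ["e4", "c5", "Nf3"],
    ["e4", "c5"],
    ["d4"],
]
print(embedded_by_sorting(games))  # [(0, 2), (2, 1)]
```

The dominant cost is the sort, so the work grows as n log n comparisons instead of n². Chains such as `e4` inside `e4 c5` inside `e4 c5 Nf3` are caught in the same pass.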

Jonathan003 commented 2 years ago

Thanks! Can I download the updated ocgdb tool somewhere to give it a try?

nguyenpham commented 2 years ago

We have released a new version, Beta 7: https://github.com/nguyenpham/ocgdb/releases/tag/VersionBeta7

Jonathan003 commented 2 years ago

I have tried it with ocgdb Beta 7

With these two games

[Event "Barcelona Ideal Clave op 21st"]
[Site "Barcelona"]
[Date "2017.10.07"]
[Round "2"]
[White "Rojas Nunez, Alberto"]
[Black "Lopez Gomez, Nicolas"]
[Result "0-1"]
[WhiteElo "1276"]
[BlackElo "1526"]
[EventDate "2017.09.30"]
[PlyCount "54"]
[EventType "swiss"]
[EventRounds "8"]
[EventCountry "ESP"]
[SourceTitle "CBM 181 Extra"]
[Source "ChessBase"]
[SourceDate "2017.12.12"]
[SourceVersion "1"]
[SourceVersionDate "2017.12.12"]
[SourceQuality "1"]

1.e4 e5 2.Nf3 Nc6 3.d4 exd4 4.Nxd4 Ne5 5.Nc3 Bb4 6.a3 Bxc3+ 7.bxc3 d6 8. Bd3 Nf6 9.f4 Nxd3+ 10.cxd3 O-O 11.O-O Re8 12.f5 d5 13.e5 Rxe5 14.Qf3 c5 15.Bf4 Re7 16.Ne2 Bxf5 17.Ng3 Bg4 18.Qf2 b6 19.h3 Bh5 20.Rae1 Bg6 21.Qf3 Qd7 22.c4 Rae8 23.Rd1 d4 24.h4 h6 25.Bxh6 Ng4 26.Bg5 f6 27.Bf4 Ne3 0-1

[Event "Barcelona Ideal Clave op 21st"]
[Site "Barcelona"]
[Date "2017.10.07"]
[Round "2"]
[White "Rojas Nunez, Alberto"]
[Black "Lopez Gomez, Nicolas"]
[Result "0-1"]
[WhiteElo "1276"]
[BlackElo "1526"]
[EventDate "2017.09.30"]
[PlyCount "58"]
[EventType "swiss"]
[EventRounds "8"]
[EventCountry "ESP"]
[SourceTitle "CBM 181 Extra"]
[Source "ChessBase"]
[SourceDate "2017.12.12"]
[SourceVersion "1"]
[SourceVersionDate "2017.12.12"]
[SourceQuality "1"]

1.e4 e5 2.Nf3 Nc6 3.d4 exd4 4.Nxd4 Ne5 5.Nc3 Bb4 6.a3 Bxc3+ 7.bxc3 d6 8. Bd3 Nf6 9.f4 Nxd3+ 10.cxd3 O-O 11.O-O Re8 12.f5 d5 13.e5 Rxe5 14.Qf3 c5 15.Bf4 Re7 16.Ne2 Bxf5 17.Ng3 Bg4 18.Qf2 b6 19.h3 Bh5 20.Rae1 Bg6 21.Qf3 Qd7 22.c4 Rae8 23.Rd1 d4 24.h4 h6 25.Bxh6 Ng4 26.Bg5 f6 27.Bf4 Ne3 28.h5 Bh7 29.Bxe3 Rxe3 0-1

I used these two commands:

ocgdb -pgn one_embeded_game.pgn -db one_embeded_game.db3 -cpu 4 -o moves2;discardcomments;discardsites;discardfen;reseteco -plycount 40

ocgdb -db one_embeded_game.db3 -dup -o embededgames -r report.txt

ocgdb Beta 7 didn't find the embedded double. Or maybe I am doing something wrong?

Screenshot search for embedded game

nguyenpham commented 2 years ago

Thanks for the report.

The bug is fixed by the commit https://github.com/nguyenpham/ocgdb/commit/1a12d17cda25360c329169e601f0201f8d1c333e

(It is a funny bug: after testing, I cleaned up the code and accidentally removed some of the new code.)

The executable can be downloaded from the link below:

Jonathan003 commented 2 years ago

Thanks for the fix. When searching for all duplicates, including embedded duplicates, in one go, is it possible to create a report for only the embedded duplicates that were found? I think the embedded duplicates need the most attention, because you have to decide manually which duplicate to keep.

Jonathan003 commented 2 years ago

I tested it out with this small example database with duplicates: 60_games_with_doubles

I used these 3 commands one after the other:

ocgdb -pgn 60_games_with_doubles.pgn -db 60_games_with_doubles.db3 -cpu 4 -o moves2;discardcomments;discardsites;discardfen;reseteco

ocgdb -db 60_games_with_doubles.db3 -cpu 4 -dup -o printall;embededgames;remove -r report.txt

ocgdb -pgn 60_games_with_doubles_out.pgn -db 60_games_with_doubles.db3 -cpu 4 -export

ocgdb Beta 7 did not find this included duplicate on the first run.

[Event "Rated Blitz game"]
[Site ""]
[Date "2019.04.21"]
[Round ""]
[White "FischersFrisoer"]
[Black "MarkoMakaj"]
[Result "1/2-1/2"]
[WhiteElo "2333"]
[BlackElo "2372"]
[ECO "E87"]
[TimeControl "180"]
[PlyCount "24"]

1.d4 Nf6 2.c4 g6 3.Nc3 Bg7 4.e4 d6 5.f3 O-O 6.Be3 e5 7.d5 Nh5 8.Qd2 Qh4+ 9.Bf2 Qf4 10.Be3 Qh4+ 11.Bf2 Qf4 12.Be3 Qh4+ 1/2-1/2

[Event "Rated Blitz game"]
[Site ""]
[Date "2018.09.27"]
[Round ""]
[White "FischersFrisoer"]
[Black "toramal"]
[Result "1/2-1/2"]
[WhiteElo "2277"]
[BlackElo "2213"]
[ECO "E87"]
[TimeControl "180"]
[PlyCount "25"]

1.d4 Nf6 2.c4 g6 3.Nc3 Bg7 4.e4 d6 5.f3 O-O 6.Be3 e5 7.d5 Nh5 8.Qd2 Qh4+ 9.Bf2 Qf4 10.Be3 Qh4+ 11.Bf2 Qf4 12.Be3 Qh4+ 13.Bf2 1/2-1/2

When I ran the same commands again with these two games, ocgdb found the embedded duplicate.

Jonathan003 commented 2 years ago

I did some other tests to see how well the detection of embedded duplicates works with ocgdb Beta 7.

I ran the commands on this PGN file: imbedded_duplicates

I ran these commands one after the other:

ocgdb -pgn imbedded_duplicates.pgn -db imbedded_duplicates.db3 -cpu 4 -o moves2;discardcomments;discardsites;discardfen;reseteco

ocgdb -db imbedded_duplicates.db3 -cpu 4 -dup -o printall;embededgames;remove -r report.txt

ocgdb -pgn imbedded_duplicates_out.pgn -db imbedded_duplicates.db3 -cpu 4 -export

ocgdb didn't detect these embedded duplicates:

not_find_duplicates

nguyenpham commented 2 years ago

Thanks for the report!

At the moment the program finds duplicates roughly as follows: it compares each game with all other games to find a match (full or partial). If a match is found, it stops for that game and does the reporting/removing.

That means a game that has already been matched to one other game may not be compared any further, so additional matches with the remaining games are not found.

Your test database is quite special in that a game can be embedded inside many other games. Thus OCGDB may detect the first embedded duplicate of a game but not the others.
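One possible reading of this behaviour, written as a hypothetical Python model and not taken from the OCGDB source, is a pass that stops at the first match and then skips both games of the matched pair. Under that assumption, a chain of embedded games needs more than one run:

```python
from typing import List, Sequence


def one_pass(games: List[Sequence[str]]) -> List[int]:
    """Indices removed by a single pass that stops at the first match."""
    matched = set()  # games already paired during this pass (assumed behaviour)
    removed = []
    for i, a in enumerate(games):
        if i in matched:
            continue
        for j, b in enumerate(games):
            if j == i or j in matched:
                continue
            # `a` is embedded in `b` when it is a proper prefix of `b`.
            if len(a) < len(b) and list(b[:len(a)]) == list(a):
                matched.update((i, j))
                removed.append(i)
                break  # stop looking for further matches of this game
    return removed


games = [["e4"], ["e4", "c5"], ["e4", "c5", "Nf3"]]
first = one_pass(games)
remaining = [g for k, g in enumerate(games) if k not in first]
second = one_pass(remaining)
print(first)   # [0]
print(second)  # [0], i.e. "e4 c5" is only removed on the second run
```

In this model the middle game of the chain is skipped in the first pass because it was already paired as a container, which matches the observation that repeated runs find more embedded duplicates.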

Jonathan003 commented 2 years ago

So if I run this command multiple times, will I eventually be able to remove all the embedded duplicates?

ocgdb -db imbedded_duplicates.db3 -cpu 4 -dup -o printall;embededgames;remove -r report.txt

I created the PGN by importing the best moves for one color from a bin book into opening lines in Lucas Chess. If I remove every last move (1 ply) when it is a Black move and then export the opening lines to PGN, many embedded duplicates are included. SCID has no problem detecting all these embedded duplicates in one go and removing them. I hope this will be possible with OCGDB too in future updates.

nguyenpham commented 2 years ago

I have just improved the code. It can now detect and remove all duplicates in a single run.

Jonathan003 commented 2 years ago

I have tried beta7c with the example PGN database imbedded_duplicates

Something goes wrong, because almost all games get removed except these two:

[Event "imbedded duplicates"]
[PlyCount "5"]
[ECO "B28"]
[White ""]
[Result ""]
[Black ""]
[Round ""]
[Date ""]
[Site ""]

  1. e4 c5 2. Nf3 a6 3. c3 *

[Event "imbedded duplicates"]
[PlyCount "1"]
[Site ""]
[White ""]
[Result ""]
[Black ""]
[Round ""]
[Date ""]

  1. e4 *

I ran these commands one after the other:

ocgdb -pgn imbedded_duplicates.pgn -db imbedded_duplicates.db3 -cpu 4 -o moves2;discardcomments;discardsites;discardfen;reseteco

ocgdb -db imbedded_duplicates.db3 -cpu 4 -dup -o printall;embededgames;remove -r report.txt

ocgdb -pgn imbedded_duplicates_out.pgn -db imbedded_duplicates.db3 -cpu 4 -export

If I search the example PGN database 'imbedded_duplicates.pgn' for duplicates with SCID, it finds 43 duplicates.

nguyenpham commented 2 years ago

On my computer, the app removed all games except one. Tried several times.

I guess there may be a conflict between threads on your computer: say, two threads pick up two games at the same time, one of which is embedded in the other, and they prevent each other from being deleted. (Actually I have considered all cases and the code/logic should cover them, so I am not really sure what happened.)

Perhaps users should do another run, just for extreme cases.

I am still trying to reproduce the problem and check the code.

Jonathan003 commented 2 years ago

I don't understand what you are saying. On my computer, too, all games got removed except the two games I listed. That is the problem: many games that are neither duplicates nor embedded duplicates also get removed by ocgdb Beta 7c. Only the 43 duplicates or embedded duplicates should be removed, as when you remove the duplicates with SCID. Maybe I gave the example PGN database a misleading name, 'imbedded_duplicates.pgn'. A better name would be '311_games_with_43_embedded_duplicates.pgn'.

I like SCID for removing duplicates and embedded duplicates, except that it doesn't work when the first 4 letters of the player names differ. SCID also has problems handling huge PGN databases with more than 20 million games. OCGDB looks promising to me. I hope you can fix the issues.

Keep up the good work!

nguyenpham commented 2 years ago

OK, I have found some weak points in the algorithms for checking embedded games. Hope it works now:

ocgdb-WinMac-beta7d.zip

Jonathan003 commented 2 years ago

Thanks for the fix! I have tested it with the example database, and ocgdb Beta 7d finds 43 embedded duplicates, the same as SCID.

Jonathan003 commented 2 years ago

I did another small test with ocgdb Beta 7d. I used this example PGN: 24_games_1_embedded_duplicate

I typed these commands one after the other:

ocgdb -pgn 24_games_1_embedded_duplicate.pgn -db 24_games_1_embedded_duplicate.db3 -cpu 4 -o moves2;discardcomments;discardsites;discardfen;reseteco

ocgdb -db 24_games_1_embedded_duplicate.db3 -cpu 4 -dup -o printall;embededgames;remove -r report.txt

ocgdb -pgn 24_games_1_embedded_duplicate_out.pgn -db 24_games_1_embedded_duplicate.db3 -cpu 4 -export

ocgdb Beta 7d did not find this embedded duplicate:

[Event "Rated Blitz game"]
[White "FischersFrisoer"]
[BlackElo "2303"]
[Result "1/2-1/2"]
[Black "MarkoMakaj"]
[Date "2019.03.06"]
[TimeControl "180"]
[WhiteElo "2366"]
[ECO "E87"]
[PlyCount "25"]
[Round ""]
[Site ""]

1.d4 Nf6 2.c4 g6 3.Nc3 Bg7 4.e4 d6 5.f3 O-O 6.Be3 e5 7.d5 Nh5 8.Qd2 Qh4+ 9.Bf2 Qf4 10.Be3 Qh4+ 11.Bf2 Qf4 12.Be3 Qh4+ 13.Bf2 1/2-1/2

[Event "Rated Blitz game"]
[White "FischersFrisoer"]
[BlackElo "2245"]
[Result "1/2-1/2"]
[Black "toramal"]
[Date "2018.05.08"]
[TimeControl "180"]
[WhiteElo "2323"]
[ECO "E87"]
[PlyCount "24"]
[Round ""]
[Site ""]

1.d4 Nf6 2.c4 g6 3.Nc3 Bg7 4.e4 d6 5.f3 O-O 6.Be3 e5 7.d5 Nh5 8.Qd2 Qh4+ 9.Bf2 Qf4 10.Be3 Qh4+ 11.Bf2 Qf4 12.Be3 Qh4+ 1/2-1/2

nguyenpham commented 2 years ago

Thanks for the report.

The bug is caused by a conflict between threads (two embedded games may be processed concurrently). The temporary workaround is to set -cpu 1; the program then runs correctly.
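A common way to sidestep this kind of conflict, shown here as a hedged Python sketch and not as what OCGDB actually does, is to let worker threads perform only read-only detection and to decide the removals afterwards in a single thread. The function names and the list-of-moves game representation are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Sequence


def find_container(idx: int, games: List[Sequence[str]]) -> Optional[int]:
    """Read-only check: index of a game that contains games[idx], else None."""
    a = list(games[idx])
    for j, b in enumerate(games):
        if j != idx and len(a) < len(b) and list(b[:len(a)]) == a:
            return j
    return None


def embedded_indices(games: List[Sequence[str]], workers: int = 4) -> List[int]:
    """Detect in parallel, then collect the removal list in one thread."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the input order, so results line up with game indices.
        containers = list(pool.map(lambda i: find_container(i, games),
                                   range(len(games))))
    return [i for i, c in enumerate(containers) if c is not None]


games = [["d4", "Nf6"], ["d4", "Nf6", "c4"], ["e4"]]
print(embedded_indices(games, workers=1))  # [0]
print(embedded_indices(games, workers=4))  # [0]
```

Because no thread modifies shared state during detection, the result does not depend on the number of workers. This sketch only handles proper prefixes; exact duplicates would need a separate rule for which copy to keep.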

Jonathan003 commented 2 years ago

It looks like if I leave the cpu setting at 4 but use the option moves1 instead of moves2, the embedded duplicate also gets detected:

ocgdb -pgn 24_games_1_embedded_duplicate.pgn -db 24_games_1_embedded_duplicate.db3 -cpu 4 -o moves1;discardcomments;discardsites;discardfen;reseteco

Or would that be a coincidence? What is the difference between moves1 and moves2? Will all options of ocgdb Beta 7d also work with the option moves1?

nguyenpham commented 2 years ago

Yes, any change may make the processing a little different and remove the chance of two embedded games being processed concurrently.

Changing the -cpu parameter, say from 4 to 3, can help too.

Moves1 encodes a move into one byte, whereas Moves2 uses two bytes. Moves1 helps produce smaller databases. However, the encoding/decoding code for Moves1 is quite complicated, and other programs have almost no choice but to integrate our code (the OCGDB code). That may be a big problem if other programs use a different programming language. Integrating code with other libraries/programs is typically not easy and takes some work. In contrast, the code for Moves2 is quite simple and straightforward, using Stockfish board coordinates. Any program can implement the encoding/decoding itself without integrating OCGDB's code. Thus we encourage people to use Moves2 because it is easier to share and publish. Moves1 may be good for private use and research.
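To show why a two-byte encoding is easy to reimplement, here is a small Python sketch. It numbers squares a1 = 0, b1 = 1, ..., h8 = 63 (the square numbering Stockfish uses) and packs the from-square, to-square and promotion piece into 16 bits. The exact bit layout, byte order and promotion codes below are assumptions chosen for illustration; the thread does not specify the real Moves2 layout.

```python
from typing import Optional, Tuple

# Hypothetical promotion codes; not taken from the OCGDB specification.
PROMO = {None: 0, "n": 1, "b": 2, "r": 3, "q": 4}
PROMO_INV = {v: k for k, v in PROMO.items()}


def square_index(name: str) -> int:
    """Map 'a1'..'h8' to 0..63 with a1 = 0, b1 = 1, ..., h8 = 63."""
    return (ord(name[0]) - ord("a")) + 8 * (int(name[1]) - 1)


def square_name(index: int) -> str:
    """Inverse of square_index."""
    return chr(ord("a") + index % 8) + str(index // 8 + 1)


def encode_move(src: str, dst: str, promo: Optional[str] = None) -> bytes:
    """Pack a move into two bytes: 6 bits from, 6 bits to, 4 bits promotion."""
    value = square_index(src) | (square_index(dst) << 6) | (PROMO[promo] << 12)
    return value.to_bytes(2, "little")


def decode_move(data: bytes) -> Tuple[str, str, Optional[str]]:
    """Unpack the two bytes produced by encode_move."""
    value = int.from_bytes(data, "little")
    return (square_name(value & 0x3F),
            square_name((value >> 6) & 0x3F),
            PROMO_INV[value >> 12])


print(decode_move(encode_move("e2", "e4")))       # ('e2', 'e4', None)
print(decode_move(encode_move("e7", "e8", "q")))  # ('e7', 'e8', 'q')
```

Nothing here depends on board state, which is what makes a coordinate-based format simple to port to other languages. A one-byte format cannot store two 6-bit squares, so it has to rely on more involved, position-dependent encoding.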