Closed Jonathan003 closed 2 years ago
That is doable.
However, it may take a lot of time and be painfully slow. The reason is that without extra conditions (such as requiring the same players), the app has to compare each game with all other games, so the complexity becomes O(n^2), much larger than the O(n) of our current duplicate-check function.
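To make the cost difference concrete, here is a minimal Python sketch. It is not OCGDB's code: the function names and the list-of-moves representation are assumptions made for illustration. The first function compares every game with every other game, which is quadratic in the number of games. The second indexes each game once by its moves and then looks up the prefixes of each game, so the number of lookups grows with the number of games times their length rather than with the square of the number of games.

```python
from typing import Dict, List, Set, Tuple

Game = List[str]  # a game as a list of SAN moves (illustrative representation)


def embedded_pairs_naive(games: List[Game]) -> Set[Tuple[int, int]]:
    """Compare every game with every other game: O(n^2) comparisons."""
    found = set()
    for i, short in enumerate(games):
        for j, long in enumerate(games):
            # Game i is embedded in game j if it is a proper, non-empty prefix.
            if i != j and 0 < len(short) < len(long) and long[: len(short)] == short:
                found.add((i, j))
    return found


def embedded_pairs_indexed(games: List[Game]) -> Set[Tuple[int, int]]:
    """Index whole games by their moves, then look up each game's prefixes."""
    by_moves: Dict[Tuple[str, ...], List[int]] = {}
    for i, game in enumerate(games):
        by_moves.setdefault(tuple(game), []).append(i)

    found = set()
    for j, game in enumerate(games):
        for k in range(1, len(game)):  # proper, non-empty prefixes only
            for i in by_moves.get(tuple(game[:k]), []):
                found.add((i, j))
    return found
```

Both functions return pairs (i, j) meaning "game i is embedded in game j"; they ignore player names and other tags entirely.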
With Scid, finding these doubles, where one game is included in another, is very fast, even directly on a PGN file. Maybe it is fast because Scid requires the first 4 letters of the player names to be the same? If that's the problem, then maybe it would be useful to first search for doubles with ocgdb using the standard settings, and then do a second search with ocgdb for games included in other games with the same player names, or with the same first four letters of the player names.
These are the settings when searching for doubles in Scid. I like these settings, except that there is no option to find doubles when the first 4 letters of the player names are different.
I did some tests with Scid. I removed all the White and Black player names in a huge PGN database with Chess Assistant 20. Then I searched this PGN database for duplicates. It was still very fast, and doubles where one game is included in another were also detected.
I have implemented the request. It can run very fast, almost as fast as the normal duplicate-check function.
Run it as:
ocgdb -db big.ocgdb.db3 -dup -o embededgames -r report.txt
Committed: https://github.com/nguyenpham/ocgdb/commit/8726834462c0037d7b8fe97bc6228e28feca31db
Thanks! Can I download the updated ocgdb tool somewhere to give it a try?
We have released a new version Beta 7 https://github.com/nguyenpham/ocgdb/releases/tag/VersionBeta7
I have tried it with ocgdb Beta 7
With these two games
[Event "Barcelona Ideal Clave op 21st"] [Site "Barcelona"] [Date "2017.10.07"] [Round "2"] [White "Rojas Nunez, Alberto"] [Black "Lopez Gomez, Nicolas"] [Result "0-1"] [WhiteElo "1276"] [BlackElo "1526"] [EventDate "2017.09.30"] [PlyCount "54"] [EventType "swiss"] [EventRounds "8"] [EventCountry "ESP"] [SourceTitle "CBM 181 Extra"] [Source "ChessBase"] [SourceDate "2017.12.12"] [SourceVersion "1"] [SourceVersionDate "2017.12.12"] [SourceQuality "1"]
1.e4 e5 2.Nf3 Nc6 3.d4 exd4 4.Nxd4 Ne5 5.Nc3 Bb4 6.a3 Bxc3+ 7.bxc3 d6 8. Bd3 Nf6 9.f4 Nxd3+ 10.cxd3 O-O 11.O-O Re8 12.f5 d5 13.e5 Rxe5 14.Qf3 c5 15.Bf4 Re7 16.Ne2 Bxf5 17.Ng3 Bg4 18.Qf2 b6 19.h3 Bh5 20.Rae1 Bg6 21.Qf3 Qd7 22.c4 Rae8 23.Rd1 d4 24.h4 h6 25.Bxh6 Ng4 26.Bg5 f6 27.Bf4 Ne3 0-1
[Event "Barcelona Ideal Clave op 21st"] [Site "Barcelona"] [Date "2017.10.07"] [Round "2"] [White "Rojas Nunez, Alberto"] [Black "Lopez Gomez, Nicolas"] [Result "0-1"] [WhiteElo "1276"] [BlackElo "1526"] [EventDate "2017.09.30"] [PlyCount "58"] [EventType "swiss"] [EventRounds "8"] [EventCountry "ESP"] [SourceTitle "CBM 181 Extra"] [Source "ChessBase"] [SourceDate "2017.12.12"] [SourceVersion "1"] [SourceVersionDate "2017.12.12"] [SourceQuality "1"]
1.e4 e5 2.Nf3 Nc6 3.d4 exd4 4.Nxd4 Ne5 5.Nc3 Bb4 6.a3 Bxc3+ 7.bxc3 d6 8. Bd3 Nf6 9.f4 Nxd3+ 10.cxd3 O-O 11.O-O Re8 12.f5 d5 13.e5 Rxe5 14.Qf3 c5 15.Bf4 Re7 16.Ne2 Bxf5 17.Ng3 Bg4 18.Qf2 b6 19.h3 Bh5 20.Rae1 Bg6 21.Qf3 Qd7 22.c4 Rae8 23.Rd1 d4 24.h4 h6 25.Bxh6 Ng4 26.Bg5 f6 27.Bf4 Ne3 28.h5 Bh7 29.Bxe3 Rxe3 0-1
I used these two commands:
ocgdb -pgn one_embeded_game.pgn -db one_embeded_game.db3 -cpu 4 -o moves2;discardcomments;discardsites;discardfen;reseteco -plycount 40
ocgdb -db one_embeded_game.db3 -dup -o embededgames -r report.txt
ocgdb Beta 7 didn't find the embedded double. Or maybe I am doing something wrong?
Thanks for the report.
The bug is fixed by the commit https://github.com/nguyenpham/ocgdb/commit/1a12d17cda25360c329169e601f0201f8d1c333e
(It is a funny bug: after testing, I cleaned up the code and accidentally removed some of the new code.)
The executable can be downloaded from the link below:
Thanks for the fix. When searching for all duplicates, including embedded duplicates, in one go, is it possible to create a report for only the embedded duplicates that were found? I think the embedded duplicates need the most attention, since you have to decide manually which duplicate to keep.
I tested it out with this small example database with duplicates: 60_games_with_doubles
I used these 3 commands one after the other:
ocgdb -pgn 60_games_with_doubles.pgn -db 60_games_with_doubles.db3 -cpu 4 -o moves2;discardcomments;discardsites;discardfen;reseteco
ocgdb -db 60_games_with_doubles.db3 -cpu 4 -dup -o printall;embededgames;remove -r report.txt
ocgdb -pgn 60_games_with_doubles_out.pgn -db 60_games_with_doubles.db3 -cpu 4 -export
ocgdb Beta 7 did not find this included duplicate on the first run.
[Event "Rated Blitz game"] [Site ""] [Date "2019.04.21"] [Round ""] [White "FischersFrisoer"] [Black "MarkoMakaj"] [Result "1/2-1/2"] [WhiteElo "2333"] [BlackElo "2372"] [ECO "E87"] [TimeControl "180"] [PlyCount "24"]
1.d4 Nf6 2.c4 g6 3.Nc3 Bg7 4.e4 d6 5.f3 O-O 6.Be3 e5 7.d5 Nh5 8.Qd2 Qh4+ 9.Bf2 Qf4 10.Be3 Qh4+ 11.Bf2 Qf4 12.Be3 Qh4+ 1/2-1/2
[Event "Rated Blitz game"] [Site ""] [Date "2018.09.27"] [Round ""] [White "FischersFrisoer"] [Black "toramal"] [Result "1/2-1/2"] [WhiteElo "2277"] [BlackElo "2213"] [ECO "E87"] [TimeControl "180"] [PlyCount "25"]
1.d4 Nf6 2.c4 g6 3.Nc3 Bg7 4.e4 d6 5.f3 O-O 6.Be3 e5 7.d5 Nh5 8.Qd2 Qh4+ 9.Bf2 Qf4 10.Be3 Qh4+ 11.Bf2 Qf4 12.Be3 Qh4+ 13.Bf2 1/2-1/2
When I ran the same commands again with these two games, ocgdb found the embedded duplicate.
I did some other tests to see how well the detection of embedded duplicates works with ocgdb Beta 7.
I ran the commands on this PGN file: imbedded_duplicates
I ran these commands one after the other:
ocgdb -pgn imbedded_duplicates.pgn -db imbedded_duplicates.db3 -cpu 4 -o moves2;discardcomments;discardsites;discardfen;reseteco
ocgdb -db imbedded_duplicates.db3 -cpu 4 -dup -o printall;embededgames;remove -r report.txt
ocgdb -pgn imbedded_duplicates_out.pgn -db imbedded_duplicates.db3 -cpu 4 -export
ocgdb didn't detect these embedded duplicates.
Thanks for the report!
At the moment the program works roughly as follows to find duplicates: it compares each game with all other games to find a match (full or partial). If one is found, it stops for that game and does the reporting/removing.
That means a game that has been matched to one other game is not compared with the rest, so further matches for it are not found.
Your test database is quite special in that a game can be embedded inside many other games. Thus OCGDB may detect the first embedded duplicate of a game but not the other ones.
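The difference between stopping at the first match and continuing through the rest can be sketched as follows. This is only an illustration of the behavior described above, not OCGDB's actual code; the function names and the list-of-moves representation are assumptions.

```python
from typing import List, Optional

Game = List[str]  # a game as a list of SAN moves (illustrative representation)


def is_embedded(short: Game, long: Game) -> bool:
    """True if `short` is a proper, non-empty prefix of `long`."""
    return 0 < len(short) < len(long) and long[: len(short)] == short


def related(a: Game, b: Game) -> bool:
    """True if either game is embedded in the other."""
    return is_embedded(a, b) or is_embedded(b, a)


def first_match(games: List[Game], i: int) -> Optional[int]:
    """Stop at the first game related to game i (later matches are missed)."""
    for j, other in enumerate(games):
        if j != i and related(games[i], other):
            return j
    return None


def all_matches(games: List[Game], i: int) -> List[int]:
    """Keep comparing after a match, collecting every game related to game i."""
    return [j for j, other in enumerate(games) if j != i and related(games[i], other)]
```

With a long game at index 0 and two shorter games embedded in it at indexes 1 and 2, first_match reports only index 1, while all_matches reports both.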
So if I run this command:
ocgdb -db imbedded_duplicates.db3 -cpu 4 -dup -o printall;embededgames;remove -r report.txt
multiple times, will I eventually be able to remove all the embedded duplicates?
I created the PGN by importing the best moves for one color from a bin book in Lucas Chess into opening lines. If I remove the last move (1 ply) of every line where it is a Black move, and then export the opening lines to PGN, many embedded duplicates are included. SCID has no problem detecting all these embedded duplicates in one go to remove them.
I hope this will be possible with OCGDB too in future updates.
I have just improved the code. It can now detect and remove all duplicates in a single run.
I have tried beta7c with the example PGN database imbedded_duplicates.
Something goes wrong, because almost all games get removed, except these two games:
[Event "imbedded duplicates"] [PlyCount "5"] [ECO "B28"] [White ""] [Result ""] [Black ""] [Round ""] [Date ""] [Site ""]
[Event "imbedded duplicates"] [PlyCount "1"] [Site ""] [White ""] [Result ""] [Black ""] [Round ""] [Date ""]
I ran these commands one after the other:
ocgdb -pgn imbedded_duplicates.pgn -db imbedded_duplicates.db3 -cpu 4 -o moves2;discardcomments;discardsites;discardfen;reseteco
ocgdb -db imbedded_duplicates.db3 -cpu 4 -dup -o printall;embededgames;remove -r report.txt
ocgdb -pgn imbedded_duplicates_out.pgn -db imbedded_duplicates.db3 -cpu 4 -export
If I search the example PGN database 'imbedded_duplicates.pgn' for duplicates with SCID, it finds 43 duplicates.
On my computer, the app removed all games except one. Tried several times.
I guess there may be a conflict between threads on your computer: say, two threads pick up two games at the same time, one is an embedded game of the other, and they prevent each other from being deleted. (Actually, I have considered all cases and the code/logic should cover them, so I am not really sure what happened.)
Perhaps users should do another run, just for extreme cases.
I am still trying to reproduce the problem and check the code.
I don't understand what you are saying. On my computer, all games also got removed except the two games I listed. That's the problem: many games that are not duplicates or embedded duplicates also get removed by ocgdb Beta 7 c. Only the 43 duplicates or embedded duplicates should get removed, as happens when you remove the duplicates with SCID. Maybe I named the example PGN database 'imbedded_duplicates.pgn' badly. A better name would be '311_games_with_43_embedded_duplicates.pgn'.
I like using SCID to remove duplicates and embedded duplicates, except that it doesn't work when the first 4 letters of the player names differ. SCID also has problems handling huge PGN databases with more than 20 million games. OCGDB looks promising to me. I hope you can fix the issues.
Keep up the good work!
OK, I have found some weak points in the algorithm for checking embedded games. I hope it works now:
Thanks for the fix! I have tested it with the example database, and ocgdb Beta 7 d finds 43 embedded duplicates, the same as SCID.
I did another small test with ocgdb Beta 7 d. I used this example PGN: 24_games_1_embedded_duplicate
I typed these commands one after the other:
ocgdb -pgn 24_games_1_embedded_duplicate.pgn -db 24_games_1_embedded_duplicate.db3 -cpu 4 -o moves2;discardcomments;discardsites;discardfen;reseteco
ocgdb -db 24_games_1_embedded_duplicate.db3 -cpu 4 -dup -o printall;embededgames;remove -r report.txt
ocgdb -pgn 24_games_1_embedded_duplicate_out.pgn -db 24_games_1_embedded_duplicate.db3 -cpu 4 -export
ocgdb Beta 7 d did not find this embedded duplicate:
[Event "Rated Blitz game"] [White "FischersFrisoer"] [BlackElo "2303"] [Result "1/2-1/2"] [Black "MarkoMakaj"] [Date "2019.03.06"] [TimeControl "180"] [WhiteElo "2366"] [ECO "E87"] [PlyCount "25"] [Round ""] [Site ""]
1.d4 Nf6 2.c4 g6 3.Nc3 Bg7 4.e4 d6 5.f3 O-O 6.Be3 e5 7.d5 Nh5 8.Qd2 Qh4+ 9.Bf2 Qf4 10.Be3 Qh4+ 11.Bf2 Qf4 12.Be3 Qh4+ 13.Bf2 1/2-1/2
[Event "Rated Blitz game"] [White "FischersFrisoer"] [BlackElo "2245"] [Result "1/2-1/2"] [Black "toramal"] [Date "2018.05.08"] [TimeControl "180"] [WhiteElo "2323"] [ECO "E87"] [PlyCount "24"] [Round ""] [Site ""]
1.d4 Nf6 2.c4 g6 3.Nc3 Bg7 4.e4 d6 5.f3 O-O 6.Be3 e5 7.d5 Nh5 8.Qd2 Qh4+ 9.Bf2 Qf4 10.Be3 Qh4+ 11.Bf2 Qf4 12.Be3 Qh4+ 1/2-1/2
Thanks for the report.
The bug is caused by a conflict between threads (two embedded games may be processed concurrently). The temporary solution is to set -cpu 1; then the program runs correctly.
It looks like if I leave the cpu setting at 4 but use the option moves1 instead of moves2, the embedded duplicate also gets detected.
ocgdb -pgn 24_games_1_embedded_duplicate.pgn -db 24_games_1_embedded_duplicate.db3 -cpu 4 -o moves1;discardcomments;discardsites;discardfen;reseteco
Or would that be a coincidence?
What is the difference between moves1 and moves2?
Will all options of ocgdb beta 7 d also work with the option moves1?
Yes, any change may make the processing a little different and reduce the chance that two embedded games are processed concurrently.
Changing the -cpu parameter, say from 4 to 3, can help too.
Moves1 encodes a move into one byte, while Moves2 encodes it into two bytes. Moves1 helps to produce smaller databases. However, the encoding/decoding code for Moves1 is quite complicated, and other programs have almost no choice but to integrate our code (OCGDB code). That may be a big problem if other programs use a different programming language. Integrating code from other libraries/programs is typically not easy and needs some work. In contrast, the code for Moves2 is quite simple and straightforward, using Stockfish board coordinates. Any program can implement the encoding/decoding itself without integrating OCGDB's code. Thus we encourage people to use Moves2, since it is easier for sharing/publishing. Moves1 may be good for private use and research.
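As a rough illustration of why a two-byte encoding based on board coordinates is easy to reimplement, here is a minimal Python sketch. The bit layout (6 bits for the from-square, 6 bits for the to-square, 3 bits for a promotion piece), the promotion codes, and the byte order are assumptions chosen for the example, not a statement of OCGDB's actual Moves2 format; only the square numbering (a1 = 0 up to h8 = 63) follows the Stockfish convention mentioned above.

```python
FILES = "abcdefgh"
# Hypothetical promotion codes; 0 means "no promotion".
PROMOTIONS = {"": 0, "n": 1, "b": 2, "r": 3, "q": 4}
PROMOTION_LETTERS = {code: letter for letter, code in PROMOTIONS.items()}


def square_index(name: str) -> int:
    """Map 'a1'..'h8' to 0..63 (a1 = 0, h8 = 63)."""
    return (int(name[1]) - 1) * 8 + FILES.index(name[0])


def square_name(index: int) -> str:
    """Inverse of square_index."""
    return FILES[index % 8] + str(index // 8 + 1)


def encode_move(move: str) -> bytes:
    """Pack a coordinate move such as 'e2e4' or 'e7e8q' into two bytes."""
    value = (
        square_index(move[0:2])
        | (square_index(move[2:4]) << 6)
        | (PROMOTIONS[move[4:]] << 12)
    )
    return value.to_bytes(2, "little")  # byte order is an assumption


def decode_move(data: bytes) -> str:
    """Unpack two bytes back into a coordinate move string."""
    value = int.from_bytes(data, "little")
    return (
        square_name(value & 63)
        + square_name((value >> 6) & 63)
        + PROMOTION_LETTERS[value >> 12]
    )
```

For example, decode_move(encode_move("e7e8q")) gives back "e7e8q", and every encoded move is exactly two bytes long.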
This is an example where Scid finds a duplicated game because one game is included in another game. At the moment ocgdb doesn't detect these doubles. Scid only detects these kinds of doubles when the player names are identical. I also want an option to detect these doubles when the player names are different, or when other information such as the event is different. Then I want ocgdb to keep the better game (longer game, more recent, higher Elo, longer time control, etc.). Maybe they are technically not exact doubles, but I don't see how it can be useful to keep games with exactly the same moves and exactly the same results together in a database. Doubles where one game is included in another happen quite often in human tournaments that use DGT boards: sometimes the feed gets an extra move, or a database feed like TWIC publishes a game one week and provides a correction the next week.
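The "keep the better game" rule requested above could be sketched like this in Python. This is a hypothetical illustration, not an existing ocgdb option: the function names and the order of the criteria (longer game first, then more recent date, then higher combined Elo, then longer time control) are assumptions, and only the standard PGN tag names are taken from the games quoted in this thread.

```python
from typing import Dict, Tuple

Headers = Dict[str, str]  # PGN tag pairs of one game


def _to_int(text: str) -> int:
    """Parse a numeric tag; missing or malformed values count as 0."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return 0


def keep_priority(headers: Headers) -> Tuple[int, str, int, int]:
    """Build a sort key: a higher tuple means a better game to keep."""
    plies = _to_int(headers.get("PlyCount", ""))
    # PGN dates are 'YYYY.MM.DD', so they sort correctly as strings;
    # unknown parts ('?') are treated as the oldest possible value.
    date = headers.get("Date", "").replace("?", "0")
    elo = _to_int(headers.get("WhiteElo", "")) + _to_int(headers.get("BlackElo", ""))
    # Use only the base time of a control such as '180' or '180+2'.
    base_time = _to_int(headers.get("TimeControl", "").split("+")[0])
    return (plies, date, elo, base_time)


def game_to_keep(a: Headers, b: Headers) -> Headers:
    """Return the game to keep; ties go to the first argument."""
    return a if keep_priority(a) >= keep_priority(b) else b
```

Applied to the two "Rated Blitz game" examples earlier in this thread, this rule would keep the 25-ply game and drop the 24-ply one, even though the opponents differ.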