nguyenpham / ocgdb

Open Chess Game Database Standard (OCGDB)
MIT License

Don't get option to discard games with ply-count under n to work #27

Closed Jonathan003 closed 2 years ago

Jonathan003 commented 2 years ago

I ran a trial to test the option that removes games shorter than a given ply count. I typed these commands one after the other:

ocgdb -pgn human.pgn -db human.db3 -cpu 4 -o moves2;discardsites

ocgdb -db human.db3 -cpu 4 -plycount 80 -dup -o printall;remove

ocgdb -pgn human_out.pgn -db human.db3 -cpu 4 -export

All games shorter than 80 ply are still in the output human_out.pgn. I probably mistyped something in the second command.

ghost commented 2 years ago

The second command removed all the duplicates over 80 ply.

You can create an output file by using -r to check to see what was removed.

For example:

ocgdb -db human.db3 -cpu 4 -plycount 80 -dup -o printall;remove -r removed.pgn

The pgn is readable with Banksia and Chessbase.

nguyenpham commented 2 years ago

You may remove shorter games when creating the new database. The first command should be:

ocgdb -pgn human.pgn -db human.db3 -cpu 4 -o moves2;discardsites -plycount 80
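As a rough illustration of what a ply-count threshold means, here is a minimal Python sketch that counts plies (half-moves) in a PGN movetext string. The function name `ply_count` is hypothetical, and this is not OCGDB's own counting logic, which the thread does not show. The sketch assumes the movetext has no variations in parentheses and no `;` line comments.

```python
import re

# Game-termination markers defined by the PGN format.
RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}

def ply_count(movetext):
    """Count half-moves in a simple PGN movetext string (hypothetical helper)."""
    # Drop brace comments such as "{a note}".
    text = re.sub(r"\{[^}]*\}", " ", movetext)
    # Drop move numbers such as "1." or "1...".
    text = re.sub(r"\d+\.+", " ", text)
    # What remains are moves, results, and NAGs like "$1"; keep only moves.
    tokens = [t for t in text.split()
              if t not in RESULTS and not t.startswith("$")]
    return len(tokens)

game = "1. e4 e5 2. Nf3 {a developing move} Nc6 1-0"
print(ply_count(game))
```

Under these assumptions, a game would be discarded by a threshold of 80 when `ply_count(movetext) < 80`.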

Jonathan003 commented 2 years ago

Thanks! Out of curiosity, why did you add an option to discard sites? I know it sometimes helps to remove the Site tag from a huge PGN before importing it into a SCID file. Otherwise SCID fails to import the PGN and you get error messages like "to many player names"

nguyenpham commented 2 years ago

Games from some sources (such as Lichess) may use a URI as the Site tag. This means each game may have a unique string for its Site tag, so 100 million games will require 100 million records in the Site table. That may slow down our app (as well as any other chess database app) significantly, while the benefit of storing such information is almost zero.

Discarding the Site tag for Lichess games (or any similar ones) solves that problem. You may keep that field for games from other sources or when the number of games is not too big (say, under 1M).
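The growth described above can be sketched with a small in-memory SQLite table. The `Sites` schema, the helper name `count_site_rows`, and the Lichess-style URLs below are made up for illustration and are not OCGDB's actual table layout or real game links.

```python
import sqlite3

def count_site_rows(site_tags):
    """Insert each game's Site tag into a lookup table and return its row count."""
    con = sqlite3.connect(":memory:")
    # Hypothetical schema for illustration; not OCGDB's actual layout.
    con.execute("CREATE TABLE Sites (ID INTEGER PRIMARY KEY, Name TEXT UNIQUE)")
    for tag in site_tags:
        # A repeated name is ignored, so only distinct strings add rows.
        con.execute("INSERT OR IGNORE INTO Sites (Name) VALUES (?)", (tag,))
    rows = con.execute("SELECT COUNT(*) FROM Sites").fetchone()[0]
    con.close()
    return rows

# Tags that repeat across games keep the table tiny.
shared = ["London"] * 1000
# One URL per game (made-up IDs) means every game adds a row.
unique = [f"https://lichess.org/game{i:04d}" for i in range(1000)]

print(count_site_rows(shared))
print(count_site_rows(unique))
```

With repeated tags the lookup table stays at one row; with one URL per game it grows to as many rows as there are games, which is the case the discardsites option avoids.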

BTW, if users don't discard the Site tag, OCGDB auto-detects Site strings that come from Lichess and saves them into a new field named "Source" in the Games table instead. This avoids writing that information into the Site table too many times, which would be too slow.
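The routing idea can be sketched as below. The detection rule (a fixed URL prefix) and the function name `split_site` are assumptions made for this example; the thread does not show how OCGDB actually recognizes Lichess strings.

```python
# Assumed detection rule for illustration only.
LICHESS_PREFIX = "https://lichess.org/"

def split_site(site_tag):
    """Return (site, source): per-game Lichess URLs go to source, other tags stay in site."""
    if site_tag.startswith(LICHESS_PREFIX):
        # Stored with the game itself, so no Site lookup row is needed.
        return None, site_tag
    # Ordinary site names repeat across games and suit a shared lookup table.
    return site_tag, None

print(split_site("https://lichess.org/abcd1234"))
print(split_site("London"))
```

Keeping the per-game URL in the game's own row means the shared Site table only holds names that actually repeat.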

More details here:

http://talkchess.com/forum3/viewtopic.php?f=7&t=78464&start=240#p919062

Jonathan003 commented 2 years ago

Thanks for the information.