nguyenpham / ocgdb

Open Chess Game Database Standard (OCGDB)
MIT License
31 stars 8 forks source link
chess database sqlite standard

Open Chess Game Database Standard (OCGDB)

Version Beta

Brief of main ideas/techniques

Why OCGDB? Features/Highlights

We believe it is one of the fastest (in terms of speeds of creating and querying/searching), smallest (in terms of database sizes), largest (in terms of game numbers), and smartest (in terms of querying/position-searching) chess game database programs. It could compete for all parameters, results with the best chess game database formats and their programs/tools.

Overview

Almost all popular/published chess game databases are in binary formats. Those formats have some advantages such as being fast and compact. However, they also have some drawbacks:

Typically a binary format is attached to a specific chess software and the exchange between formats is very limited.

The advantages of being binary formats were very important in the past when computers were weak and Internet speed was slow. However, they become less important nowadays since computers become much stronger and Internet is much faster. That has been making other formats become more attractive.

On the other hand, we have already the PGN format for chess games, which could be used for chess databases too. It has some advantages:

However, there are many drawbacks to using PGN as chess databases:

Thus we think we need a new format that could be used to exchange databases between different apps and/or for web apps. Starting requirements:

SQL/SQLite

We have picked up SQL/SQLite as the main tools/direction to develop the new database. Using SQL for the chess game database is not new, someone has tried already (and gave up). We have known it has some reason drawbacks such as slow and large on size.

However, it has some advantages:

Overcome drawbacks

We confirm we have got all necessary information and overcomed all drawbacks of using SQL/SQLite. The chess game databases work very well and they are one of the smallest and fastest ones in chess game database world.

File name extension

It should keep the normal extension of a database. For example, the one for SQLite should have the extension .db3. It should have upper/extra extension .ocgdb to distinguish it from other databases. For examples of file names:

  mb345.ocgdb.db3
  carlsen.ocgdb.db3

Names

Names of tables, columns... should be Camel style, less space and close to PGN tag names as much as possible. For examples: White, BlackElo, PlyCount, GameCount, FEN

Some important Table names

Field names

PGN standard requires some name tags (compulsory), the database should have those fields too. If one is an index field, uses suffix ID. An important field is Moves to keep all moves in text form.

For examples of field names: EventID, WhiteID, BlackElo, Round, Result, Date, FEN, Moves

Popular fields in table Games

We suggest beside ID, FEN and move fields, Games should have following fields according to popular PGN tags:

EventID, SiteID, Date, Round, WhiteID, WhiteElo, BlackID, BlackElo, Result, TimeControl, ECO, PlyCount

Other tag fields in table Games

When a game has some tags that are not in the above list, users can choose to create new fields to store information of those tags or not.

Field types

Fields PlyCount, WhiteElo, BlackElo and ones for IDs (such as ID, EventID) should be INTEGER. Fields Moves1 and Moves2 should be BLOB. Other fields should be TEXT.

NULL

Except for fields of identicals such EventID, SiteID, WhiteID, BlackID, values of other fields could be NULL.

FEN field

A FEN string of each game could be stored (FEN strings) in that field. If the game starts from start-position, it could be saved as a NULL string.

Moves fields

All moves of a game could be stored in some forms in some specific fields as the below:

All moves (of a game) in Moves1/Moves2 are stored as a binary array and then saved into the database as a blob.

The algorithms of Moves and Moves2 are very simple, developers can easily encode and decode using their own code.

In contrast, the algorithm of Moves1 is quite complicated, deeply integrated into our code/library. It is not easy for developers to write their own encode/decode.

Even though Moves1 can create the smallest databases, users should consider using Moves and/or Moves2 instead, just for being easy to use by other libraries.

Date fields

Dates in PGN have format YYYY.MM.DD (such as "2022.02.21"). They are must be converted into SQLite's date standard (IS0-8601) as YYYY-MM-DD (such as "2022-02-21"). Dates are stored as text fields.

Some details of PGN dates may be missing and displayed as question (?) marks. The program simply replace each question mark by '1'. E.g., from "1950.??.??" to be "1950.11.11".

Info Table

Variant

Store chess variant information.

GameCount

Store number of records in Games table. It is useful for quickly retracting that number since querying by statement "SELECT COUNT" could take a while for huge database

Converting speed

We tested on an iMac 3.6 GHz Quad-Core i7, 16 GB RAM (the year 2017), converting a PGN file with 3.45 million games, size 2.42 GB.

Convert into a db3 file (.db3, stored on the hard disk), required 29 seconds:

    #games: 3457050, elapsed: 29977 ms, 00:29, speed: 115323 games/s

Convert into a memory database (output path is :memory:) on the RAM, required 22 seconds:

    #games: 3457050, elapsed: 22067 ms, 00:22, speed: 156661 games/s

Convert a PGN file of 94 million games from Lichess:

#games: 93679650, elapsed: 5209214ms 1:26:49, speed: 17983 games/s, #blocks: 25777, processed size: 206208 MB

Retrieve data

Query database and extract some important data fields:

for (auto cnt = 0; statement.executeStep(); cnt++) {
    auto gameID = statement.getColumn("ID").getInt64(); assert(gameID > 0);
    auto fenText = statement.getColumn("FEN").getText();
    auto moveText = statement.getColumn("Moves").getText();
}

Query database, extract some data fields and parse into chessboard, using multi-threads:

for auto cnt = 0; statement.executeStep(); cnt++) {
    auto gameID = statement.getColumn("ID").getInt64();
    auto fenText = statement.getColumn("FEN").getText();
    auto moveText = statement.getColumn("Moves").getText();
    threadParsePGNGame(gameID, fenText, moveText);
}

Position Query Language (PQL)

The EBNF (Extended Backus Naur Form) of the language is as the below:

clause = condition { ("and" | "or" | "&&" | "||") condition }
condition = expression { ( "=" | "<" | "<="| " >" | ">=" | "==“ | "!=" | "<>" ) expression } | "fen" "[" fenset "]"
fenset = fenstring {"," fenstring }
expression = term  {( "+" | "-" ) term }
term = factor {( "*" | "/" ) factor} 
factor = number | piece | "(" expression ")"
piece = piecename (<empty> | square | squareset)
piecename = "K" | "Q" | "R" | "B" | "N" | "P" | "k" | "q" | "r" | "b" | "n" | "p" | "white" | "black"
squareset = column | row | "[" (square | squarerange | columnrange | rowrange) { "," (square | squarerange | columnrange | rowrange) } "]"
squarerange = square "-" square
columnrange = column "-" column
rowrange = row "-" row
square = column row 
column = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h"
row = "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8"

Some symbols are mixed between SQL style and C/C++ style (such as users can use both <> and != for not-equal comparison) just for convenience.

A condition/expression may have some chess piece names such as Q (for white Queens), p (for black Pawns). When evaluating, they are counted as cardinalities/total numbers of those chess pieces on the chessboard.

For examples:

R                the total number of White Rooks
qb3              the total number of Black Queens on square b3
B3               the total number of White Bishops on row 3
bb               the total number of Black Bishops on column b
n[b-e]           the total number of Black Knights from column b to e
P[a4, c5, d5]    the total number of White Pawns on squares a4, c5, and d5

Users can do all calculations between those pieces, numbers and all will turn to conditions (true/false) in the end.

The condition may be implicit or explicit:

R         the implicit form of the comparison R != 0
R == 3        the total of White Rooks must be 3
q[5-7] >= 2     the total of Black Queens from row 5 to row 7 must be equal or larger than 2

Comments

Any text after // will be trimmed out.

Some other examples:

// find all positions having 3 White Queens
Q = 3

// 3 White Queens, written in inverted way
3 = Q

// Find all positions having two Black Rooks in the middle squares
r[e4, e5, d4, d5] = 2

// White Pawns in d4, e5, f4, g4, Black King in b7
P[d4, e5, f4, g4] = 4 and kb7

// black Pawns in column b, row 5, from square a2 to d2 must be smaller than 4
p[b, 5, a2 - d2] < 4

// Black Pawns in column c more than 1
pc > 1

// White Pawns in row 3 from 2
P3 >= 2

// Two Bishops in column c, d, e, f
B[c-f] + b[c-f] = 2

// There are 5 white pieces in row 6
white6 = 5

// There are 5 white pieces in row 6 - another way to write
white[6] = 5

// total back pieces in columns from a to c is more than 8
black[a-c] > 8

// Find by a fen string
fen[r2qkb1r/1p2nppp/p3p3/4Pb2/2BB4/P7/1P3PPP/RN1Q1RK1 b kq - 0 12]

// Find by some fen strings
fen[rnbqkbnr/pp2pppp/2p5/3pP3/3P4/8/PPP2PPP/RNBQKBNR b KQkq - 0 3, rn1qkbnr/pp2pppp/2p5/3pPb2/3P4/5N2/PPP2PPP/RNBQKB1R b KQkq - 1 4]

The Parser

Because the language is very simple and input strings are typically so short, we implement its parser in a simple, straightforward way, using the recursive method. From an input string (query), the parser will create an evaluation tree. That tree will be evaluated at every chess position with the parameters as a set of bitboards of that position. The position and its game will be picked up as the result of the evaluation of the tree is true (not zero).

Sample databases

There are two sample databases in the samples project at: https://github.com/nguyenpham/ocgdb-samples

You may open it with any SQLite browsers/tools and make some queries to understand its structures, speed, advantages, and disadvantages.

SQL commands

The file SqlCmd.md contains some SQL commands to create databases and tables, examples to insert, and query tables.

Code

All samples, tools are C++ 17.

Compile

Run make file in the subfolder src:

cd src
make

In macOS, Windows, you can run and compile with Xcode/Visual Studio with the project files in the folder projects.

Usage

Users may run the program:

History

License

MIT: Almost totally free (requires just fair use). All documents, codes, data samples in this project are ready and free to integrate into other apps without worrying about license compatibility.