mcostalba / chess_db


Support other output formats #6

Closed sshivaji closed 7 years ago

sshivaji commented 7 years ago

Currently, the repo supports only Polyglot output. It does this very fast (kudos)!!

The output currently returns the moves from any position and their weights. For a tournament chess player, it is typical to want:

  1. Moves, weights, and win/draw/loss statistics.
  2. All games played from that position; this can be just game IDs and header information, and one can externally load a game when selecting it.

I previously supported both 1 and 2 in this codebase branch: https://github.com/sshivaji/polyglot/tree/leveldb

I realize the overhead might be significant with game IDs, but the information is quite useful. I can submit a patch for output to LevelDB and/or SQLite. The main question is: what are your thoughts on the best way to proceed?

gbtami commented 7 years ago

It would be cool to see chess_db evolve from a simple tool into a full-featured PGN parser library with a nice API, something like https://github.com/niklasf/python-chess/blob/master/chess/pgn.py. If it could be called from Python (using ctypes or cffi), it would be even more awesome. Maybe these are too-ambitious ideas, but who knows :)

mcostalba commented 7 years ago

@sshivaji OK, now the tool stores the result and game ID... same speed as before!

See 92231c94062159 and, especially, read the log message.

@sshivaji @gbtami In general, it is OK for me to build up something more ambitious, but without duplicating what already exists and is very good, like python-chess.

My design idea is to write in low-level C/C++ only the stuff that must be blazing fast, and to write the UI in something more flexible and higher level; here I am for Python, because I like it and know it a bit (although much less than C++). In particular, the lookup code should be a separate tool, external to the parser, which just reads PGN files and creates big index files.

Regarding the backing store, I think we can use the original PGN files for the moment: they are complete, easy to look up (using the big index built by the parser), and very convenient. I don't think that moving the same information from one medium (the PGN files) to another (some DB) adds real value for now.

gbtami commented 7 years ago

The PGN parsers in python-chess and pychess are really slow. See https://github.com/niklasf/python-chess/issues/80

From the python-chess and pychess point of view, the ideal solution would be for them to call chess_db functions from Python code. I mean not just lookup code that uses the chess_db-created index files, but calling chess_db's fast C++ parser functions from Python to solve other use cases.

For example, in pychess I started to implement database functionality on top of an SQL database. See https://github.com/pychess/pychess/tree/master/lib/pychess/Database Besides the usual stuff, I've added a bitboard table to support future positional search possibilities. I create these bitboards in the PGN parser's simple_parse_movetext(). This shows that Polyglot creation is OK for opening-book-style databases, but not for general chess database use cases. I can imagine users of the python-chess PGN parser have several other use cases.

Maybe, besides process_pgn(), implement another function, something like a Python generator that emits an item on each parse_game() call, where the emitted item could be the list of san_to_move()-parsed moves. (Disclaimer: I'm not a C/C++ programmer, so what I'm talking about here may be complete garbage.)

mcostalba commented 7 years ago

I am not sure I understand the use cases you are proposing.

Could you please give some examples of them in the form: possible input query and corresponding expected output?

sshivaji commented 7 years ago

@mcostalba, wow on the game ID commit. I never thought we could stuff it into the learn parameter. However, I have a question/confusion: in a particular position, there could be a thousand games played from it. How will you encode all 1,000 game ID offsets in the learn parameter? I will aim to support your latest API in my UI, https://github.com/sshivaji/kivy-chess, soon.

@gbtami, I wonder if it's better to open a different issue for the Python API request. I want to respectfully separate the game ID output discussion from the positional search API request (I do like the positional search API and will be commenting on the other issue too) :) Putting on my tournament chess player hat, I can think of searches such as two bishops vs. bishop and knight, lookup by pawn structure, lookup by piece path, etc. I think that's why @gbtami is proposing a Python API that allows users to extend your C code and leverage it to perform more complex searches.

sshivaji commented 7 years ago

@mcostalba I think I finally understand your game_id output idea. We have to traverse down the tree to get all the games. Will try this out and comment back. It would be great if this works without more complex storage.

mcostalba commented 7 years ago

It is not a tree search but a list one: all positions are sorted by key and move, so once you find the first one, the others follow consecutively.

See how the tool writes the book entries to disk...
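
For illustration, a minimal sketch of that list-style scan in Python (find_first, seek_entry, and next_raw here are hypothetical accessors over the 16-byte Polyglot records, not chess_db's actual API):

    # Sketch: collect every entry for a position key from a book whose
    # entries are sorted by (key, move). find_first() is assumed to do a
    # leftmost binary search; next_raw() returns (key, move, weight, learn).
    def entries_for(book, key):
        out = []
        i = book.find_first(key)
        while i < len(book):
            book.seek_entry(i)
            entry = book.next_raw()
            if entry[0] != key:
                break              # keys are sorted, so no more matches
            out.append(entry)
            i += 1
        return out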

mcostalba commented 7 years ago

BTW, regarding advanced search: it is very easy, because do_move() also computes material and pawn hash keys.

That means it would be possible to immediately look up a given material distribution, a given pawn structure, or both at the same time.
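
As a rough illustration of how those keys could be used (the index layout below is an assumption for this sketch, not something chess_db implements):

    from collections import defaultdict

    # Sketch: while parsing, bucket game offsets under the material and
    # pawn hash keys that do_move() maintains incrementally. A later query
    # for a material distribution or pawn structure is then one dict lookup.
    material_index = defaultdict(list)
    pawn_index = defaultdict(list)

    def record_position(material_key, pawn_key, game_offset):
        material_index[material_key].append(game_offset)
        pawn_index[pawn_key].append(game_offset)

    def find_by_material(material_key):
        return material_index.get(material_key, [])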

gbtami commented 7 years ago

@sshivaji I'm sorry for hijacking this issue with SQL and Python ideas! @mcostalba I've opened issue #9 to discuss my lib idea.

sshivaji commented 7 years ago

This looks interesting; it seems it would be trivial to combine this with a position search as well? In terms of the API, I have an idea: can we call your ./parser program with an existing Polyglot file, pass a parameter (such as a material or position search), and then get the output from ./parser? Once this output is something generic, we could maybe build a Python API on top of it.

@gbtami, no worries; actually, your contribution is making a huge impact even on this discussion :)

sshivaji commented 7 years ago

@mcostalba, I see, a list instead of a tree. However, would that not miss position transpositions? Nevertheless, this looks quite interesting, and I will try out the API ASAP :)

sshivaji commented 7 years ago

@mcostalba, I am now able to read your game_id output, and I now understand how the Polyglot entries will work. For the start position, we will see 2 million entries in a 2-million-game database :)

I think the offset might not be working in some cases. I will post some output and bugs if I find any. The lookup speed is quite reasonable, especially given that Polyglot positions are found via binary search.

sshivaji commented 7 years ago

@mcostalba, one more request. Do you think it makes sense to return the offset of where a game starts, as opposed to an arbitrary location within the game? The start location is more useful: an arbitrary location requires parsing in both directions, while with the start location we only have to parse one way.. :) I assume performance will be similar, as you are parsing one game at a time?

mcostalba commented 7 years ago

No, you have to parse only backward, and I suggest using something like find("[Event", -1).

We can't use the offset of the start position, because the offset is truncated/rounded to 8 bytes.

If you post the code I will check it.
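
A minimal sketch of that backward scan in Python (assuming the stored offset has already been converted to a byte offset; mmap.rfind plays the role of find("[Event", -1)):

    import mmap

    # Sketch: given a byte offset that lands somewhere inside a game,
    # scan backward for the nearest "[Event" tag, i.e. the game start.
    def game_start(pgn_path, byte_offset):
        with open(pgn_path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                start = mm.rfind(b"[Event", 0, byte_offset + len(b"[Event"))
        if start < 0:
            raise ValueError("no game header before this offset")
        return start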

sshivaji commented 7 years ago

I see. I will post code soon to the kivy-chess repo.

One other question, related to performance. I notice that the Polyglot lookup is slow with all the game IDs (it takes a few seconds). Sometimes I want only the distinct moves, quickly, without all the games; at other times (when it is feasible), it's useful to see all games. Is there a trick that allows seeing both (the full and the non-full option) in the same file? (I might be too greedy.. :))

mcostalba commented 7 years ago

A few seconds is too much; there must be something wrong.

When I see your code, maybe I can give you some suggestions.

I prefer to comment directly on the code. It is more effective, IMO.

sshivaji commented 7 years ago

Of course, it's more effective directly on the code!

My code is at https://github.com/sshivaji/kivy-chess/blob/libchess/main.py#L5150 in the libchess branch using python-chess from https://github.com/sshivaji/python-chess/tree/pgn

Basically, it's Python code calling C code in various places. The elapsed lookup time is about 3 seconds for the start position, but after that it gets much faster. I wonder if the Python code in polyglot_opening_book.py needs to be optimized.

Note that in the old format it was very quick, not this slow (though a few seconds on a 2-million-game base is not necessarily slow either).

sshivaji commented 7 years ago

I finally narrowed it down. It's the Polyglot seek_position code in https://github.com/sshivaji/kivy-chess/blob/libchess/chess/polyglot_opening_book.py

If we have 2 million game IDs for the start position, the Python code below (the block marked "This code block takes too long") is the culprit. On an i7 (using just one CPU), it takes about 1.5 seconds for 2 million game IDs. Let me know if you have an idea to improve the multi-entry seek speed here. This code would probably run faster in C/C++, but there might be a way to improve it in Python.

def seek_position(self, position):
    # Calculate the position hash.
    import time
    start_time = time.time()
    key = position.__hash__()

    # Do a binary search over the book entries.
    start = 0
    end = len(self)
    while end >= start:
        middle = (start + end) // 2  # integer division

        self.seek_entry(middle)
        raw_entry = self.next_raw()

        if raw_entry[0] < key:
            start = middle + 1
        elif raw_entry[0] > key:
            end = middle - 1
        else:
            # Position found. Move back to the first occurrence.
            # This O(n) backward scan takes too long when a key has many
            # entries (e.g. the start position has 1M game_id entries).
            start_time = time.time()
            seek_count = 0
            self.seek_entry(-1, 1)  # step back one entry (relative seek)
            while raw_entry[0] == key and middle > start:
                seek_count += 1
                middle -= 1
                self.seek_entry(middle)
                raw_entry = self.next_raw()

                if middle == start and raw_entry[0] == key:
                    self.seek_entry(-1, 1)
            end_time = time.time()
            print("Elapsed move back to first occurrence time was %g seconds"
                  % (end_time - start_time))
            print("seek counts: {0}".format(seek_count))

            return

    raise KeyError()

sshivaji commented 7 years ago

I just looked over how you do the Polyglot get-leftmost-entry; the code I posted above is old Python code taken from another project (of course, it's still my fault :)) and is not as optimized. It's especially telling when we have many entries. I will change the code to match what you have in Stockfish, and the performance should be far better. It seems we are doing an O(n) scan backwards, one entry at a time, which is quite inefficient. Will improve and report back. Let me know if you have anything to add to this, and thx a lot for the chess_db project :)
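
For reference, a leftmost binary search removes the backward scan entirely; a minimal sketch using the same hypothetical seek_entry/next_raw accessors as above:

    # Sketch: find the index of the FIRST entry with the given key in
    # O(log n), instead of scanning backward one entry at a time.
    def find_first(book, key):
        lo, hi = 0, len(book)
        while lo < hi:
            mid = (lo + hi) // 2
            book.seek_entry(mid)
            if book.next_raw()[0] < key:
                lo = mid + 1      # first match must be to the right
            else:
                hi = mid          # mid could still be the first match
        return lo                 # index of the first entry >= key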

mcostalba commented 7 years ago

@sshivaji OK, it is just a quick hack, but I have added an interactive capability to the parser tool; in particular, you can start the "parser" process, keep it open, and probe positions by passing the book file name and the FEN string to look up. It also works from the command line, like:

./parser ../pgn/bali02.bin rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1

I think you can call this from your Python library.

mcostalba commented 7 years ago

@sshivaji sorry, the calling command is

./parser find ../pgn/bali02.bin rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
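
A minimal sketch of driving this from Python with subprocess (the paths here are just examples, and a one-shot call is shown; for many queries it would be better to keep a single parser process open and feed it queries):

    import subprocess

    # Sketch: probe one position by spawning the parser's "find" command
    # and capturing whatever it prints.
    def probe(book_path, fen):
        return subprocess.check_output(
            ["./parser", "find", book_path, fen], text=True)

    print(probe("../pgn/bali02.bin",
                "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"))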

sshivaji commented 7 years ago

I have just integrated your parser tool into the Python program at https://github.com/sshivaji/kivy-chess/tree/libchess. It is very fast now!! I will test the game_id output for bugs (or any potential issues) and report them soon.

A few simple thoughts:

  1. Does it make sense for the metadata of the PGN file to be exported to CSV or something in your output? E.g., it would be cool to get game_id, white, black, white_elo, black_elo, tournament, etc., so that one can use this output to query by header and position together. An example query would be to search for all of Carlsen's games on the black side of the French when he was rated above 2700. With just the game_id output, I can get all games from the French, but I can't filter on Carlsen and a 2700+ Elo.

An example of that support is something I put up at http://www.drshivaji.com:3334/

However, the parser code for that solution is vastly inferior (in speed and storage size) to what you have :)

  2. Minor: maybe it's worthwhile calling this tool chess_parser in make and make install, instead of just "parser".

sshivaji commented 7 years ago

Now I'm reaching interesting conclusions. The main one is that Python is still slow for many tasks compared to C++.

  1. Getting 2M positions via your API for the initial position, and then parsing and sorting all the entries, took 7 seconds in Python on an i7! I think there should be a way to aggregate all games (with game_ids) into win/draw/loss info, and that code should be in C++.

  2. Maybe I will open a separate issue on exporting header information.

  3. I will open a pull request on the first point I made above. Having a C++ API was quite fast.

mcostalba commented 7 years ago

@sshivaji to speed up bulk counting for positions with many entries, I could add some meta-entries at the beginning of a position's entry sequence.

The entry format is:

struct PolyEntry {
    PKey     key;
    PMove    move;
    uint16_t weight;
    uint32_t learn;
};

So while sorting entries I can add some meta-info entries like:

<pos key>, MOVE_NONE, 0xFFFF, <meta learn>

where the learn field has the most significant byte for the type and the remaining bits for the value, like:

  enum MetaType { TotalCnt, WhiteWin, Draw, ... };

  MetaType t = MetaType(learn >> 24);
  uint32_t value = learn & 0xFFFFFF;

Since they are stored with the highest weight, these entries will be sorted first, and they should give summary info for the position without traversing the whole list.

I think this should still be compatible with the Polyglot format, because MOVE_NONE should be ignored by book readers.
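
A rough sketch of how a reader could split such entries out (the layout follows the proposal above; the MOVE_NONE encoding of 0 is an assumption):

    # Sketch: separate the proposed meta-entries from normal book entries.
    # A meta-entry is marked by move == MOVE_NONE and weight == 0xFFFF; its
    # learn field packs an 8-bit MetaType and a 24-bit value.
    MOVE_NONE = 0  # assumed encoding of "no move"

    def decode_entry(key, move, weight, learn):
        if move == MOVE_NONE and weight == 0xFFFF:
            return ("meta", learn >> 24, learn & 0xFFFFFF)  # (type, value)
        return ("move", move, weight, learn)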

sshivaji commented 7 years ago

That's a cool idea, and far better than my sort API request! I have a small variant on the above to propose.

It's also useful to know all the moves that are played in a position. Do you think it makes sense to have the regular Polyglot entries (without the full mode) at the top of the list with their normal weights, and the game ID ones below with 0 weight (the result and game_id being encoded in the learn field, as you mentioned before)?

With this method, we get compatibility with the regular Polyglot book format as well as with the extended format. The benefit is that we can quickly see all moves played from the position, and their weights too.

For the total count, 24 bits will be enough, as that is about 16.7 million. White wins, draws, etc. can all be expressed as percentages, so maybe just use 7 bits for White's win percentage, and from that we can compute W/D/L etc. easily? Maybe there is an even better solution.
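
A toy sketch of that packing (purely illustrative; the field layout is this proposal's assumption, not the committed format):

    # Sketch: pack a 24-bit game count and a 7-bit White win percentage
    # (0-100) into a single 31-bit value, and unpack them again.
    def pack_stats(total_count, white_win_pct):
        assert total_count < (1 << 24) and white_win_pct <= 100
        return (white_win_pct << 24) | total_count

    def unpack_stats(packed):
        return packed & 0xFFFFFF, (packed >> 24) & 0x7F  # (count, win %)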

sshivaji commented 7 years ago

Ultimately, I am fine with however this is done, as long as the meta information can be extracted from Python. Happy to test and integrate when needed :)

mcostalba commented 7 years ago

@sshivaji I have just pushed a commit incorporating these ideas. In particular, differently from my first proposal, I now compute the meta info at the move level, not at the position level. This should allow easy lookup of all the moves played in a given position.

sshivaji commented 7 years ago

Just tried it. Currently, the meta entries seem to be at the end, rather than at the beginning, of the Polyglot index for a hash.

The meta entry does seem to show up, with the move weight being the actual move, as expected.

mcostalba commented 7 years ago

@sshivaji thanks for testing. It should be fixed now; can you please pull again and retest? Thanks.

sshivaji commented 7 years ago

For some reason, sorting now takes forever after the last commit, with millionbase-2.22.pgn.

I get:

Sorting... (and this takes a long time to finish; it did not finish after 5+ minutes on my machine).

mcostalba commented 7 years ago

@sshivaji Thanks for reporting. It should be fixed now.

sshivaji commented 7 years ago

Sorting and retrieval work now. The JSON format is a great touch!

I might have a question about the weights; they seem a bit off in the output. However, I will test more with my user interface and report issues within a day or two.

sshivaji commented 7 years ago

I want to do more testing, but I have an issue with the game_id offset: it is not working for me. It may well be my fault.

The game byte offset I get is bigger than the length of the file in most cases, and I wonder what I am doing wrong. Looking over your C++ code, the relevant line is the one below, and casting to a 32-bit unsigned int seems right:


 const uint32_t learn =  ((uint32_t(result) & 3) << 31)
                          | ((uintptr_t(data) >> 3) & 0x3FFFFFFF);

Here is my Python decoding (a fragment; bitstring is imported at module level):

            for i, e in enumerate(polyglot_entries):
                if i >= 1:
                    break  # only inspect the first entry
                l = bitstring.BitArray(uint=e.learn, length=32)
                print("e.move : {0}".format(e.move))

                # The top 2 bits of learn encode the game result.
                result = l[:2].uint
                if result == 2:
                    result = '1/2-1/2'
                elif result == 0:
                    result = '1-0'
                elif result == 1:
                    result = '0-1'
                else:
                    result = '*'

                print("result: {0}".format(result))

                # Drop the 2 result bits; the remaining 30 bits are the offset.
                del l[:2]

                game_offset = l.uint - 500

sshivaji commented 7 years ago

Also, I have verified that the first 2 bits of the learn field get chopped off correctly with my logic.

sshivaji commented 7 years ago

Now I am wondering whether it's because the address you are providing is not relative to the start of the file. If so, maybe we have to subtract the file's start address. However, I am not sure this is the fix, given the presence of mmap, etc.

mcostalba commented 7 years ago

Yes, you are right. It should be fixed now.

sshivaji commented 7 years ago

The address is now relative but still does not seem right. To reproduce:

  1. Go to the PGN folder.
  2. parser book romero.pgn full
  3. parser find romero.bin rnbqkbnr/pppppppp/8/8/3P4/8/PPP1PPPP/RNBQKBNR b KQkq d3 0 1 (this looks up the position after 1.d4 in romero.bin)

I get:

{
    "fen": "rnbqkbnr/pppppppp/8/8/3P4/8/PPP1PPPP/RNBQKBNR b KQkq - 0 1",
    "key": 9443689642921087454,
    "ofset": 3248,
    "moves": [
       {
            "g8f6": "1"
       },
    ]
}
  4. Now I look up the game_id at that offset in the polyglot_entry, and from the learn field, after stripping off the result bits, I get 457 as the offset. When I seek to byte 457, I don't get the game starting with 1.d4: the game there starts with 1.e4 and is the first game. The game starting with 1.d4 is the 2nd game.

I wonder what I am doing wrong :) The only guess in my head is that it's mmap-related.

sshivaji commented 7 years ago

@mcostalba Never mind; it turns out I have to multiply the offset by 8 to get the byte offset. I had no idea I had to do this :) I will test and continue to report issues.

mcostalba commented 7 years ago

Sorry, I forgot to mention that the offset is right-shifted to allow indexing bigger files.
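
So the full decode of the learn field ends up as (a sketch, following the layout discussed above):

    # Sketch: recover the result and the byte offset from a 32-bit learn
    # field: the top 2 bits are the game result, and the low 30 bits are
    # the game's start offset divided by 8 (right-shifted by 3 bits).
    def decode_learn(learn):
        result = (learn >> 30) & 3
        byte_offset = (learn & 0x3FFFFFFF) << 3  # multiply back by 8
        return result, byte_offset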

sshivaji commented 7 years ago

No worries!

I have fixed the result bit parsing and storage now. It seems to be working fine on a few databases (tested on small and large PGNs). I have issued a pull request for the same.

sshivaji commented 7 years ago

One more pull request, to cover win/draw/loss information in the JSON. The weight field in the Polyglot book is actually not as useful for humans as it is for computers; W/D/L information is far more relevant.

A related comment: I am stunned by the speed of lookups on a Polyglot book. What takes 8+ seconds in Python (the initial-position game_id lookups plus W/D/L stat generation on the 2.2-million-game base) takes only 0.2 seconds in C++ on the same hardware!!

real    0m0.247s
user    0m0.240s
sys     0m0.004s

sshivaji commented 7 years ago

I will close this issue, given that win/draw/loss stats are now possible in addition to game_id and file-offset information. I will open new issues on things like pawn structure search, etc.