mcostalba / scoutfish

Chess Query Engine
GNU General Public License v3.0
156 stars · 23 forks

Differences to MongoDB #8

Closed: madnight closed this issue 7 years ago

madnight commented 7 years ago

Could you explain why scoutfish is better than or different from storing PGN files in e.g. MongoDB? A small comparison in the README would be great.

mcostalba commented 7 years ago

Because this tool does not store PGN files :-)

The aim is to search a DB with complex queries... Perhaps what is not clear to you is that when I say "query" I don't mean SQL, but simply a rule or a set of rules and/or patterns that a game has to match to be selected.

If you are willing to spend just a couple of minutes reading the README, I guess you can figure it out yourself.

madnight commented 7 years ago

Okay, because "queries" sounds like a database to me, and MongoDB can store PGN converted to JSON as well, outputs JSON, and is able to handle complex (non-SQL) requests very fast. MongoDB does not have a UCI interface, but that could be provided by a wrapper.

mcostalba commented 7 years ago

For the kind of query we are talking about you need chess knowledge, that's why scoutfish is based on a chess engine.

On Friday, December 9, 2016, Fabian Beuke notifications@github.com wrote:

Okay, because "queries" sounds like a database to me, and MongoDB can store PGN converted to JSON as well, outputs JSON, and is able to handle complex (non-SQL) requests very fast


madnight commented 7 years ago

Okay, so wouldn't it be possible to connect another backend (storage), e.g. MongoDB, instead of writing your own DB, and then combine that backend with a chess query engine? MongoDB by itself is a project with 36k+ commits, so it's not an easy task to write a stable, fast and reliable database. Following the Unix philosophy "Do One Thing and Do It Well": don't merge DB + query language + engine into one thing; better to split them into separate, small, reusable, easy-to-understand tools. Or at least explain in the README why this can't or won't be done (the reason for this issue).

madnight commented 7 years ago

When you are doing this ("scoutfish will create a file called my_big_db.bin"), working on that file and allowing queries against it, then you have basically already written your own single-file DB. That is what I find a little bit confusing.

gbtami commented 7 years ago

@madnight as I see it, scoutfish (and chess_db) is awesome for working with "static" chess game collections stored in one (even huge) .pgn file. If you want frequent add/edit/delete/annotate/etc. on your chess games, you are better off using databases with native back ends like chessbase/scid/etc.

It can be an interesting discussion which back-end format is ideal for which use cases. Some interesting reading from Tord Romstad, HGM and Thibault:
https://www.reddit.com/r/chess/comments/47fexz/towards_a_replacement_for_the_pgn_format/
http://talkchess.com/forum/viewtopic.php?start=0&t=50152&topic_view=flat
https://hu.lichess.org/forum/general-chess-discussion/how-is-lichess-implemented-with-mongodb

sshivaji commented 7 years ago

I have used MongoDB quite a bit before (though it was a few years ago).

MongoDB will resolve all queries using only the information in the PGN and/or DB files, treating the database as a big text repository. MongoDB will not allow the type of search that is currently supported here.

Examples are sub-fen, material-pattern, and sequence search. To get this to work with MongoDB, one would have to put a lot more information into MongoDB for every position. Sub-fen search seems quite hard to achieve even then; it seems quite unique to chess. Full-FEN search might be possible, but again requires a lot of storage to record full FENs for all positions.
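To make "sub-fen" concrete, here is a minimal sketch (my own illustration, not Scoutfish's implementation): a sub-fen pins pieces to particular squares while leaving the rest of the board unconstrained, so a plain substring or regex match on raw FEN strings fails whenever the empty-square run-lengths differ. You have to expand the board first:

```python
def expand(board_fen):
    """Expand the piece-placement field of a FEN into 8 rows of 8 chars,
    '.' meaning an empty square (e.g. 'p1n5' -> 'p.n.....')."""
    rows = []
    for rank in board_fen.split('/'):
        row = ''
        for c in rank:
            row += '.' * int(c) if c.isdigit() else c
        rows.append(row)
    return rows

def subfen_match(sub, full):
    """True if every piece named in `sub` sits on the same square in `full`;
    empty squares in `sub` are 'don't care'."""
    s, f = expand(sub), expand(full)
    return all(sc == '.' or sc == fc
               for srow, frow in zip(s, f)
               for sc, fc in zip(srow, frow))

# The sub-fen quoted later in this thread: black pawn a6, white Bb3 and Nf3.
sub = '8/8/p7/8/8/1B3N2/8/8'
full = 'r1bqkbnr/1ppppppp/p1n5/8/8/1B3N2/PPPPPPPP/RNBQK2R'
print(subfen_match(sub, full))  # True, although '8/8/p7/' is not a substring of `full`
```

Note that the matching position's sixth rank reads `p1n5`, so the text-level pattern `8/8/p7/` never appears in it; that is the gap between text search and position search.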

If your query is chess-centric, such as "find all positions with two bishops vs. bishop and knight, starting from a particular position", the time and storage cost on MongoDB would be much larger.

For fast, specific use cases like this, this solution seems far better, with far lower storage requirements too.

The trick here is that a single chess game in PGN encodes much more information than the text it contains: sequences, pawn structures, piece paths, material count, number of attackers on a square, etc. Indexing all of this into a text database would introduce a lot of bloat and make queries slow.
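Even the simplest of those derived properties, material count, is not literal text anywhere in the PGN. A sketch of deriving a material signature from a FEN (my own toy encoding, not Scoutfish's scheme):

```python
from collections import Counter

def material_signature(fen):
    """Build a 'material signature' from the FEN piece-placement field.
    Uppercase letters are White's pieces, lowercase Black's; digits and
    slashes are board layout only and carry no material information."""
    counts = Counter(c for c in fen.split()[0] if c.isalpha())
    order = 'KQRBNP'
    white = ''.join(p * counts[p] for p in order)
    black = ''.join(p.lower() * counts[p.lower()] for p in order)
    return white + 'v' + black

# Two bishops vs. bishop and knight (kings only otherwise):
fen = '8/2k5/8/8/4B3/2B5/5K2/5b1n w - -'
print(material_signature(fen))  # 'KBBvkbn'
```

A text database would need one such derived field per position per property (material, pawn structure, attackers, ...) to answer these queries, which is exactly the bloat being described.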

sshivaji commented 7 years ago

@gbtami

I agree that the PGN format could be improved; it's a pain to parse as it is. However, even an improved PGN format makes pattern search very hard.

Scoutfish and chess_db build indexes for the DB. Though it's not yet supported, adding a game to the DB could be optimized in scoutfish, but there is less need to do so, as building a game index was very fast when I tested it. If one is building a chess UI (such as pychess), I think you can support adding or replacing games directly in the PGN file and then call chess_db or scoutfish on the new PGN file to index it as a background task.

Text replacement in a PGN is quite fast using chess_db indexes, because you know the end offset of a game and can figure out its starting offset. You can then delete or replace that region of text and re-run chess_db.
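The offset trick can be sketched in a few lines. The offsets here are hand-computed for a toy file; in practice they would come from a chess_db-style index (and a real tool would seek/stream rather than slurp the whole file):

```python
import os
import tempfile

def replace_region(path, start, end, new_text):
    """Replace bytes [start, end) of the file with new_text.
    Given per-game offsets from an index, this edits one game
    without parsing the rest of the PGN."""
    with open(path, 'rb') as f:
        data = f.read()
    with open(path, 'wb') as f:
        f.write(data[:start] + new_text.encode() + data[end:])

# Demo on a toy two-game "PGN": swap out the second game.
fd, path = tempfile.mkstemp(suffix='.pgn')
os.close(fd)
game1 = '[Game 1]\n1. e4 e5 *\n'
game2 = '[Game 2]\n1. d4 d5 *\n'
with open(path, 'w') as f:
    f.write(game1 + game2)

# start/end are exactly what an index would hand you for game 2:
replace_region(path, len(game1), len(game1) + len(game2),
               '[Game 2]\n1. c4 c5 *\n')

with open(path) as f:
    edited = f.read()
os.remove(path)
print(edited)
```

Deleting a game is the same operation with `new_text` set to an empty string; appending a game never needs offsets at all.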

gbtami commented 7 years ago

@sshivaji I'm a little bit skeptical that the .pgn + .bin format can be suitable for a chess UI (such as pychess). Don't you think rewriting the .pgn and recreating the .bin on every add/edit/delete will take an unacceptable time on a million-game DB (> one second)? With sqlite it's instant now.

madnight commented 7 years ago

You can also build very fast indexes with MongoDB, plus the nice features gbtami mentioned (add/edit/delete/annotate/etc.), great community support, and many tools. After converting your data and building your indexes you could make requests, e.g. regex-based substring search like: db.chess_db.find( { fen: {$regex: '8/8/p7/8/8/1B3N2/8/'} } ). And you get all the "cool" functions like group, mapReduce and so on out of the box.

scoutfish is a great idea; I'm only suggesting using a commonly known database as the backend (even sqlite comes to mind) and stacking all the new query features + UCI interface on top of that, without writing your own database implementation.

sshivaji commented 7 years ago

@gbtami Of course, that's only needed if you want to modify the PGN file. I support it in the kivy-chess UI and it's very fast, barely a second, because we delete from specific regions of the file and append only to the end of the file. For replace, even ChessBase appends to the end of the file, so it will be very fast. Storing a game in sqlite has its own challenges.

I think the .bin format used by both chess_db and scoutfish is a godsend for chess programs everywhere. Almost all programs choke on a 5M-game database without this kind of utility. My feeling is that for pychess you don't need to update the .bin file constantly; update it only as needed. People usually don't write much to large databases; they write to smaller user databases. Even ChessBase had the concept of a write database, which was typically small.

mcostalba commented 7 years ago

Just for completeness: the .bin formats of chess_db and Scoutfish are totally different. chess_db stores data in a way very similar to a Polyglot book, while Scoutfish's format is completely different, to allow for rich chess queries. BTW, with the Scoutfish format you don't need to rewrite the .bin if you add/delete/modify some games. If you add new games, you can just append the related info to the .bin (almost instantaneous).
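A sketch of why appends are cheap for such an index. The record layout below (game offset + move count + one 16-bit word per move) is purely illustrative and not the actual Scoutfish .bin format; the point is only that when records are self-contained, "add games" is an O(new data) append rather than a rewrite:

```python
import os
import struct
import tempfile

def append_game(index_path, pgn_offset, moves):
    """Append one self-contained record: the game's byte offset in the
    .pgn (u64), a move count (u16), then one u16 per move."""
    with open(index_path, 'ab') as f:  # append-only: never touches old records
        f.write(struct.pack('<QH', pgn_offset, len(moves)))
        f.write(struct.pack('<%dH' % len(moves), *moves))

def read_games(index_path):
    """Walk the records back out, proving the appended file is valid."""
    games = []
    with open(index_path, 'rb') as f:
        data = f.read()
    pos = 0
    while pos < len(data):
        off, n = struct.unpack_from('<QH', data, pos)
        pos += 10
        moves = list(struct.unpack_from('<%dH' % n, data, pos))
        pos += 2 * n
        games.append((off, moves))
    return games

fd, path = tempfile.mkstemp(suffix='.bin')
os.close(fd)
append_game(path, 0, [650, 1300])   # fake 16-bit move encodings
append_game(path, 120, [700])       # second game: a pure append
games = read_games(path)
os.remove(path)
```

Delete/modify are harder in an append-only layout (you need tombstones or compaction), which is consistent with the "static collections" framing earlier in the thread.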

sshivaji commented 7 years ago

@madnight,

The point is that converting the data will incur a lot of storage, likely 10x+. You would then have to encode FENs for all positions in the database (including variations). What about sub-fen and material encoding? And how do we answer the question "find me all games starting from a particular FEN where White sacrificed a piece and won the game"? I don't think this can be answered by MongoDB storage unless we add that to the information bloat as well. I feel that encoding all information such as piece sacrifice, queenside majority, isolated pawn, superfluous knight, etc. would be practically impossible.

I am not criticizing MongoDB, but the difficulty of encoding domain-specific data as text is quite immense.

Right now I am very impressed by the performance of this tool; very complex queries can be run over 100 million moves at a time on just one core.

Add/edit support is a database function, not a function of index generation. We are constrained to the PGN format for source files.

Also, think of the perspective of the semi-pro or pro chess player: they want fast answers to common questions, not a programmer's platform with map-reduce. I mean this as a semi-joke, but I'm also semi-serious. I know several GMs who barely know how to use ChessBase. Adding in map-reduce would confuse them for a while..

sshivaji commented 7 years ago

@mcostalba I had no idea it was possible to append to .bin files instantaneously. What is the command to do this in scoutfish for appends? :)

If this is the case, I will focus my UI support mainly for scoutfish.

gbtami commented 7 years ago

@mcostalba thx for the info about the .bin difference. I had not noticed it myself.

madnight commented 7 years ago

The point is that converting the data will incur a lot of storage, likely 10x+. You would then have to encode FENs for all positions in the database (including variations). What about sub-fen and material encoding?

Saving the information into a .bin file (or any extension you like, since it's not a standard) is no dark magic. It's tough to find the right source file / line, since there are many files here, but if mcostalba gives a specification or points to the line of source code where this happens, then I could explain it in more detail.

sshivaji commented 7 years ago

@madnight

Wait, the point is that he is ONLY storing moves in the compact .bin file.

For a query, he is replaying the game with Stockfish and evaluating the pattern-search query on the fly. If we were storing all that information in a .bin file, we would be as good/bad as any other solution.
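The shape of that "replay and check on the fly" idea: the index stores only the move list, and every per-position property is regenerated during the scan, never stored. A toy version that tracks nothing but material while replaying capture flags (my own illustration with made-up data; a real engine pushes full moves on a board):

```python
def replay_material(start_material, captures):
    """Yield the material signature after every half-move.
    Each 'move' here is just an optional captured-piece letter
    (uppercase = White's piece, lowercase = Black's)."""
    mat = start_material
    for captured in captures:
        if captured:
            mat = mat.replace(captured, '', 1)
        yield mat

def game_reaches(start_material, captures, target):
    """Scan-time query: does any position in the game match `target`?
    Nothing is precomputed; every state is derived during the replay."""
    return any(m == target for m in replay_material(start_material, captures))

# "Find games that reach two bishops vs. bishop and knight":
captures = [None, 'n', None, 'P', 'p', None]   # captured piece per half-move
print(game_reaches('KBBPvkbnnp', captures, 'KBBvkbn'))  # True
```

Storage stays proportional to the move list, and any new kind of pattern is just another predicate applied during the replay instead of another indexed field.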

gbtami commented 7 years ago

@mcostalba @sshivaji can you add a short spec of .bin format in README.md (or somewhere else) please?

madnight commented 7 years ago

For a query, he is replaying the game with Stockfish

Isn't replaying a game with Stockfish very slow? If Stockfish replays the games while you are running a query, the query would take ages. I really expect the data to be somehow precomputed and saved into the "mysterious" .bin file.

sshivaji commented 7 years ago

@madnight,

This repo has an optimized Stockfish that replays the games very quickly! The data is not in the mysterious .bin file (then it would be really mysterious) :)

It's currently searching something like 90 million moves a second on a single Stockfish core!

madnight commented 7 years ago

@sshivaji okay, and why can't I replace the .bin file with sqlite or MongoDB? And before you answer that, could you please post a snippet with the contents of the .bin file after importing a game?

gbtami commented 7 years ago

Seems this issue needs some chat instead :) If you agree, join https://gitter.im/scoutfish/Lobby

sshivaji commented 7 years ago

@madnight

That's possible. However, I doubt you would see a speed benefit, given that the .bin file is compactly encoded and optimized for performance. It would be a pain to fetch each move from MongoDB instead of using C code to access the compact representation.
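A rough back-of-the-envelope on that compactness (the one-16-bit-word-per-move packing is an assumption in the style of engine internals, not the documented .bin layout):

```python
import json
import struct

# Engines commonly pack a move into 16 bits (from/to squares + flags);
# assume the index stores one such word per half-move.
moves = list(range(1, 81))                      # a fake 80-half-move game
packed = struct.pack('<%dH' % len(moves), *moves)

# A per-move document-store entry repeats field names and carries a FEN
# string for every position, because the text DB cannot derive positions.
docs = [{'game': 1, 'ply': i, 'move': m,
         'fen': 'rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1'}
        for i, m in enumerate(moves)]
as_json = json.dumps(docs).encode()

print(len(packed), len(as_json))  # compact index vs. per-position documents
```

Two bytes per half-move against dozens of bytes per position is the storage gap being argued over, before even counting the round-trip cost of fetching each move from a server instead of reading it from a memory-mapped file.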

The relevant code for the .bin file is here (look at variable ofs): https://github.com/mcostalba/scoutfish/blob/master/src/parser.cpp#L87

The .bin file is not a format that other programs should really care about; all info in the .bin file can be extracted via queries to scoutfish. Currently the scoutfish .bin stores the moves and the result. Also, I am not sure what exactly others need from the .bin format; the source code has enough info, I think.

Glad to join chat :)