official-stockfish / fishtest

The Stockfish testing framework
https://tests.stockfishchess.org/tests

Accessibility and longevity of stored games. #697

Open Sopel97 opened 4 years ago

Sopel97 commented 4 years ago

Fishtest produces a huge number of relatively good-quality games (compared to human games). The opening books also ensure high variability, so the games cover many variations. Sadly, right now they get deleted quickly and access to them is painful: one has to download a single batch at a time.

If the games were more easily accessible and kept longer, it would allow people to do data analysis on them. I was considering making a book from the aggregated results of the fishtest games to see how good it would be, but right now it's not possible to acquire the games in large enough numbers.

What would be required for this to happen?

If storage space is the problem, there are more compact binary formats for storing games than PGN. Less than 1 byte per move is easily achievable without complex compression, and at speeds faster than PGN can be processed. Most of the PGN tags could also be removed, so only the most important would be kept. For Stockfish, the only one that matters for very old games is the FEN tag.
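As a rough illustration of how sub-byte moves are possible, and not the format being offered in this thread, each move can be stored as its index in the sorted list of legal moves, using only as many bits as that list requires. The sketch below assumes the legal moves are supplied by some external move generator; `encode_moves` and its input shape are hypothetical.

```python
def encode_moves(choices):
    """Bit-pack a game given (legal_moves, played_move) pairs.

    Assumption: the caller supplies the legal moves for every position,
    e.g. from a move generator; a decoder would regenerate the same lists.
    """
    bits = 0
    nbits = 0
    for legal, played in choices:
        ordered = sorted(legal)                 # deterministic ordering
        index = ordered.index(played)
        # Bits needed to distinguish len(ordered) options (at least 1).
        width = max(1, (len(ordered) - 1).bit_length())
        bits = (bits << width) | index
        nbits += width
    pad = (-nbits) % 8                          # pad to a whole byte
    return (bits << pad).to_bytes((nbits + pad) // 8, "big"), nbits


# Two plies: 4 legal moves (2 bits) followed by 2 legal moves (1 bit).
packed, used_bits = encode_moves([
    (["e2e4", "d2d4", "g1f3", "c2c4"], "g1f3"),
    (["e7e5", "c7c5"], "e7e5"),
])
```

With a few dozen legal moves in a typical position, this needs 5 or 6 bits per ply before any further compression.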

The file granularity could be solved by periodic (say, every few days) merging of the worker PGN files (that's when the conversion to the binary format could happen).

With this scheme, assuming an average of 200 plies per game, the whole history of fishtest could be kept well under 400GB of space.

If there is willingness to work on this I can provide help and tooling for compact game representation.

ppigazzini commented 4 years ago

@Sopel97 the PGNs for all LTC games should be already saved on Google Drive, check this wiki page https://github.com/glinscott/fishtest/wiki/PGN-files-of-games-played-on-fishtest

By the way, the fishtest server has 100 GB of hard disk space.

Sopel97 commented 4 years ago

And I think it's apparent that that's a lot of files to download, and Google Drive is not particularly happy when asked to download them all at once.

Also, LTC games in the last 100 tests account for only 10% of all the games.

If the hardware limitation is not solvable right now, then I guess this issue could be closed.

Sopel97 commented 4 years ago

After 5 hours of waiting for Google Drive to compress it, it errored out.

tomtor commented 4 years ago

I will move the 2018 and 2019 games on the Google drive to separate zip files.

Sopel97 commented 4 years ago

I'm currently downloading each individual bz2 file with a Python script. It goes at around one per second. I estimate there are around 200000 of them, so I guess I'll be done in about 2 days! Not sure what happens when someone changes the directory structure midway; I guess we're gonna find out.

tomtor commented 4 years ago

I did 2019.zip (4.5 GB). Now packing 2018.zip (14.9 GB).

Will leave the 2018 and 2019 subdirectories in place.

The file references shouldn't have changed, so if your script has fetched all references in advance it might just fetch them.

Sopel97 commented 4 years ago

Okay, I have the games, but right now I'm fighting with some issues.

Sopel97 commented 4 years ago
[Event "Batch 152: Bpsqt_tuned vs master"]
[Site "http://tests.stockfishchess.org/tests/view/5be9239d0ebc595e0ae33406"]
[Date "2018.11.13"]
[Round "64"]
[White "Base-30a905c"]
[Black "New-ef0f1f2"]
[Result "1-0"]
[FEN "rn1qkbnr/ppp1pppp/8/3p1b2/8/N6P/PPPPPPP1/R1BQKBNR w KQkq -"]
[PlyCount "0"]
[SetUp "1"]
[Termination "abandoned"]
[TimeControl "51.5+0.51"]
 1-0

This is not valid PGN. It's missing an empty line before the 1-0.
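A minimal sketch of how such records could be repaired before parsing, assuming the only defect is a bare result token placed directly after the last tag line (as in the record above). The function name and regex are illustrative and not part of any existing tool.

```python
import re

# A tag line ending in "]" followed immediately by a line that holds
# nothing but a game result (optionally surrounded by spaces or tabs).
_BARE_RESULT = re.compile(
    r'(\]\r?\n)[ \t]*(1-0|0-1|1/2-1/2|\*)[ \t]*(?=\r?\n|$)'
)


def insert_missing_blank_line(pgn_text):
    """Insert the blank line that should separate tags from movetext."""
    return _BARE_RESULT.sub(r'\1\n\2', pgn_text)


broken = '[TimeControl "51.5+0.51"]\n 1-0\n\n'
fixed = insert_missing_blank_line(broken)
```

Records that already have the blank line are left unchanged, because the pattern requires the result to sit on the line directly after a tag.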

I closed this by mistake

vondele commented 4 years ago

@Sopel97 that would be a cutechess bug, I assume.

Sopel97 commented 4 years ago

Yes, possibly something for a separate issue elsewhere. Just leaving a note here.

Overall the process of getting all the games is less painful now. After downloading the yearly zips it took only a few hours for a python script to merge all games into one file.

One more issue I noticed is that many pgn.bz2 files contain records of request errors, so I had to filter them out by only allowing files that start with '['. Overall there were about 220000 pgn.bz2 files inside.

Sopel97 commented 4 years ago

The file "2018.zip/2018/5b1302d70ebc5902a81676a5/5b1302d70ebc5902a81676a5-539.pgn.bz2" doesn't end with "\n\n", and is another one that caused problems... Actually it seems completely malformed, because it claims to have 108 plies but the last move is "34. Qf4 {+1.92/18 0.44s} Nf5 {-2.31/21 3.5s} ". No termination marker at the end of the movetext either...

vondele commented 4 years ago

I guess we're always just uploading what cutechess left behind, without checking. If the user killed cutechess while it was writing, I think we would return a truncated file.

Sopel97 commented 4 years ago

I'm finally done with the files up to May 2020.

2018 - 5bd026d10ebc592439f8ea0d-183.pgn missing round 44 - completely borked pgn around that

Some games had "*" or "?" in the result tag, but these had valid PGN structure.

Other than that, all problematic PGNs were caught by checking that the decompressed file starts with "[" and ends with "\n\n".
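The check described above can be sketched as follows. This is a reconstruction under stated assumptions, not the script actually used: the function name is hypothetical, and treating undecompressable data as a rejected file is an added choice.

```python
import bz2


def looks_like_complete_pgn(compressed):
    """Return True if the bz2 payload decompresses to text that starts
    with '[' and ends with a blank line."""
    try:
        data = bz2.decompress(compressed)
    except (OSError, ValueError):
        # OSError: not a bz2 stream; ValueError: stream was truncated.
        return False
    return data.startswith(b"[") and data.endswith(b"\n\n")


good = bz2.compress(b'[Event "x"]\n\n1. e4 e5 1-0\n\n')
error_page = bz2.compress(b"<html>request error</html>")
```

This catches stored request errors and files cut off mid-game, but not records that are malformed in the middle of an otherwise well-terminated file.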

Stats:

File size : 5520684555 (binary format without tags)
Games     : 43646346
Positions : 5659979883
Wins      : 8304453
Draws     : 29962898
Losses    : 5378995
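
For scale, the ratios implied by these figures can be checked with a few lines of Python. The numbers are copied from the stats above; the variable names are illustrative.

```python
file_size = 5520684555   # bytes, binary format without tags
games = 43646346
positions = 5659979883
wins, draws, losses = 8304453, 29962898, 5378995

# The three outcomes should account for every game.
assert wins + draws + losses == games

bytes_per_position = file_size / positions   # a little under 1 byte
positions_per_game = positions / games       # roughly 130
draw_rate = draws / games                    # a bit over two thirds

print(f"{bytes_per_position:.3f} bytes/position")
print(f"{positions_per_game:.1f} positions/game")
print(f"{draw_rate:.1%} draws")
```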

Considering how big of a pain this process was, I may post the Python scripts I used to process this data (they don't handle the severely malformed PGN files, though) if someone needs them.

vondele commented 4 years ago

nice.. great compression. BTW, I've been downloading the LTC pgns since ~January. python-chess parses them as far as I can tell (or at least no error messages I've seen). I do seem to have many more games (not so easy to count, since it is >500000 pgn files, not a database, but I guess that's more than 50M already). Makes me wonder if all files end up on the google drive.

Sopel97 commented 4 years ago

I got all files from google drive at the time (a week ago). Only about 10 thousand were filtered away. Around 10% of all the games on fishtest are LTC so 43 million does sound like less than it should be indeed. 500000 files should be around 100-110 million games. I would appreciate if they were made public. Also all the big problems seemed to be with 2018 files only, the rest is reasonably compliant.
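One way to sanity-check the 100-110 million estimate is to scale the games-per-file ratio of the Google Drive set up to 500000 files. The inputs are the approximate figures quoted in this thread; whether the estimate was originally derived this way is an assumption.

```python
games_from_drive = 43646346
files_on_drive = 220000      # "about 220000 pgn.bz2 files"
files_filtered = 10000       # "about 10 thousand were filtered away"

games_per_file = games_from_drive / (files_on_drive - files_filtered)
estimated_games = games_per_file * 500000

print(f"{games_per_file:.0f} games/file, "
      f"{estimated_games / 1e6:.0f} million games")
```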

python-chess is terribly slow, so I prefer regex for combining/filtering whenever possible (that's where PGN's guarantee about blank lines separating the tag section and the movetext section comes in handy) and my own software for the heavy lifting.
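
A minimal sketch of that regex-style approach, assuming well-formed input in which every tag section and every movetext section is followed by exactly one blank line and no comment contains a blank line. `iter_games` is a hypothetical helper, not the software mentioned above.

```python
import re

TAG_RE = re.compile(r'\[(\w+) "([^"]*)"\]')


def iter_games(pgn_text):
    """Yield (tags, movetext) pairs from a multi-game PGN string."""
    # Sections alternate: tag section, movetext, tag section, movetext...
    sections = [s for s in pgn_text.split("\n\n") if s.strip()]
    for tag_section, movetext in zip(sections[0::2], sections[1::2]):
        tags = dict(TAG_RE.findall(tag_section))
        yield tags, " ".join(movetext.split())


sample = (
    '[Result "1-0"]\n[Date "2018.11.13"]\n\n1. e4 e5\n2. Nf3 1-0\n\n'
    '[Result "1/2-1/2"]\n\n1. d4 d5 1/2-1/2\n\n'
)
games = list(iter_games(sample))
```

A record that lacks the blank line between tags and movetext would shift the pairing for every game after it, so it makes sense to filter or repair such files first.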

vondele commented 4 years ago

I'd like to make them available, but I don't quite know how to do it in practice. Right now, it is about 300 GB of data... I guess I could start by compressing it; that will help with eventually uploading it to a file server or so.

Sopel97 commented 4 years ago

After stripping comments and unnecessary tags (I don't think anything outside of Date and Result is important for older games) and compressing, it should take ~30 GB, I guess. Still too much for Google Drive, but at least much more manageable. Merging the files into a few bigger ones could help too; I noticed that compressors/decompressors don't really like that many files and the process is slower.
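
The stripping step could look roughly like this for a single game. It is a sketch under assumptions: `slim_game` is a hypothetical name, the default `keep` tuple also retains FEN (needed to replay games that start from a book position), and comments are assumed to be the brace-delimited kind without nesting.

```python
import re

TAG_LINE = re.compile(r'^\[(\w+) "(.*)"\]\s*$')
COMMENT = re.compile(r"\{[^}]*\}")


def slim_game(game_text, keep=("Date", "Result", "FEN")):
    """Drop unwanted tags and {...} comments from one PGN game."""
    kept_tags, move_lines = [], []
    for line in game_text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            if match.group(1) in keep:
                kept_tags.append(line.strip())
        elif line.strip():
            move_lines.append(line)
    movetext = COMMENT.sub("", " ".join(move_lines))
    movetext = " ".join(movetext.split())   # collapse leftover whitespace
    return "\n".join(kept_tags) + "\n\n" + movetext + "\n\n"


game = (
    '[Event "x"]\n[Date "2018.11.13"]\n[Result "1-0"]\n\n'
    "1. e4 {+0.20/18 0.44s} e5 {-0.10/17 0.5s} 1-0\n\n"
)
slimmed = slim_game(game, keep=("Date", "Result"))
```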

I will look through the games I have tomorrow and look for missing days. I'm not sure there's a good way of looking for missing tests unless there's a way to get all LTC test URLs easily.

Edit: this works for finding all LTC tests, I guess.

import re

import requests

# Captures the test ID from links of the form <a href="/tests/view/<id>".
test_link_regex = re.compile(r'<a href="/tests/view/(.*?)"')

def make_fishtest_tests_urls(test_ids):
    return ['https://tests.stockfishchess.org/tests/view/' + test_id for test_id in test_ids]

def get_test_ids(page_id):
    # Fetch one page of the finished-tests list, restricted to LTC tests.
    r = requests.get('https://tests.stockfishchess.org/tests/finished?page={}&ltc_only=1'.format(page_id))
    # Fail loudly on an HTTP error instead of treating it as an empty page.
    r.raise_for_status()
    content = r.content.decode('utf-8')
    return test_link_regex.findall(content)

def get_all_ltc_tests():
    tests = []

    # Walk the pages until one comes back without any test links.
    page_id = 1
    while True:
        ids = get_test_ids(page_id)

        print('Page: {}, Tests: {}'.format(page_id, len(ids)))

        if not ids:
            break

        urls = make_fishtest_tests_urls(ids)
        tests += urls

        page_id += 1

    return tests

# Write one test URL per line.
out_path = 'ltc_list.txt'
with open(out_path, 'w', encoding='utf-8') as file:
    tests = get_all_ltc_tests()
    for test in tests:
        file.write(test + '\n')

https://pastebin.com/KLh7SfnX

Looks like there are fewer tests than the displayed count claims.

Sopel97 commented 4 years ago

Out of 6345 LTC tests listed on stockfishchess.org/tests, 1346 are on Google Drive. That means 5039 are not. This is the list of them: https://pastebin.com/bGgkfaHa (numbers may vary slightly due to data being gathered at different times)
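
A comparison like this can be sketched as a set difference on test IDs. The sketch assumes archive paths follow the layout seen earlier in this thread (`<year>/<test id>/<test id>-<batch>.pgn.bz2`) and that test IDs are 24 hexadecimal characters; the function names are hypothetical.

```python
import re

ARCHIVE_ID = re.compile(r"([0-9a-f]{24})-\d+\.pgn")


def id_from_url(url):
    """Last path component of a .../tests/view/<id> URL."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def id_from_archive_path(path):
    """Test ID embedded in a batch file name, or None if absent."""
    match = ARCHIVE_ID.search(path)
    return match.group(1) if match else None


def missing_tests(listed_urls, archive_paths):
    """URLs of listed tests that have no batch file in the archive."""
    on_drive = {id_from_archive_path(p) for p in archive_paths}
    return sorted(u for u in listed_urls if id_from_url(u) not in on_drive)


listed = [
    "https://tests.stockfishchess.org/tests/view/5b1302d70ebc5902a81676a5",
    "https://tests.stockfishchess.org/tests/view/5be9239d0ebc595e0ae33406",
]
paths = ["2018/5b1302d70ebc5902a81676a5/5b1302d70ebc5902a81676a5-539.pgn.bz2"]
missing = missing_tests(listed, paths)
```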

tomtor commented 4 years ago

The MongoDB storage had been broken for at least 6 months, probably more. This was fixed with the server upgrade.

So that explains a part of the missing files.

If recent files are also missing, that should not happen.

Sopel97 commented 4 years ago

There are big gaps from 2019-08-09 to 2019-11-07 and from 2020-04-25 to 2020-06-06. I don't have data (yet) about how many LTC tests were run each day, so I cannot say whether tests are missing in other date ranges, but the large number of missing tests makes me think this is likely.
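
Gaps like these can be found by sorting the dates on which at least one game exists and flagging long jumps between neighbours. This is a sketch only: the 7-day threshold and the function name are arbitrary choices, and the sample dates merely mirror the first gap reported above.

```python
from datetime import date


def find_gaps(days, min_gap_days=7):
    """Return (last_day_before, first_day_after) for each long gap."""
    ordered = sorted(set(days))
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if (current - previous).days > min_gap_days:
            gaps.append((previous, current))
    return gaps


sample_days = [
    date(2019, 8, 8), date(2019, 8, 9),
    date(2019, 11, 7), date(2019, 11, 8),
]
gaps = find_gaps(sample_days)
```

The input dates could come from the Date tag of each game, parsed from the `YYYY.MM.DD` form used in the PGN files.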

vondele commented 4 years ago

@Sopel97 actually, I think the comments and so are rather useful, since that gives us about 1B positions analyzed... could turn out to be useful. I do have most of the games for 2020-04-25 to 2020-06-06.

alwey commented 4 years ago

Which cutechess-cli version has been used by fishtest to generate the stored notations? If it is a version older than January 2017 then there may be incorrectly truncated PGN output for games with 0 plies. Ilari fixed this.

If such output was generated with a newer version then cutechess-cli would have a new problem.

ppigazzini commented 3 years ago

@alwey fishtest started using cutechess-cli 1.2.0 in September 2020, see https://github.com/official-stockfish/books/pull/12