brilee opened this issue 6 years ago
How to upload data: we can retroactively parse all the full-debug log SGFs we have, and also update the selfplay workers to stream data into the table. We might consider separate tables for evaluation zoo games vs. selfplay workers; at 20x less data we could also just do the holdout games.
The dataset of evaluation games should be much smaller and much easier to make available in its entirety.
It seems like we're duplicating per-game info (worker_id, completed_time, model_num, result, was_resign, resign_threshold) in the per-move table. Could we instead generate a unique key for each game and store that in the move table?
Agreed, we're duplicating some of the info. I think (worker_id, completed_time) should form a primary key, and model_num is too useful; may as well denormalize that. Result/resign/resign threshold could be normalized, but I suspect those are too useful to want to join all the time.
@tommadams what do you think this should look like, given that you would have to reimplement some form of this in the C++ version?
Well at that point there's probably not much to gain by normalizing the remaining data because it sounds like the primary game key would be almost as large as the data we're normalizing. So storing everything per-move does make sense.
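Concretely, the denormalized per-move table might look something like this in BigQuery's JSON schema format. The game-level fields come from the thread above; the move-level fields (move_num, move, q_value) are placeholders, not a final design:

```json
[
  {"name": "worker_id",        "type": "STRING",    "mode": "REQUIRED"},
  {"name": "completed_time",   "type": "TIMESTAMP", "mode": "REQUIRED"},
  {"name": "model_num",        "type": "INTEGER",   "mode": "REQUIRED"},
  {"name": "result",           "type": "STRING",    "mode": "NULLABLE"},
  {"name": "was_resign",       "type": "BOOLEAN",   "mode": "NULLABLE"},
  {"name": "resign_threshold", "type": "FLOAT",     "mode": "NULLABLE"},
  {"name": "move_num",         "type": "INTEGER",   "mode": "REQUIRED"},
  {"name": "move",             "type": "STRING",    "mode": "NULLABLE"},
  {"name": "q_value",          "type": "FLOAT",     "mode": "NULLABLE"}
]
```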
Your schema sounds good to me. I wonder what the most reasonable import format is. I'm somewhat surprised/disappointed that BigQuery doesn't support proto as far as I can tell.
Right, BQ doesn't support protos because they are not self-describing. Avro is a proto-like binary format which is self-describing, and BQ does support Avro. Other import options I'm familiar with are JSON and CSV. (CSV will be utterly incapable of handling the arrays we'd be generating)
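As a sketch of the JSON route: newline-delimited JSON handles array-valued fields (which map onto BigQuery REPEATED columns, exactly the thing CSV can't express). Field names here are illustrative, not the final schema:

```python
import json

def game_to_ndjson(game):
    """Serialize one game as a newline-delimited JSON row.

    Array-valued fields map onto BigQuery REPEATED columns.
    All field names here are illustrative, not a final schema.
    """
    row = {
        "worker_id": game["worker_id"],
        "completed_time": game["completed_time"],
        "model_num": game["model_num"],
        "result": game["result"],
        "moves": game["moves"],        # -> REPEATED STRING column
        "q_values": game["q_values"],  # -> REPEATED FLOAT column
    }
    return json.dumps(row)

game = {
    "worker_id": "worker-0001",
    "completed_time": "2018-01-01T00:00:00Z",
    "model_num": 17,
    "result": "B+R",
    "moves": ["dd", "pp"],
    "q_values": [0.012, -0.034],
}
line = game_to_ndjson(game)
```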
find sgf/ -path '*/full/*sgf' | xargs grep -o '];[BW]\[[a-s]\{2\}\]C\[-\?[0-9.]\{6\}*$' | tqdm | awk -F ":" '{ if ($1 != cf) { print cf": "tok; cf = $1; tok = ""; } } { sub(/^.*\[/, "", $2); tok = tok" "$2; } END { print cf": "tok }' | tee game_qs | wc
find sgf/ -path '*/full/*sgf'
find all sgf files
xargs grep -o '];[BW]\[[a-s]\{2\}\]C\[-\?[0-9.]\{6\}*$'
find all comments after a move (note this doesn't find the first Q on line 4)
tqdm
To go fast one must measure
awk -F ":" '{ if ($1 != cf) { print cf": "tok; cf = $1; tok = ""; } } { sub(/^.*\[/, "", $2); tok = tok" "$2; } END { print cf": "tok }'
Split each line on the colon between filename and match; accumulate the Q values for each file, and print them as one line per game when the filename changes.
tee game_qs
temp holding file till these get imported into dataframes or bigquery
wc
verify assumptions about number of moves...
Compared to Python file reading and parsing, this command runs at light speed: I process roughly 200 games/second, so all of v5 in an hour. The limiting factor is HDD reads :/
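For reference, a rough Python equivalent of the extraction step (same pattern as the grep, minus the end-of-line anchor; the sample SGF string is made up):

```python
import re

# A move node like ";B[dd]" followed by a comment opening with a Q value,
# preceded by the "]" closing the previous node -- same shape as the grep above.
MOVE_Q_RE = re.compile(r"\];([BW])\[([a-s]{2})\]C\[(-?[0-9.]{6})")

def extract_qs(sgf_text):
    """Return (move, q) pairs from one full-debug SGF string."""
    return [(move, q) for _color, move, q in MOVE_Q_RE.findall(sgf_text)]

sample = "(;GM[1];B[dd]C[0.1234];W[pp]C[-0.0123])"
```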
Changing the title to more accurately reflect the needed work
I'm working on this as part of minigo-pub VM
@sethtroisi @brilee do we still want to do anything with this?
This will probably start as a private BQ dataset; we'll have to consult on how to offer the data publicly, and on how to let the general public query a ~1TB dataset for free-ish.
General schema of dataset:
This allows queries like "select model_num, first 5 moves of game, count(*) group by model_num, moves", which would show the most popular opening by model number. Or something like "select mcts_visit_counts_normalized[array_ordinal(move_num)] / policy_prior[array_ordinal(move_num)]", which would let you rank moves by how unexpected they were.
(And by bigquery magic, these queries would probably complete in a few minutes, instead of the hour-long analyses we've been running so far...)
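The first of those queries might be sketched in BigQuery Standard SQL roughly as follows (table and column names are assumptions, not a final schema):

```sql
-- Most popular 5-move opening per model; `minigo.games` and `moves` are
-- illustrative names for a per-game table with a REPEATED moves column.
SELECT
  model_num,
  ARRAY_TO_STRING(
    ARRAY(SELECT m FROM UNNEST(moves) AS m WITH OFFSET i
          WHERE i < 5 ORDER BY i),
    ' ') AS opening,
  COUNT(*) AS n
FROM `minigo.games`
GROUP BY model_num, opening
ORDER BY model_num, n DESC;
```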