Initial database:
CREATE TABLE IF NOT EXISTS runs (
user TEXT,
code TEXT,
day INTEGER,
part INTEGER,
time REAL,
answer INTEGER,
answer2
);
CREATE TABLE IF NOT EXISTS solutions (
key TEXT,
day INTEGER,
part INTEGER,
answer INTEGER,
answer2
);
This is undocumented, doesn't use non-null types, and doesn't use strict mode or relational mappings; we can do much better. answer2 is even untyped in both of the tables. Since these tables are in wobbly mode it probably just falls back to some default affinity, but I'll have to consult the docs for that.
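For reference, a quick throwaway check of what untyped columns actually do in non-strict mode (affinity_test is a made-up table): a column with no declared type gets BLOB affinity, so values are stored exactly as inserted, with no coercion.
CREATE TABLE affinity_test (answer2);
INSERT INTO affinity_test VALUES (42), ('42'), (x'2a');
/* yields integer, text, blob: no declared type means BLOB affinity,
   so each value keeps whatever type it was inserted with */
SELECT answer2, typeof(answer2) FROM affinity_test;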
Initial pass (no refactor yet):
CREATE TABLE IF NOT EXISTS runs (
user TEXT NOT NULL,
code TEXT NOT NULL, /* submitted code, should probably be a compressed BLOB tbh */
day INTEGER NOT NULL,
part INTEGER NOT NULL,
time INTEGER NOT NULL, /* either ns or ps resolution, initially used ns */
answer TEXT NOT NULL
/* answer2 never used in code, just assigned same value as answer */
) STRICT;
CREATE TABLE IF NOT EXISTS solutions (
key TEXT NOT NULL, /* seems to point to a filename */
day INTEGER NOT NULL,
part INTEGER NOT NULL,
answer TEXT NOT NULL
/* answer2 similarly useless and seemingly placed to get around INTEGER for the answer section */
) STRICT;
Tooling that may or may not make our lives easier:
I would argue that the code itself doesn't belong in the DB, but until we actually see more than tens of thousands of rows, it probably doesn't matter much, so long as we don't SELECT it when we don't need it.
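Concretely, that just means enumerating the cheap columns instead of reaching for SELECT *; a hypothetical read against the table above:
/* the large `code` column is never touched, so the read stays cheap */
SELECT user, day, part, time FROM runs WHERE day = 1 AND part = 1;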
One possibility in the back of my mind for 0.5 and later is the ability to store/query longer build/run logs, but it's possible we don't want to allow that for anti-cheat reasons. Not committed to anything specific for that just yet, but I did want to get it out there that such a thing was a possibility.
I tend to stay away from ORMs, since they're the first thing to go when performance is not good, and SQLite has very forgiving type signatures anyway.
Next iteration:
/*
This will be used to lookup runs for the leaderboard,
it is a mirror of data already in `runs`
*/
CREATE TABLE IF NOT EXISTS best_runs (
user TEXT NOT NULL,
day INTEGER NOT NULL,
part INTEGER NOT NULL,
best_time INTEGER NOT NULL,
/* if the run is deleted, also delete it from the best_runs */
run_id INTEGER NOT NULL REFERENCES runs (run_id) ON DELETE CASCADE,
UNIQUE (day, part, best_time, user)
) STRICT;
CREATE TABLE IF NOT EXISTS runs (
run_id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
user TEXT NOT NULL,
code TEXT NOT NULL, /* submitted code, should probably be a compressed BLOB tbh */
day INTEGER NOT NULL,
part INTEGER NOT NULL,
run_time INTEGER NOT NULL, /* either ns or ps resolution, initially used ns */
answer TEXT NOT NULL
) STRICT;
/*
table for runs that were deleted from the runs table as they had wrong answers;
it's a little inefficient to delete from the runs table, but it makes the leaderboard schema simpler,
and we should also move to a tableref for the submissions' code, to avoid the penalties of moving them
*/
CREATE TABLE IF NOT EXISTS invalid_runs (
original_id INTEGER NOT NULL,
user TEXT NOT NULL,
code TEXT NOT NULL,
day INTEGER NOT NULL,
part INTEGER NOT NULL
) STRICT;
/* replaced solutions table with general table of inputs that might have solutions */
CREATE TABLE IF NOT EXISTS inputs (
day INTEGER NOT NULL,
/* provides an ordering of inputs per day */
local_id INTEGER NOT NULL,
input BLOB NOT NULL,
/* the validated answers are optional fields up until verified outputs are available */
answer_p1 TEXT,
answer_p2 TEXT,
UNIQUE (day, local_id)
) STRICT;
Agh, there's problems with this design too, I'll get to it later.
I tend to stay away from ORMs, since they're the first thing to go when performance is not good, and SQLite has very forgiving type signatures anyway.
Fair enough, I've worked in environments where that sort of manual control was crucial to operations.
This does not change my recommendations in the slightest. If you're curious why, I'd recommend this chapter of the tutorial specifically or the SQLAlchemy Core reference.
So I've realized I forgot to add a year column; probably gonna add that. Also, day+part could (should?) be compressed: it's only 50 states, so packing it into a single INTEGER would save us 2 bytes, indexing performance would slightly increase, and we have less chance of messing up schema defs (since day+part are now intrinsically linked).
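As a sketch, one cheap-to-unpack encoding (hypothetical; the exact packing isn't fixed anywhere in this thread) folds day 1..25 and part 1..2 into day_part = day * 2 + (part - 1):
SELECT 12 * 2 + (2 - 1) AS day_part;   /* pack day 12, part 2 -> 25 */
SELECT day_part / 2 AS day,            /* integer division recovers the day */
       day_part % 2 + 1 AS part        /* low bit recovers the part */
FROM (SELECT 25 AS day_part);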
The iterations keep iterating!
/*
This will be used to lookup runs for the leaderboard,
it is a mirror of data already in `runs`
*/
CREATE TABLE IF NOT EXISTS best_runs (
user TEXT NOT NULL,
/*
adding a year field; we could also pack this into day_part and
even compress it such that 2015 = 0, 2016 = 1 and so on (the "AOC epoch").
if packed with day_part, all 2015 to 2019 identifiers become one byte, and 2 bytes
gets us over 1310 years of AOC! (this also means we normally store 3 bytes total
instead of the current 2 bytes taken by day-part, and 3 bytes by year (assuming unpacked repr))
*/
year INTEGER NOT NULL,
/* packing day and part into one int, this also happens to be efficient to unpack */
day_part INTEGER NOT NULL,
best_time INTEGER NOT NULL,
/* if the run is deleted, also delete it from the best_runs */
run_id INTEGER NOT NULL REFERENCES runs (run_id) ON DELETE CASCADE,
UNIQUE (year, day_part, best_time, user)
) STRICT;
CREATE TABLE IF NOT EXISTS runs (
run_id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
user TEXT NOT NULL,
code TEXT NOT NULL, /* submitted code, should probably be a compressed BLOB tbh */
year INTEGER NOT NULL,
day_part INTEGER NOT NULL,
run_time INTEGER NOT NULL, /* either ns or ps resolution, initially used ns */
answer TEXT NOT NULL
) STRICT;
/*
table for runs that were deleted from the runs table as they had wrong answers;
it's a little inefficient to delete from the runs table, but it makes the leaderboard schema simpler,
and we should also move to a tableref for the submissions' code, to avoid the penalties of moving them
*/
CREATE TABLE IF NOT EXISTS invalid_runs (
original_id INTEGER NOT NULL,
user TEXT NOT NULL,
code TEXT NOT NULL,
year INTEGER NOT NULL,
day_part INTEGER NOT NULL
) STRICT;
/* replaced solutions table with general table of inputs that might have solutions */
CREATE TABLE IF NOT EXISTS inputs (
year INTEGER NOT NULL,
/* don't use day-part here since inputs are the same for one day */
day INTEGER NOT NULL,
/* provides an ordering of inputs per day */
local_id INTEGER NOT NULL,
input BLOB NOT NULL,
/* the validated answers are optional fields up until verified outputs are available */
answer_p1 TEXT,
answer_p2 TEXT,
UNIQUE (year, day, local_id)
) STRICT;
I have some questions, but do you want them now or after you're done iterating?
I have some questions, but do you want them now or after you're done iterating?
Bring them as you find them; I'm probably gonna keep iterating as we define more goals/features.
Some non-leaderboard-related stuff:
CREATE TABLE IF NOT EXISTS guild_config (
guild_id TEXT NOT NULL,
config_name TEXT NOT NULL,
/*
Not making this a BLOB potentially caps us from using msgpack if we need to,
but I don't see us needing such complicated config? config_name='prefix' config_value='='
is an expected use, and if we need a more advanced system down the road we can still use JSON;
leaving open to discussion
*/
config_value TEXT NOT NULL,
UNIQUE (guild_id, config_name)
) STRICT;
Sorry, kept forgetting to come back to this. Thoughts:
Why is having a dedicated table for invalid runs better than having an is_valid field in the existing table (possibly throw in an is_validated field)?
The guild_config table is pretty much exactly as I would expect. Kudos.
Do you have any thoughts around managing schema changes? I'm very partial to migration tools, but maybe there's a lower-ceremony way of handling schema changes?
You have separate fields for year and daypart. Why is this middle-ground better than splitting all 3 fields (year, day, part), or a single combined field for yeardaypart?
It is more performant to have year-day-part as one single item, but I held off on applying it immediately: as we pack more data it becomes less and less human-readable and understandable, and unless we apply year squashing as well (where 0 becomes 2015, 1 becomes 2016, etc., the "AOC epoch") it doesn't save much in terms of storage.
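For concreteness, a hypothetical full packing with that AOC epoch (purely an illustration; this encoding is never adopted in the thread): ydp = (year - 2015) * 50 + (day - 1) * 2 + (part - 1).
SELECT (2019 - 2015) * 50 + (25 - 1) * 2 + (2 - 1) AS ydp;  /* 249: 2019 day 25 part 2, still one byte */
SELECT ydp / 50 + 2015 AS year,
       ydp % 50 / 2 + 1 AS day,
       ydp % 2 + 1 AS part
FROM (SELECT 249 AS ydp);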
Why is having a dedicated table for invalid runs better than having an is_valid field in the existing table (possibly throw in an is_validated field)?
I had a thought of having the best_runs table self-manage itself in SQL as much as possible (the ON DELETE CASCADE in it means removing a run from the runs table automatically invalidates it from best_runs), but considering we actually do need to do a lot of best_runs management on the code side anyway, after considering it more it is probably best to just nuke the table and use an is_valid field, yeah.
The inputs table has a blob. Is this for the raw input file? I thought DBMSs were kinda bad at large values? How is this better than keeping the input file on the filesystem?
The problem generally is when you are doing row scans and need to skip over a large amount of data. To amend this, what I normally do is use a sacrificial table that is basically (primary_key, datablob), and then store that primary key in the main table. For the inputs table this doesn't really matter, since we already have a year-day btree to look up the exact row we need (and crucially the blob is not part of any of the indexes). Also, AOC inputs are never over ~20kb, and I think we should be compressing the inputs anyway since it's all ASCII and very, very compressible.
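A minimal sketch of that sacrificial-table pattern, with made-up table and column names:
/* large values live behind one indirection, so scans of the main table stay cheap */
CREATE TABLE IF NOT EXISTS blob_store (
    blob_id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
    data BLOB NOT NULL
) STRICT;
CREATE TABLE IF NOT EXISTS scanned_rows (
    name TEXT NOT NULL,  /* small, frequently scanned columns */
    data_id INTEGER NOT NULL REFERENCES blob_store (blob_id)
) STRICT;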
Where does the Local ID for an input come from? Is it a stable identifier?
The old bot has some notion of Input 1/2/3; this is the same (so yes, a stable identifier). We can make it so the configuration of session cookies matches up with the identifiers too, i.e. [sess1, sess2, sess3] matches up to inputs 1, 2, 3, which can be used by the automatic validation code.
Removed local_id and replaced it with the token key the config file uses.
Removed invalid_runs and just added a validity tag to runs.
/*
This will be used to lookup runs for the leaderboard,
it is a mirror of data already in `runs`
*/
CREATE TABLE IF NOT EXISTS best_runs (
user TEXT NOT NULL,
/*
adding a year field; we could also pack this into day_part and
even compress it such that 2015 = 0, 2016 = 1 and so on (the "AOC epoch").
if packed with day_part, all 2015 to 2019 identifiers become one byte, and 2 bytes
gets us over 1310 years of AOC! (this also means we normally store 3 bytes total
instead of the current 2 bytes taken by day-part, and 3 bytes by year (assuming unpacked repr))
*/
year INTEGER NOT NULL,
/* packing day and part into one int, this also happens to be efficient to unpack */
day_part INTEGER NOT NULL,
best_time INTEGER NOT NULL,
/* if the run is deleted, also delete it from the best_runs */
run_id INTEGER NOT NULL REFERENCES runs (run_id) ON DELETE CASCADE,
UNIQUE (year, day_part, best_time, user)
) STRICT;
CREATE TABLE IF NOT EXISTS runs (
run_id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
user TEXT NOT NULL,
code TEXT NOT NULL, /* submitted code, should probably be a compressed BLOB tbh */
year INTEGER NOT NULL,
day_part INTEGER NOT NULL,
run_time INTEGER NOT NULL, /* either ns or ps resolution, initially used ns */
answer TEXT NOT NULL,
/* whether or not the run's answer was valid, treated as boolean */
valid INTEGER NOT NULL DEFAULT ( 1 )
) STRICT;
/* replaced solutions table with general table of inputs that might have solutions */
CREATE TABLE IF NOT EXISTS inputs (
year INTEGER NOT NULL,
/* don't use day-part here since inputs are the same for one day */
day INTEGER NOT NULL,
/* provides an ordering of inputs per day */
session_label TEXT NOT NULL,
/* as mentioned above, we should be compressing this somehow */
input BLOB NOT NULL,
/* the validated answers are optional fields up until verified outputs are available */
answer_p1 TEXT,
answer_p2 TEXT,
UNIQUE (year, day, session_label)
) STRICT;
Removed some development comments and appended the GLS stuff. I'm gonna spin up a SQLite instance with this schema and see what needs to be tweaked in regard to indexing.
/* ensures REFERENCES fields are valid */
PRAGMA foreign_keys = ON;
/*
This will be used to lookup runs for the leaderboard,
it is a mirror of data already in `runs`
*/
CREATE TABLE IF NOT EXISTS best_runs (
user TEXT NOT NULL,
year INTEGER NOT NULL,
/* packing day and part into one int, this also happens to be efficient to unpack */
day_part INTEGER NOT NULL,
best_time INTEGER NOT NULL,
run_id INTEGER NOT NULL REFERENCES runs (run_id),
CONSTRAINT best_runs_index UNIQUE (year, day_part, best_time, user)
) STRICT;
CREATE TABLE IF NOT EXISTS runs (
run_id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
user TEXT NOT NULL,
code BLOB NOT NULL, /* submitted code as gzipped blob */
year INTEGER NOT NULL,
day_part INTEGER NOT NULL,
run_time INTEGER NOT NULL, /* ps resolution, originally used ns */
answer TEXT NOT NULL,
/* whether or not the run's answer was valid, treated as boolean */
valid INTEGER NOT NULL DEFAULT ( 1 ),
bencher_version INTEGER NOT NULL REFERENCES container_versions (id)
) STRICT;
CREATE INDEX IF NOT EXISTS runs_index ON runs (year, day_part, valid, user, run_time);
/* general table of inputs that might have solutions */
CREATE TABLE IF NOT EXISTS inputs (
year INTEGER NOT NULL,
/* don't use day-part here since inputs are the same for one day */
day INTEGER NOT NULL,
/* provides an ordering of inputs per day */
session_label TEXT NOT NULL,
/* gzip compressed input */
input BLOB NOT NULL,
/* the validated answers are optional fields up until verified outputs are available */
answer_p1 TEXT,
answer_p2 TEXT,
CONSTRAINT inputs_lookup UNIQUE (year, day, session_label)
) STRICT;
CREATE TABLE IF NOT EXISTS container_versions (
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
/* rustc version used as output by `rustc --version` */
rustc_version TEXT NOT NULL,
container_version TEXT NOT NULL UNIQUE,
/*
a high level indicator of the benchmarking setup used,
this should be incremented whenever the way the bencher
benches code changes in a way that affects results
*/
benchmark_format INTEGER NOT NULL,
/*
gzipped tar archive of the default bencher workspace, including
Cargo.toml, Cargo.lock, and any rs files that were run
*/
bench_directory BLOB NOT NULL
) STRICT;
CREATE TABLE IF NOT EXISTS guild_config (
guild_id TEXT NOT NULL,
config_name TEXT NOT NULL,
config_value TEXT NOT NULL,
CONSTRAINT single_guild_config UNIQUE (guild_id, config_name)
) STRICT;
/* Storage for known wrong answers for an input; can be cleared when a part has a verified answer inserted */
CREATE TABLE IF NOT EXISTS wrong_answers (
year INTEGER NOT NULL,
day_part INTEGER NOT NULL,
session_label TEXT NOT NULL,
answer TEXT NOT NULL
) STRICT;
CREATE INDEX IF NOT EXISTS wrong_answers_cache ON wrong_answers (year, day_part, session_label, answer);
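For context, the lookup this covering index serves might look like the following (a hypothetical query; parameters are placeholders):
/* answered entirely from the index: is this answer already known to be wrong? */
SELECT EXISTS (
    SELECT 1 FROM wrong_answers
    WHERE year = ? AND day_part = ? AND session_label = ? AND answer = ?
);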
OOPS! Code is submitted to the DB many times (read: copied) for a single submission, depending on the number of inputs. Since the old bot handled verified answers with a sledgehammer of some kind, keeping track of the input that was used (session_label) was not accounted for, and we forgot to account for it here up until I started implementing this impossible API.
The obvious fix here is to rewrite runs a little bit to include the session_label it was run with, but that raises the question of why we are taking the best run out of any input instead of an average over the N inputs we run (currently getting lucky is so, so easy; this explains the dependency between what people's reported median was and their actual leaderboard placings).
I'm going to keep writing the schema migration (what I'm currently doing), but I'll be back with a slightly altered design for runs and how we aggregate results, so that we preserve each individual answer but aggregate the median times into one average.
The obvious fix here is to rewrite runs a little bit to include the session_label it was run with, but that raises the question of why we are taking the best run out of any input instead of an average over the N inputs we run (currently getting lucky is so, so easy; this explains the dependency between what people's reported median was and their actual leaderboard placings).
Actually, I'd say that Submission and Run should be two separate tables. Run can then reference Input and Submission, and we can track extra data/stats about each submission in a single spot (possibly starting with some timestamps around processing states).
Yeah, I was thinking of something like that. I just finished adding some typehints (read: sanity destroyer), so I'll need to get to designing that stuff tomorrow though.
After revision (submissions and runs are now separate tables):
/* ensures REFERENCES fields are valid */
PRAGMA foreign_keys = ON;
/*
This will be used to lookup runs for the leaderboard,
it is a mirror of data already in `runs`
*/
CREATE TABLE IF NOT EXISTS best_runs (
user TEXT NOT NULL,
year INTEGER NOT NULL,
/* packing day and part into one int, this also happens to be efficient to unpack */
day_part INTEGER NOT NULL,
best_time INTEGER NOT NULL,
run_id INTEGER NOT NULL REFERENCES submissions (submission_id),
CONSTRAINT best_runs_index UNIQUE (year, day_part, best_time, user)
) STRICT;
CREATE TABLE IF NOT EXISTS submissions (
submission_id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
user TEXT NOT NULL,
year INTEGER NOT NULL,
day_part INTEGER NOT NULL,
/*
used on leaderboard, ps resolution, this is an aggregate of benchmark_runs.average_time
initially it is NULL before all benchmark_runs are complete
*/
average_time INTEGER DEFAULT NULL,
/* gzipped code submission */
code BLOB NOT NULL,
/* whether or not this run is considered valid, treated as bool */
valid INTEGER NOT NULL DEFAULT ( 1 ),
submitted_at INTEGER NOT NULL DEFAULT ( UNIXEPOCH() ),
bencher_version INTEGER NOT NULL REFERENCES container_versions (id)
) STRICT;
CREATE INDEX IF NOT EXISTS submissions_index ON submissions (year, day_part, valid, user, average_time);
CREATE TABLE IF NOT EXISTS benchmark_runs (
submission INTEGER NOT NULL REFERENCES submissions (submission_id),
/* the session_label of the input that this was run on */
session_label TEXT NOT NULL,
average_time INTEGER NOT NULL,
answer TEXT NOT NULL,
completed_at INTEGER NOT NULL DEFAULT ( UNIXEPOCH() )
) STRICT;
/* by also placing the answer here, we can use a covering index when we are just doing answer checking */
CREATE INDEX IF NOT EXISTS benchmark_runs_index ON benchmark_runs (submission, session_label, answer);
/* general table of inputs that might have solutions */
CREATE TABLE IF NOT EXISTS inputs (
year INTEGER NOT NULL,
/* don't use day-part here since inputs are the same for one day */
day INTEGER NOT NULL,
/* provides an ordering of inputs per day */
session_label TEXT NOT NULL,
/* gzip compressed input */
input BLOB NOT NULL,
/* the validated answers are optional fields up until verified outputs are available */
answer_p1 TEXT,
answer_p2 TEXT,
CONSTRAINT inputs_lookup UNIQUE (year, day, session_label)
) STRICT;
CREATE TABLE IF NOT EXISTS container_versions (
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
/* rustc version used as output by `rustc --version` */
rustc_version TEXT NOT NULL,
container_version TEXT NOT NULL UNIQUE,
/*
a high level indicator of the benchmarking setup used,
this should be incremented whenever the way the bencher
benches code changes in a way that affects results
*/
benchmark_format INTEGER NOT NULL,
/*
gzipped tar archive of the default bencher workspace, including
Cargo.toml, Cargo.lock, and any rs files that were run
*/
bench_directory BLOB NOT NULL
) STRICT;
CREATE TABLE IF NOT EXISTS guild_config (
guild_id TEXT NOT NULL,
config_name TEXT NOT NULL,
config_value TEXT NOT NULL,
CONSTRAINT single_guild_config UNIQUE (guild_id, config_name)
) STRICT;
CREATE TABLE IF NOT EXISTS wrong_answers (
year INTEGER NOT NULL,
day_part INTEGER NOT NULL,
session_label TEXT NOT NULL,
answer TEXT NOT NULL
) STRICT;
CREATE INDEX IF NOT EXISTS wrong_answers_cache ON wrong_answers (year, day_part, session_label, answer);
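As a usage sketch, the average_time aggregate described in the submissions comments could be rolled up with something like this (illustrative only, not a settled query):
/* fold per-input benchmark times into the submission-level average */
UPDATE submissions
SET average_time = (SELECT CAST(AVG(average_time) AS INTEGER)
                    FROM benchmark_runs
                    WHERE benchmark_runs.submission = submissions.submission_id)
WHERE submission_id = ?;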
CREATE INDEX IF NOT EXISTS submissions_index ON submissions (year, day_part, valid, user, average_time);
Does column ordering in an index matter? If it does, does putting valid last have any significant impact?
Column ordering matters in an index; it determines what gets compared first in the btree. By putting valid next to year and day_part, queries for the leaderboard cache can also include only valid submissions during the initial btree scan.
I'll run a query plan to validate whether changing where valid is placed affects perf, though I assume it does.
Sorry, I was not fully awake when I reviewed these indexes. User and average_time are less useful than valid, even for internal analytics & answer checking. I'm now wondering if we want user and average_time in the index at all (if there's too many other queries going on, it may be worth making everyone suffer by reading the index and then reading the table for the full row, vs having a covering index for just the LB). That said, it's easy enough to change indexes with db migration tooling once we have that setup, so if there's no easy answer to have now, we can punt and figure this out later.
This is with the current index; it does the search with valid as a key, though I suspect we may be able to place it after user as well, and double the use of the index so that fetching all submissions from a given user is also fast.
sqlite> EXPLAIN QUERY PLAN SELECT user, MIN(average_time) FROM submissions WHERE (year = ? AND day_part = ? AND valid = 1) GROUP BY user ORDER BY average_time;
QUERY PLAN
|--SEARCH submissions USING COVERING INDEX submissions_index (year=? AND day_part=? AND valid=?)
`--USE TEMP B-TREE FOR ORDER BY
This is after moving valid after user but before average_time. As I suspected, it doesn't affect this individual query, because it scans every user range and only needs to find the valid half; it doesn't use valid in the initial scan now though.
sqlite> EXPLAIN QUERY PLAN SELECT user, MIN(average_time) FROM submissions WHERE (year = ? AND day_part = ? AND valid = 1) GROUP BY user ORDER BY average_time;
QUERY PLAN
|--SEARCH submissions USING COVERING INDEX submissions_index (year=? AND day_part=?)
`--USE TEMP B-TREE FOR ORDER BY
Surprisingly, placing valid last still builds the same query plan; it probably just scans for the min time within each user until it finds one that is valid. The NGQP does black magic, but I think this is technically slower (i.e. if a user has a ton of invalid runs that are the fastest, the engine has to keep searching from the min time until it finds a valid run).
sqlite> EXPLAIN QUERY PLAN SELECT user, MIN(average_time) FROM submissions WHERE (year = ? AND day_part = ? AND valid = 1) GROUP BY user ORDER BY average_time;
QUERY PLAN
|--SEARCH submissions USING COVERING INDEX submissions_index (year=? AND day_part=?)
`--USE TEMP B-TREE FOR ORDER BY
I tried running an EXPLAIN on the last two indexes I tested to see what the differences were, but there were none I could see with the help of diff.
I thought this would be funny, and it was: because the last field of an index is a hidden rowid/primary key, we get primary keys in the covering index lol
sqlite> EXPLAIN QUERY PLAN SELECT submission_id, user, MIN(average_time) FROM submissions WHERE (year = ? AND day_part = ? AND valid = 1) GROUP BY user ORDER BY average_time;
QUERY PLAN
|--SEARCH submissions USING COVERING INDEX submissions_index (year=? AND day_part=? AND valid=?)
`--USE TEMP B-TREE FOR ORDER BY
OK, sounds like it doesn't make too much of a difference for now, but what you've got is probably correct. Carry on.
This is a tracking issue for designs for the storage solution, ranging from tools to use to specific schemas and constraints we can exploit to get high performance under any load