Open thejetou opened 4 years ago
@scheibo So I finally had time to work on database migrations, and talked them over with JetOU. JetOU thought that migrating to SQLite was unnecessary and he wanted to migrate to Postgres instead; I said I was tentatively okay with it, but it had to be done on Modlog and not Punishments because I didn't want Punishments on Postgres.
SQLite vs. Postrgres
I have nothing against going right to Postgres, but SQLite can easily handle modlogs 'scale'. PM logs are considerably larger and are plenty fast on SQLite (though theyre not doing stupid table scans lmao).
What happened to making the persistence pluggable? I'm still fairly confident theres a level of abstraction where Postgres vs. Sqlite vs. Files is just configurable...
Proposed Query
Yeah... this effectively is getting us pretty much zero value over flat files and would in fact likely be slower. If you actually want to improve modlog performance you want to add a better index than just room id and actually use it, though this is going to probably involve changing how the modlog search command is structured
I didn't want Punishments on Postgres.
Why?
I have nothing against going right to Postgres, but SQLite can easily handle modlogs 'scale'. PM logs are considerably larger and are plenty fast on SQLite (though theyre not doing stupid table scans lmao).
I told him that, but he said it was just nicer to have all the login and sim stuff in a centralized database, which I guess I don't disagree with. Would be nice to have modlog in userpages.
Proposed Query
I didn't see that. Yeah, that query isn't going to be fast.
Maybe start off with some username queries?
Why [not Punishments on Postgres]?
I preferred a migration that would make more of a difference. I think Punishments on Postgres might actually make it slower (because it's going over the network) rather than being local.
My intuition is that SQLite will be faster than Postgres for a database of this size. The reason is that a Postgres query always involves inter-process communication (incl. serialization overheads), whereas SQLite operates in-process.
Other advantages of SQLite over Postgres include:
Some of this would be mitigated by having pluggable persistence but then you run into the issue of potentially running different code in testing than in production.
As for the schema/query
bigint
room_id
+ timestamp
, since you are sorting on timestamp
Note that the Postgres query would be over-network (it would be hosted on a different server). It's more than just IPC overhead.
Also note that jetou said he didn't care about "better cross-platform story" because he planned to continue maintaining flat files for side servers.
What happened to making the persistence pluggable? I'm still fairly confident theres a level of abstraction where Postgres vs. Sqlite vs. Files is just configurable...
Yes, I do plan to make it configurable, but you still need to write code to support the option.
Also, RE: SQLite vs Postgres
The main reason I want Postgres, is as Zarel said, to share the same database with the client. Though over network doesn't sound too nice, so I might have to concede..
OK, schema and query, take two!
Schema:
CREATE TABLE modlog (
-- UNIX timestamp
timestamp timestampz NOT NULL,
room_id TEXT NOT NULL,
-- ROOMBAN, NOTE etc.
action TEXT NOT NULL,
user_id TEXT,
autoconfirmed TEXT,
alts TEXT[],
ip TEXT,
-- Naming might be a bit poor here
logged_by TEXT,
note TEXT
);
CREATE INDEX ml_index_1 ON modlog(room_id, timestamp);
-- If we're searching by user, we're going to look up all of these cols
CREATE INDEX ml_index_2 ON modlog(user_id, alts, autoconfirmed);
-- Optimizing the full text search
CREATE INDEX ml_index_3 USING GIN ON modlog(tsvector(note));
Query:
SELECT *
FROM modlog
WHERE
room_id = $1 AND $2 = ANY(alts)
AND to_tsvector(note) @@ to_tsquery($3);
ORDER BY timestamp DESC
LIMIT $4
NOTE: This will make the modlog command's syntax be: /modlog [roomid], user=[user], ip=[ip], note=[note]
, instead of a full text search of the entire log.
Do Postgres/SQLite support fulltext search? If so, maybe we could have a text column containing all the parameters, and then fulltext search that?
The type integer is the common choice, as it offers the best balance between range, storage size, and performance. The smallint type is generally only used if disk space is at a premium. The bigint type is designed to be used when the range of the integer type is insufficient.
Why is Postgres still recommending 32-bit ints in literal 2020?
Do Postgres/SQLite support fulltext search?
It does, I'm using it for note
.
If so, maybe we could have a text column containing all the parameters, and then fulltext search that?
Like, keep all the columns, but have one for the full log for searching? I'm currently using full text search only for note
, but I suppose that could work.
Like, keep all the columns, but have one for the full log for searching? I'm currently using full text search only for
note
, but I suppose that could work.
Yeah, that's what I was thinking, but I'm not a database person. I'd want @scheibo or @monsanto to weigh in on whether or not that's a good idea.
Bump.
I'm going to work on this. I agree with chaos that we should use SQLite rather than Postgres for this; less overhead is good here. @TheJetOU, I know you worked on this. Did you have some WIP design that you would be okay with sharing, besides the schema below?
Schema:
CREATE TABLE modlog ( -- UNIX timestamp timestamp timestampz NOT NULL, room_id TEXT NOT NULL, -- ROOMBAN, NOTE etc. action TEXT NOT NULL, user_id TEXT, autoconfirmed TEXT, alts TEXT[], ip TEXT, -- Naming might be a bit poor here logged_by TEXT, note TEXT ); CREATE INDEX ml_index_1 ON modlog(room_id, timestamp); -- If we're searching by user, we're going to look up all of these cols CREATE INDEX ml_index_2 ON modlog(user_id, alts, autoconfirmed); -- Optimizing the full text search CREATE INDEX ml_index_3 USING GIN ON modlog(tsvector(note));
Query:
SELECT * FROM modlog WHERE room_id = $1 AND $2 = ANY(alts) AND to_tsvector(note) @@ to_tsquery($3); ORDER BY timestamp DESC LIMIT $4
tbh the schema looks ok to me - AFAIK tsvector
is Postgres only, so we'd have to use something else for searching notes. @monsanto, your thoughts? Would a WHERE ... LIKE
query be too slow here?
There's also the question of whether the user=
, ip=
, note=
, action=
etc syntax would be too confusing to staff (I think it's fine, although /modlog (text)
with no specific parameter could perhaps use some intelligence to guess what field is being searched (e.g. queries matching the IP regex are probably IPs, queries in ALL CAPS are probably actions, queries <19 characters might be usernames, and so on), which would also make the transition smoother for staff.
Also, I know JetOU wanted to continue maintaining flat files, which is okay I guess, but I feel like sqlite isn't that much more work for side servers (since, unlike Postgres, it doesn't require maintaining a separate service).
@AnnikaCodes I don't think this PR reflects the current state of jetou's work. From our discussions:
LIKE
for substring search or streaming the rows to JS and doing post-processing.better-sqlite3
instead of node-sqlite3
which is used for the PM logCREATE INDEX ml_index_1 ON modlog(timestamp, room_id);
CREATE INDEX ml_index_2 ON modlog(action);
CREATE INDEX ml_index_3 ON modlog(action_taker);
CREATE INDEX ml_index_4 ON modlog(user_id, autoconfirmed);
CREATE INDEX ml_index_5 ON modlog(ip);
-- /modlog [roomid], action=[action], actiontaker=[userid], user=[userid], ip=[ip], note=[note]
-- Alts and notes will be filtered via streaming to JavaScript
SELECT *
FROM modlog
WHERE
(room_id = $1 OR roomid = 'global-' || $1) AND
action = $2 AND
action_taker = $3 AND
(user_id = $4 OR autoconfirmed = $4) AND
ip = $5
ORDER BY timestamp DESC
-- /modlog global, action=[action], actiontaker=[userid], user=[userid], ip=[ip], note=[note]
SELECT *
FROM modlog
WHERE
(roomid = 'global-%' OR roomid = 'global') AND
action = $2 AND
action_taker = $3 AND
(user_id = $4 OR autoconfirmed = $4) AND
ip = $5
ORDER BY timestamp DESC
There's also the question of whether the user=, ip=, note=, action= etc syntax would be too confusing to staff (I think it's fine, although /modlog (text) with no specific parameter could perhaps use some intelligence to guess what field is being searched (e.g. queries matching the IP regex are probably IPs, queries in ALL CAPS are probably actions, queries <19 characters might be usernames, and so on), which would also make the transition smoother for staff.
I don't use modlog very much so you may want to see what the rest of the staff regularly searches for. This will help design the indexes too.
CREATE INDEX ml_index_3 ON modlog(action_taker);
What is action_taker
? Is that the same as logged_by
?
(room_id = $1 OR roomid = 'global-' || $1)
I was thinking of a separate boolean (or rather tinyint since SQLite doesn't support real booleans) to represent whether a modlog entry should be in the global modlog (since something can be in a room modlog and the global modlog).
The alts
field is also something to be considered - a user can have multiple alts marked in the modlog when they're punished and SQLite as far as I know doesn't support list datatypes, so we'd have to do something like set alts
to alt1,alt2,alt3
and then do WHERE alts LIKE '%' || $1 || '%'
.
Other than that jetou's last setup looks good.
I don't use modlog very much so you may want to see what the rest of the staff regularly searches for. This will help design the indexes too.
Sure, is this an appropriate thing to post in the staff forum on Smogon or should I do it more informally?
What is action_taker? Is that the same as logged_by?
I believe so
The alts field is also something to be considered - a user can have multiple alts marked in the modlog when they're punished and SQLite as far as I know doesn't support list datatypes, so we'd have to do something like set alts to alt1,alt2,alt3 and then do WHERE alts LIKE '%' || $1 || '%'.
I'd add a modlog_id
field to modlog
, and have another table alts
that maps modlog_id
-> alt
Sure, is this an appropriate thing to post in the staff forum on Smogon or should I do it more informally?
Whatever you would feel is most helpful to you
@AnnikaCodes - https://github.com/TheJetOU/Pokemon-Showdown/tree/modlog-sqlite
IDK how well up to date this with PS code since I wrote it four months ago.
What is action_taker? Is that the same as logged_by?
Yes.
Thanks for carrying my will forward and good luck! :).
Thanks! That looks like a good start. Like I said, I'd prefer to stop supporting FS modlogs because of the extra maintenance burden. (Also, I'd rather make modlog writes take an object with optional properties to avoid the modlog(undefined, undefined, message)
thing.)
@monsanto - you meant modlog_id
be a primary key type thing, incrementing on each entry, correct? (i.e. not a userid)
Also it seems a bit overkill to me to have an entire separate table for alts (which would make searching more complicated), but if you think it's the best I can work on it. (We'd have to either LEFT JOIN
and then combine duplicate records or run a separate query to get alts.)
you meant modlog_id be a primary key type thing, incrementing on each entry, correct? (i.e. not a userid)
Yes
Also it seems a bit overkill to me to have an entire separate table for alts (which would make searching more complicated)
This is just how you store arrays in SQL.
We'd have to either LEFT JOIN and then combine duplicate records or run a separate query to get alts.
Use an EXISTS
subquery to test for an alt.
So far this is what I have for a query (alternately with (roomid = 'global' OR roomid LIKE 'global-%')
for searching global modlogs, just like jetou):
SELECT *, (SELECT group_concat(userid, ',') FROM alts WHERE entryid = modlog.entryid) as alts
FROM modlog
WHERE
roomid IN ($1, $2) AND
action = $3 AND
(
userid REGEXP $4 OR
autoconfirmed_userid REGEXP $4 OR
EXISTS(SELECT * FROM alts WHERE entryid = modlog.entryid AND userid REGEXP $4)
) AND
ip REGEXP $5 AND
action_taker REGEXP $6 AND
note REGEXP $7
ORDER BY timestamp DESC
LIMIT $8
And for a schema:
CREATE TABLE modlog (
entryid INTEGER NOT NULL PRIMARY KEY,
timestamp INTEGER NOT NULL,
roomid TEXT NOT NULL,
action TEXT NOT NULL,
visual_roomid TEXT,
action_taker TEXT,
userid TEXT,
autoconfirmed_userid TEXT,
ip TEXT,
note TEXT
);
CREATE TABLE alts (
entryid INTEGER NOT NULL,
userid TEXT NOT NULL
);
CREATE INDEX ml_index_1 ON modlog(timestamp);
CREATE INDEX ml_index_2 ON modlog(roomid, timestamp);
CREATE INDEX ml_index_3 ON modlog(action, timestamp);
CREATE INDEX ml_index_4 ON modlog(action_taker, timestamp);
CREATE INDEX ml_index_5 ON modlog(userid, timestamp);
CREATE INDEX ml_index_6 ON modlog(autoconfirmed_userid, timestamp);
CREATE INDEX ml_index_7 ON modlog(ip, timestamp);
Hm actually, I think it might be better to sort by entryid
because of SQLite's ROWID and the corresponding performance increase. That'd also eliminate the need for all the indices on timestamp
.
Hm actually, I think it might be better to sort by entryid because of SQLite's ROWID and the corresponding performance increase. That'd also eliminate the need for all the indices on timestamp.
I wouldn't try to microoptimize the schema. There isn't a guarantee that a larger entryid implies a later timestamp. Even if you restricted your SQLite usage enough to preserve this invariant you are ruling out restoring old modlog data after deployment.
I wouldn't try to microoptimize the schema. There isn't a guarantee that a larger entryid implies a later timestamp. Even if you restricted your SQLite usage enough to preserve this invariant you are ruling out restoring old modlog data after deployment.
Good point.. Does the rest of the schema look OK?
REGEXP
The REGEXP operator is a special syntax for the regexp() user function. No regexp() user function is defined by default and so use of the REGEXP operator will normally result in an error message. If an application-defined SQL function named "regexp" is added at run-time, then the "X REGEXP Y" operator will be implemented as a call to "regexp(Y,X)".
I'd just explicitly use the function you want to use instead of relying on this sugar.
CREATE INDEX ml_index_4 ON modlog(action_taker, timestamp); CREATE INDEX ml_index_5 ON modlog(userid, timestamp); CREATE INDEX ml_index_6 ON modlog(autoconfirmed_userid, timestamp); CREATE INDEX ml_index_7 ON modlog(ip, timestamp);
These will never be used unless you special case exact matches when generating SQL; the query-planner won't be able to look inside your user-defined JS function.
visual_roomid TEXT,
What is this?
entryid INTEGER NOT NULL PRIMARY KEY,
I generally prefer to call these like <table>_id
so you can use USING
in joins without worrying about column ambiguity. If you're going to name it something generic i'd just call it id
.
roomid IN ($1, $2) AND action = $3 AND
If you intend to have both of these use an index you need a multicolumn index like (roomid, action, timestamp)
. I don't know how worthwhile this is. I also don't know how worthwhile it is to have an (action, timestamp)
index. You should think carefully about which parts of the query are expensive and focus on indexing that. If you want a flexible search interface you won't be able to index everything.
What is this?
visual_roomid
is used for cases where the ID to search for is different from the name to be displayed. Currently, the only use case is room tournaments, where the roomid
will be, for example, battle-gen8randombattle-1
but the visual_roomid
will be battle-gen8randombattle-1 tournament: lobby
.
I don't think actions are very common search terms. Probably (roomid, userid, timestamp)
and (roomid, ip, timestamp)
-- those seem like the most common search combos.
The other suggestions seem fine, I'll implement them.
Context
The current Modlog setup of using text files and ripgrep to meet the search criteria is sluggish and has been the result of numerous complaints.
Solution
Databases! It has shown to work in PM logs, from taking ~5 minutes to being instantaneous. It is obviously not a 'silver bullet' though, like everything it requires optimizations and monitoring to ensure it's meeting requirements.
Why Postgres?
While SQLite is easier to set up (it's just a file), I think Postgres is a better choice for a couple reasons
Proposed Schema
Proposed Query
Of course, concatenation could in theory be expensive and thus make this query slow, but I don't think making premature optimization is worth it. If it does turn out to be too slow, we could rewrite the modlog command such that it takes a filter for each query and we don't have to concatenate all of them.
Conversion
Create a
tools/database-converters
where we support converting modlog from the old format (with the script @sparkychildcharlie wrote) to the new format, converting the new one to Postgres, and vice versa only for the latter.\cc @Zarel @scheibo