Migrating Modlog to Postgres

thejetou commented 4 years ago

Context

The current Modlog setup of using text files and ripgrep to meet the search criteria is sluggish and has been the result of numerous complaints.

Solution

Databases! It has shown to work in PM logs, from taking ~5 minutes to being instantaneous. It is obviously not a 'silver bullet' though, like everything it requires optimizations and monitoring to ensure it's meeting requirements.

Why Postgres?

While SQLite is easier to set up (it's just a file), I think Postgres is a better choice for a couple reasons

The server will use the same database engine as the client (see: https://github.com/smogon/pokemon-showdown-client/issues/1423)
SQLite is meant to be a 'lightweight' database, intended for use in embedded devices (or mobile phones), so scaling Postgres higher should in theory be easier.

Proposed Schema

CREATE TABLE modlog (
    -- UNIX timestamp
    -- Integer technically works here, but would break on 2038
    timestamp bigint NOT NULL
    room_id TEXT NOT NULL
    -- ROOMBAN, NOTE etc.
    action TEXT NOT NULL
    user_id TEXT
    autoconfirmed TEXT
    alts TEXT[],
    ip TEXT,
    logged_by TEXT,
    note TEXT
);

-- We use `WHERE roomid = $1` _every_ single time, so an index seems worth it
CREATE INDEX ml_index ON modlog(room_id)

Proposed Query

SELECT *
FROM modlog
WHERE
    room_id = $1
    AND
    concat_ws(" ", time_stamp, action, user_id, autoconfirmed, array_to_string(autoconfirmed, " "), alts, ip, logged_by, note) LIKE $2
ORDER BY timestamp DESC
LIMIT $3

Of course, concatenation could in theory be expensive and thus make this query slow, but I don't think making premature optimization is worth it. If it does turn out to be too slow, we could rewrite the modlog command such that it takes a filter for each query and we don't have to concatenate all of them.

Conversion

Create a tools/database-converters where we support converting modlog from the old format (with the script @sparkychildcharlie wrote) to the new format, converting the new one to Postgres, and vice versa only for the latter.

\cc @Zarel @scheibo

Zarel commented 4 years ago

@scheibo So I finally had time to work on database migrations, and talked them over with JetOU. JetOU thought that migrating to SQLite was unnecessary and he wanted to migrate to Postgres instead; I said I was tentatively okay with it, but it had to be done on Modlog and not Punishments because I didn't want Punishments on Postgres.

scheibo commented 4 years ago

SQLite vs. Postrgres

I have nothing against going right to Postgres, but SQLite can easily handle modlogs 'scale'. PM logs are considerably larger and are plenty fast on SQLite (though theyre not doing stupid table scans lmao).

What happened to making the persistence pluggable? I'm still fairly confident theres a level of abstraction where Postgres vs. Sqlite vs. Files is just configurable...

Proposed Query

Yeah... this effectively is getting us pretty much zero value over flat files and would in fact likely be slower. If you actually want to improve modlog performance you want to add a better index than just room id and actually use it, though this is going to probably involve changing how the modlog search command is structured

I didn't want Punishments on Postgres.

Why?

Zarel commented 4 years ago

I have nothing against going right to Postgres, but SQLite can easily handle modlogs 'scale'. PM logs are considerably larger and are plenty fast on SQLite (though theyre not doing stupid table scans lmao).

I told him that, but he said it was just nicer to have all the login and sim stuff in a centralized database, which I guess I don't disagree with. Would be nice to have modlog in userpages.

Proposed Query

I didn't see that. Yeah, that query isn't going to be fast.

Maybe start off with some username queries?

Why [not Punishments on Postgres]?

I preferred a migration that would make more of a difference. I think Punishments on Postgres might actually make it slower (because it's going over the network) rather than being local.

monsanto commented 4 years ago

My intuition is that SQLite will be faster than Postgres for a database of this size. The reason is that a Postgres query always involves inter-process communication (incl. serialization overheads), whereas SQLite operates in-process.

Other advantages of SQLite over Postgres include:

better cross-platform story
easier to write tests for (although it's also easy enough to write tests for Postgres using testcontainers)
easier to administer, no separate Postgres service

Some of this would be mitigated by having pluggable persistence but then you run into the issue of potentially running different code in testing than in production.

As for the schema/query

Use an appropriate date/time type instead of bigint
You need a multi-column index on room_id + timestamp, since you are sorting on timestamp
I agree with @scheibo, your proposed query will probably be slower than a flat file for IPC overhead alone.

Zarel commented 4 years ago

Note that the Postgres query would be over-network (it would be hosted on a different server). It's more than just IPC overhead.

Zarel commented 4 years ago

Also note that jetou said he didn't care about "better cross-platform story" because he planned to continue maintaining flat files for side servers.

thejetou commented 4 years ago

What happened to making the persistence pluggable? I'm still fairly confident theres a level of abstraction where Postgres vs. Sqlite vs. Files is just configurable...

Yes, I do plan to make it configurable, but you still need to write code to support the option.

Also, RE: SQLite vs Postgres

The main reason I want Postgres, is as Zarel said, to share the same database with the client. Though over network doesn't sound too nice, so I might have to concede..

OK, schema and query, take two!

Schema:

CREATE TABLE modlog (
    -- UNIX timestamp
    timestamp timestampz NOT NULL,
    room_id TEXT NOT NULL,
    -- ROOMBAN, NOTE etc.
    action TEXT NOT NULL,
    user_id TEXT,
    autoconfirmed TEXT,
    alts TEXT[],
    ip TEXT,
   -- Naming might be a bit poor here
    logged_by TEXT,
    note TEXT
);

CREATE INDEX ml_index_1 ON modlog(room_id, timestamp);
-- If we're searching by user, we're going to look up all of these cols
CREATE INDEX ml_index_2 ON modlog(user_id, alts, autoconfirmed);
-- Optimizing the full text search
CREATE INDEX ml_index_3 USING GIN ON modlog(tsvector(note));

Query:

SELECT *
FROM modlog
WHERE
    room_id = $1 AND $2 = ANY(alts)
    AND to_tsvector(note) @@ to_tsquery($3);
ORDER BY timestamp DESC
LIMIT $4

NOTE: This will make the modlog command's syntax be: /modlog [roomid], user=[user], ip=[ip], note=[note], instead of a full text search of the entire log.

Zarel commented 4 years ago

Do Postgres/SQLite support fulltext search? If so, maybe we could have a text column containing all the parameters, and then fulltext search that?

Zarel commented 4 years ago

The type integer is the common choice, as it offers the best balance between range, storage size, and performance. The smallint type is generally only used if disk space is at a premium. The bigint type is designed to be used when the range of the integer type is insufficient.

Why is Postgres still recommending 32-bit ints in literal 2020?

thejetou commented 4 years ago

Do Postgres/SQLite support fulltext search?

It does, I'm using it for note.

If so, maybe we could have a text column containing all the parameters, and then fulltext search that?

Like, keep all the columns, but have one for the full log for searching? I'm currently using full text search only for note, but I suppose that could work.

Zarel commented 4 years ago

Like, keep all the columns, but have one for the full log for searching? I'm currently using full text search only for note, but I suppose that could work.

Yeah, that's what I was thinking, but I'm not a database person. I'd want @scheibo or @monsanto to weigh in on whether or not that's a good idea.

thejetou commented 4 years ago

Bump.

AnnikaCodes commented 4 years ago

I'm going to work on this. I agree with chaos that we should use SQLite rather than Postgres for this; less overhead is good here. @TheJetOU, I know you worked on this. Did you have some WIP design that you would be okay with sharing, besides the schema below?

Schema:

CREATE TABLE modlog (
    -- UNIX timestamp
    timestamp timestampz NOT NULL,
    room_id TEXT NOT NULL,
    -- ROOMBAN, NOTE etc.
    action TEXT NOT NULL,
    user_id TEXT,
    autoconfirmed TEXT,
    alts TEXT[],
    ip TEXT,
   -- Naming might be a bit poor here
    logged_by TEXT,
    note TEXT
);

CREATE INDEX ml_index_1 ON modlog(room_id, timestamp);
-- If we're searching by user, we're going to look up all of these cols
CREATE INDEX ml_index_2 ON modlog(user_id, alts, autoconfirmed);
-- Optimizing the full text search
CREATE INDEX ml_index_3 USING GIN ON modlog(tsvector(note));

Query:

SELECT *
FROM modlog
WHERE
    room_id = $1 AND $2 = ANY(alts)
    AND to_tsvector(note) @@ to_tsquery($3);
ORDER BY timestamp DESC
LIMIT $4

AnnikaCodes commented 4 years ago

tbh the schema looks ok to me - AFAIK tsvector is Postgres only, so we'd have to use something else for searching notes. @monsanto, your thoughts? Would a WHERE ... LIKE query be too slow here?

There's also the question of whether the user=, ip=, note=, action= etc syntax would be too confusing to staff (I think it's fine, although /modlog (text) with no specific parameter could perhaps use some intelligence to guess what field is being searched (e.g. queries matching the IP regex are probably IPs, queries in ALL CAPS are probably actions, queries <19 characters might be usernames, and so on), which would also make the transition smoother for staff.

Also, I know JetOU wanted to continue maintaining flat files, which is okay I guess, but I feel like sqlite isn't that much more work for side servers (since, unlike Postgres, it doesn't require maintaining a separate service).

monsanto commented 4 years ago

@AnnikaCodes I don't think this PR reflects the current state of jetou's work. From our discussions:

We agreed no flat files, no point in maintaining two code paths
I suggested using LIKE for substring search or streaming the rows to JS and doing post-processing.
better-sqlite3 instead of node-sqlite3 which is used for the PM log
His last schema had indexes

CREATE INDEX ml_index_1 ON modlog(timestamp, room_id);
CREATE INDEX ml_index_2 ON modlog(action);
CREATE INDEX ml_index_3 ON modlog(action_taker);
CREATE INDEX ml_index_4 ON modlog(user_id, autoconfirmed);
CREATE INDEX ml_index_5 ON modlog(ip);

His last query was

    -- /modlog [roomid], action=[action], actiontaker=[userid], user=[userid], ip=[ip], note=[note]
    -- Alts and notes will be filtered via streaming to JavaScript
    SELECT *
    FROM modlog
    WHERE
        (room_id = $1 OR roomid = 'global-' || $1) AND
        action = $2 AND
        action_taker = $3 AND
        (user_id = $4 OR autoconfirmed = $4) AND
        ip = $5
    ORDER BY timestamp DESC

    -- /modlog global, action=[action], actiontaker=[userid], user=[userid], ip=[ip], note=[note]
    SELECT *
    FROM modlog
    WHERE
        (roomid = 'global-%' OR roomid = 'global') AND
        action = $2 AND
        action_taker = $3 AND
        (user_id = $4 OR autoconfirmed = $4) AND
        ip = $5
    ORDER BY timestamp DESC

I didn't necessarily give the go-ahead to these, there was a lot of discussion about indexing and query planning which I'm not sure will carry over to your work. You may want to look it over and see what you agree/disagree with before giving me a new proposal.
I also gave him some modlogs that I can give you too.

There's also the question of whether the user=, ip=, note=, action= etc syntax would be too confusing to staff (I think it's fine, although /modlog (text) with no specific parameter could perhaps use some intelligence to guess what field is being searched (e.g. queries matching the IP regex are probably IPs, queries in ALL CAPS are probably actions, queries <19 characters might be usernames, and so on), which would also make the transition smoother for staff.

I don't use modlog very much so you may want to see what the rest of the staff regularly searches for. This will help design the indexes too.

AnnikaCodes commented 4 years ago

CREATE INDEX ml_index_3 ON modlog(action_taker);

What is action_taker? Is that the same as logged_by?

(room_id = $1 OR roomid = 'global-' || $1)

I was thinking of a separate boolean (or rather tinyint since SQLite doesn't support real booleans) to represent whether a modlog entry should be in the global modlog (since something can be in a room modlog and the global modlog).

The alts field is also something to be considered - a user can have multiple alts marked in the modlog when they're punished and SQLite as far as I know doesn't support list datatypes, so we'd have to do something like set alts to alt1,alt2,alt3 and then do WHERE alts LIKE '%' || $1 || '%'.

Other than that jetou's last setup looks good.

I don't use modlog very much so you may want to see what the rest of the staff regularly searches for. This will help design the indexes too.

Sure, is this an appropriate thing to post in the staff forum on Smogon or should I do it more informally?

monsanto commented 4 years ago

What is action_taker? Is that the same as logged_by?

I believe so

The alts field is also something to be considered - a user can have multiple alts marked in the modlog when they're punished and SQLite as far as I know doesn't support list datatypes, so we'd have to do something like set alts to alt1,alt2,alt3 and then do WHERE alts LIKE '%' || $1 || '%'.

I'd add a modlog_id field to modlog, and have another table alts that maps modlog_id -> alt

Sure, is this an appropriate thing to post in the staff forum on Smogon or should I do it more informally?

Whatever you would feel is most helpful to you

thejetou commented 4 years ago

@AnnikaCodes - https://github.com/TheJetOU/Pokemon-Showdown/tree/modlog-sqlite

IDK how well up to date this with PS code since I wrote it four months ago.

What is action_taker? Is that the same as logged_by?

Yes.

Thanks for carrying my will forward and good luck! :).

AnnikaCodes commented 4 years ago

Thanks! That looks like a good start. Like I said, I'd prefer to stop supporting FS modlogs because of the extra maintenance burden. (Also, I'd rather make modlog writes take an object with optional properties to avoid the modlog(undefined, undefined, message) thing.)

@monsanto - you meant modlog_id be a primary key type thing, incrementing on each entry, correct? (i.e. not a userid) Also it seems a bit overkill to me to have an entire separate table for alts (which would make searching more complicated), but if you think it's the best I can work on it. (We'd have to either LEFT JOIN and then combine duplicate records or run a separate query to get alts.)

monsanto commented 4 years ago

you meant modlog_id be a primary key type thing, incrementing on each entry, correct? (i.e. not a userid)

Yes

Also it seems a bit overkill to me to have an entire separate table for alts (which would make searching more complicated)

This is just how you store arrays in SQL.

We'd have to either LEFT JOIN and then combine duplicate records or run a separate query to get alts.

Use an EXISTS subquery to test for an alt.

AnnikaCodes commented 4 years ago

So far this is what I have for a query (alternately with (roomid = 'global' OR roomid LIKE 'global-%') for searching global modlogs, just like jetou):

SELECT *, (SELECT group_concat(userid, ',') FROM alts WHERE entryid = modlog.entryid) as alts
FROM modlog
WHERE
    roomid IN ($1, $2) AND
    action = $3 AND
    (
        userid REGEXP $4 OR
        autoconfirmed_userid REGEXP $4 OR
        EXISTS(SELECT * FROM alts WHERE entryid = modlog.entryid AND userid REGEXP $4)
    ) AND
    ip REGEXP $5 AND
    action_taker REGEXP $6 AND
    note REGEXP $7
ORDER BY timestamp DESC
LIMIT $8

And for a schema:

CREATE TABLE modlog (
    entryid INTEGER NOT NULL PRIMARY KEY,
    timestamp INTEGER NOT NULL,
    roomid TEXT NOT NULL,
    action TEXT NOT NULL,
    visual_roomid TEXT,
    action_taker TEXT,
    userid TEXT,
    autoconfirmed_userid TEXT,
    ip TEXT,
    note TEXT
);
CREATE TABLE alts (
    entryid INTEGER NOT NULL,
    userid TEXT NOT NULL
);
CREATE INDEX ml_index_1 ON modlog(timestamp);
CREATE INDEX ml_index_2 ON modlog(roomid, timestamp);
CREATE INDEX ml_index_3 ON modlog(action, timestamp);
CREATE INDEX ml_index_4 ON modlog(action_taker, timestamp);
CREATE INDEX ml_index_5 ON modlog(userid, timestamp);
CREATE INDEX ml_index_6 ON modlog(autoconfirmed_userid, timestamp);
CREATE INDEX ml_index_7 ON modlog(ip, timestamp);

AnnikaCodes commented 4 years ago

Hm actually, I think it might be better to sort by entryid because of SQLite's ROWID and the corresponding performance increase. That'd also eliminate the need for all the indices on timestamp.

monsanto commented 4 years ago

Hm actually, I think it might be better to sort by entryid because of SQLite's ROWID and the corresponding performance increase. That'd also eliminate the need for all the indices on timestamp.

I wouldn't try to microoptimize the schema. There isn't a guarantee that a larger entryid implies a later timestamp. Even if you restricted your SQLite usage enough to preserve this invariant you are ruling out restoring old modlog data after deployment.

AnnikaCodes commented 4 years ago

I wouldn't try to microoptimize the schema. There isn't a guarantee that a larger entryid implies a later timestamp. Even if you restricted your SQLite usage enough to preserve this invariant you are ruling out restoring old modlog data after deployment.

Good point.. Does the rest of the schema look OK?

monsanto commented 4 years ago

REGEXP

The REGEXP operator is a special syntax for the regexp() user function. No regexp() user function is defined by default and so use of the REGEXP operator will normally result in an error message. If an application-defined SQL function named "regexp" is added at run-time, then the "X REGEXP Y" operator will be implemented as a call to "regexp(Y,X)".

I'd just explicitly use the function you want to use instead of relying on this sugar.

CREATE INDEX ml_index_4 ON modlog(action_taker, timestamp); CREATE INDEX ml_index_5 ON modlog(userid, timestamp); CREATE INDEX ml_index_6 ON modlog(autoconfirmed_userid, timestamp); CREATE INDEX ml_index_7 ON modlog(ip, timestamp);

These will never be used unless you special case exact matches when generating SQL; the query-planner won't be able to look inside your user-defined JS function.

visual_roomid TEXT,

What is this?

entryid INTEGER NOT NULL PRIMARY KEY,

I generally prefer to call these like <table>_id so you can use USING in joins without worrying about column ambiguity. If you're going to name it something generic i'd just call it id.

roomid IN ($1, $2) AND action = $3 AND

If you intend to have both of these use an index you need a multicolumn index like (roomid, action, timestamp). I don't know how worthwhile this is. I also don't know how worthwhile it is to have an (action, timestamp) index. You should think carefully about which parts of the query are expensive and focus on indexing that. If you want a flexible search interface you won't be able to index everything.

AnnikaCodes commented 4 years ago

What is this?

visual_roomid is used for cases where the ID to search for is different from the name to be displayed. Currently, the only use case is room tournaments, where the roomid will be, for example, battle-gen8randombattle-1 but the visual_roomid will be battle-gen8randombattle-1 tournament: lobby.

I don't think actions are very common search terms. Probably (roomid, userid, timestamp) and (roomid, ip, timestamp) -- those seem like the most common search combos.

The other suggestions seem fine, I'll implement them.

smogon / pokemon-showdown