shatteredsilicon / ssm-submodules

GNU Affero General Public License v3.0
0 stars 2 forks source link

Add monitoring for slave thread utilisation #205

Open gordan-bobic opened 9 months ago

gordan-bobic commented 9 months ago

When using multi-threaded replication, we currently have no way of establishing how many threads are being used on the slave side to apply the changes. That makes tuning the number of parallel threads and other parameters largely guesswork. To improve visibility of this, we need to sample the following (let's start with once per second):

mysql> select id, user, host, db, command, state from information_schema.processlist where user = 'system user' order by id asc;
+---------+-------------+-----------------+------+---------+----------------------------------------------------------+
| id      | user        | host            | db   | command | state                                                    |
+---------+-------------+-----------------+------+---------+----------------------------------------------------------+
| 1419920 | system user |                 | NULL | Query   | Waiting for an event from Coordinator                    |
| 1419921 | system user |                 | NULL | Query   | Waiting for an event from Coordinator                    |
| 1419922 | system user |                 | NULL | Query   | Waiting for an event from Coordinator                    |
| 1419923 | system user |                 | NULL | Query   | Waiting for an event from Coordinator                    |
| 1419924 | system user |                 | NULL | Query   | Waiting for an event from Coordinator                    |
| 1419925 | system user |                 | NULL | Query   | Waiting for an event from Coordinator                    |
| 1419918 | system user |                 | NULL | Query   | Waiting for an event from Coordinator                    |
| 1419919 | system user |                 | NULL | Query   | Waiting for an event from Coordinator                    |
| 1419916 | system user | connecting host | NULL | Connect | waiting for handler commit                               |
| 1419917 | system user |                 | NULL | Query   | Replica has read all relay log; waiting for more updates |
+---------+-------------+-----------------+------+---------+----------------------------------------------------------+

The above example is from MySQL 8.0. Output (and what we need to filter for) may be different on older versions of MySQL and on MariaDB.

We need to make this work on MySQL 5.6+ and MariaDB 10.2+ (we don't have to care about older versions).

We then need a graph of states specifically for replication threads, non-exhaustive list here:

Queueing source event to the relay log
Reading event from the relay log
Replica has read all relay log; waiting for more updates
Waiting for an event from Coordinator
Executing event
Applying batch of row changes (write)
Applying batch of row changes (update)
Applying batch of row changes (delete)
Waiting for dependent transaction to commit
starting
freeing items
waiting for handler commit
System lock
Opening tables
closing tables

This will be similar to the graph we already have in MySQL / MySQL Overview / Process States, only this one will be more specialist (only for system user threads) and should be located on the MySQL / MySQL Replication dashboard.

In the above list there are 10 threads, of which 8 are workers. The other two handle relay log reader / coordinator (1419916 in the above example) and relay log writer / puller from master 1419917 in the above example). Identifying them may be a little tricky, but I am not sure we actually need to.

The important thing we need is to see how many threads are in state:

Waiting for an event from Coordinator

which means the worker thread is idle, and

Applying batch of row changes (write)
Applying batch of row changes (update)
Applying batch of row changes (delete)
waiting for handler commit
starting
freeing items
Executing event
Opening tables
closing tables

which means the thread is busy (there may be other busy states, this is just what I pulled out in 30 seconds of quick testing).

The lines on the graph should be stacked (so they should always add up to the total number of threads, with the bottom events from the relay/coordinator threads, and the others in fixed order in the stack.

The idea here is similar to tracking how many events we have in the queue for I/O threads. If our workers are saturated most of the time, we could benefit from more slave parallel worker threads. If they are hardly ever saturated, we need to try to come up with a way to tune parallelism in the replication stream better.

This will eventually become part of the adviser dashboard, but other things will need to be added and considered before that is the case because there are significant differences between MySQL's and MariaDB's parallel replication features and implementations. For now - this is just to monitor thread activity / worker thread saturation.

gordan-bobic commented 8 months ago

Looks like we can get thread IDs from:

SELECT CHANNEL_NAME, WORKER_ID, THREAD_ID
FROM   performance_schema.replication_applier_status_by_worker;