Open morgo opened 3 years ago
I took a look at this today. I instrumented some of the common domain tasks and then ran a tpcc bench:
mysql> select * from information_schema.background_tasks order by sum_latency desc;
+------------------------------+----------------------------+----------------------------+------------+---------------+------------+------------+--------------+-------------+-------------+-------------+
| NAME | LAST_STARTED_AT | LAST_FINISHED_AT | LAST_ERROR | LAST_ERROR_AT | EXEC_COUNT | SUM_ERRORS | SUM_LATENCY | MAX_LATENCY | MIN_LATENCY | AVG_LATENCY |
+------------------------------+----------------------------+----------------------------+------------+---------------+------------+------------+--------------+-------------+-------------+-------------+
| stats_refresh_vars | 2021-09-02 13:06:35.579417 | 2021-09-02 13:06:35.579423 | NULL | NULL | 661 | 0 | 243487934430 | 65039384343 | 1460 | 368362987 |
| stats_update | 2021-09-02 13:06:35.579423 | 2021-09-02 13:06:35.587883 | NULL | NULL | 661 | 0 | 6934087802 | 278466538 | 685273 | 10490299 |
| telemetry_initial_run | 2021-09-02 12:30:01.848558 | 2021-09-02 12:30:02.579291 | NULL | NULL | 1 | 0 | 730732583 | 730732583 | 730732583 | 730732583 |
| stats_load_needed_histograms | 2021-09-02 13:06:35.587884 | 2021-09-02 13:06:35.588064 | NULL | NULL | 661 | 0 | 657926907 | 291482777 | 78970 | 995350 |
| rebuild_sysvar_cache | 2021-09-02 13:06:32.116737 | 2021-09-02 13:06:32.118612 | NULL | NULL | 75 | 0 | 243429942 | 24292307 | 1267515 | 3245732 |
| reload_privileges | 2021-09-02 13:00:01.958983 | 2021-09-02 13:00:02.025710 | NULL | NULL | 4 | 0 | 190104821 | 66727098 | 11162729 | 47526205 |
| reload_infoschema | 2021-09-02 13:06:24.330184 | 2021-09-02 13:06:24.330786 | NULL | NULL | 171 | 0 | 160097787 | 15003494 | 189811 | 936244 |
| stats_init | 2021-09-02 12:30:02.579355 | 2021-09-02 12:30:02.584431 | NULL | NULL | 1 | 0 | 5076289 | 5076289 | 5076289 | 5076289 |
+------------------------------+----------------------------+----------------------------+------------+---------------+------------+------------+--------------+-------------+-------------+-------------+
8 rows in set (0.00 sec)
mysql> SELECT name, format_nano_time(sum_latency), format_nano_time(avg_latency), format_nano_time(max_latency) from information_schema.background_tasks order by sum_latency desc;
+------------------------------+-------------------------------+-------------------------------+-------------------------------+
| name | format_nano_time(sum_latency) | format_nano_time(avg_latency) | format_nano_time(max_latency) |
+------------------------------+-------------------------------+-------------------------------+-------------------------------+
| stats_refresh_vars | 4.06 min | 364.50 ms | 1.08 min |
| stats_update | 6.95 s | 10.40 ms | 278.47 ms |
| telemetry_initial_run | 730.73 ms | 730.73 ms | 730.73 ms |
| stats_load_needed_histograms | 659.27 ms | 986.93 us | 291.48 ms |
| rebuild_sysvar_cache | 243.43 ms | 3.25 ms | 24.29 ms |
| reload_privileges | 190.10 ms | 47.53 ms | 66.73 ms |
| reload_infoschema | 160.43 ms | 932.74 us | 15.00 ms |
| stats_init | 5.08 ms | 5.08 ms | 5.08 ms |
+------------------------------+-------------------------------+-------------------------------+-------------------------------+
8 rows in set (0.01 sec)
There's still a few small details to work out. stats_refresh_vars
is slow because lock time is included - but I actually think we can get rid of this background task because of sysvar cache.
MySQL information schema has a negative impact on performance. Do we have the same problem for TiDB?
MySQL information schema has a negative impact on performance.
I think the problem with things like information_schema.processlist
is that it might need to take some locks, so running queries on this on a high frequency might slowdown other processes as they might have to wait on these locks. Another part is that information_schema
tables are not indexed (at least in older versions), but as these are in memory and often small this shouldn't be much of an issue in practice. The exact implementation of each table may also impact the performance.
Do we have the same problem for TiDB?
I don't think we have the same issues as MySQL, but there might be other issues. Best to use some benchmark to check the performance of the thread querying the I_S table and also the performance of other thread, including connection setup.
@morgo
I have no problem with the design of information_schema.background_tasks, except we'd better add a 'duration' field and put
'LAST_ERROR', 'LAST_ERROR_AT' in the end of a row.
Maybe there will be some performance penalty as @dveeden mentioned above. We need to evaluate it.
MySQL information schema has a negative impact on performance. Do we have the same problem for TiDB?
Yes we do. But this lacks context.
There is an observer effect from recording where we it always has some cost to measure. Because these are background tasks, even if they take 10% longer to execute (high estimate), it is unlikely to impact user workloads. Ideally the server only spends a small percentage of time on background tasks, which initial measurement seems to show is true.
The second point is about information_schema
implementation. It does require a lock to copy the summary, which would briefly prevent background tasks from writing new values. MySQL has cases like @dveeden points out where show processlist
blocks new connections, which is much worse.
The memory usage of both TiDB and MySQL (pre 8.0) infoschema is also very high because it uses several copies internally (copying from the tracking struct; filling out the row in infoschema etc) and in TiDB there are no indexes or pushdown conditions. The impact of this limitation here is fairly low because there is only one row per background job. It is much higher in the case of information_schema.tables
on a system with many tables.
@ThomasYYYY your comment about MySQL information schema might also be about how pre-8.0 used the filesystem for data dictionary? We don't have that limitation in TiDB.
@morgo Understood the difference between MySQL and TiDB. I would like the background job information showed in information schema by default in new TiDB version. Since there is some side effect, document should tell users the potential impact and how to turn off the background job information.
OK. Lets say for argument's sake the metric is "no drop in QPS" not "no performance impact".
My point is there is a performance difference, but it's irrelevant because it's part of low-impact background tasks.
Feature Request
Is your feature request related to a problem? Please describe:
Currently the TiDB server has a number of background jobs which are not visible in SHOW PROCESSLIST, and the success of, or resources used are non-observable. The schedule of these tasks is also not visible (except for GC).
For example:
The list can manually be observed in
domain/domain.go
.Describe the feature you'd like:
It should be possible to observe:
Describe alternatives you've considered:
Ideally this could be in an information schema table, but a metrics schema table would work fine too (or in grafana).
Teachability, Documentation, Adoption, Migration Strategy:
It is useful for advanced debugging cases, since the current scenario is that we likely run background tasks too frequently. But we also have no idea the resources they require.