shatteredsilicon / ssm-submodules

Gather memory usage information for QAN #229

Open gordan-bobic opened 5 months ago

gordan-bobic commented 5 months ago

I suspect this is going to require a new column in the query_class_metrics table, and an additional harvesting method that can be used standalone or in combination with the slow_log/perf_schema harvesters to gather more information about the queries.

This new query harvester would poll the contents of information_schema.processlist. So I guess that means we need a third option for --query-source: --query-source=processlist.
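
A minimal sketch of such a poller, assuming a MariaDB server (MEMORY_USED is a MariaDB-specific processlist column, not available in stock MySQL); the type and function names here are illustrative, not the agent's actual API:

```go
package qan

import (
	"database/sql"

	_ "github.com/go-sql-driver/mysql"
)

// ProcEntry holds the columns we care about from one processlist row.
type ProcEntry struct {
	ID         uint64
	User       string
	Host       string
	TimeSec    int64          // seconds the statement has been running
	Info       sql.NullString // query text, possibly truncated
	MemoryUsed int64          // bytes (MariaDB-specific column)
}

// pollProcesslist fetches one snapshot of the currently running queries.
func pollProcesslist(db *sql.DB) ([]ProcEntry, error) {
	rows, err := db.Query(`
		SELECT ID, USER, HOST, TIME, INFO, MEMORY_USED
		FROM information_schema.PROCESSLIST
		WHERE COMMAND <> 'Sleep' AND INFO IS NOT NULL`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var out []ProcEntry
	for rows.Next() {
		var e ProcEntry
		if err := rows.Scan(&e.ID, &e.User, &e.Host, &e.TimeSec, &e.Info, &e.MemoryUsed); err != nil {
			return nil, err
		}
		out = append(out, e)
	}
	return out, rows.Err()
}
```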

Difficulties: it shouldn't harvest the same query instance twice. E.g. if a query takes 10 seconds to run and we poll every 1 second, it should not record the query 10 times; it should log the query from the last poll where it was seen before it went away. (The time column in processlist shows how long a query has been running, which lets us distinguish the same query instance from another one much like it running again on the same thread.)
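
A sketch of that dedup rule, reusing the ProcEntry type from the poller sketch above. Keying on connection ID plus the derived start time is one possible approach, and an assumption here, not a settled design (in practice the start time would need rounding to absorb poll jitter):

```go
// instanceKey identifies one running query instance across polls.
type instanceKey struct {
	ConnID    uint64
	StartUnix int64 // poll time minus TIME; round to absorb poll jitter
}

// tracker remembers the last snapshot of every still-running query.
type tracker struct {
	live map[instanceKey]ProcEntry
}

// observe takes one poll's entries and returns the queries that have
// finished since the previous poll, each emitted exactly once using
// its last-seen snapshot.
func (t *tracker) observe(now int64, entries []ProcEntry) []ProcEntry {
	seen := make(map[instanceKey]ProcEntry, len(entries))
	for _, e := range entries {
		k := instanceKey{ConnID: e.ID, StartUnix: now - e.TimeSec}
		seen[k] = e // overwriting keeps the latest snapshot
	}
	var finished []ProcEntry
	for k, last := range t.live {
		if _, still := seen[k]; !still {
			finished = append(finished, last)
		}
	}
	t.live = seen
	return finished
}
```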

This should also be runnable in an "auxiliary" mode, where the only thing we care about is the memory_used column. We want to take the query and its memory usage and send them up to the SSM server so the memory information is integrated with, rather than replacing, the information coming from the slow log. So there would be a separate thread running to fetch this and push it up to the SSM server for integration.
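
One possible shape for that separate thread, as a Go goroutine draining a channel of samples; sendToSSM is a hypothetical stand-in for whatever transport the agent already uses, not an existing function:

```go
// MemSample is one (query, memory) observation destined for the server.
type MemSample struct {
	Fingerprint string // normalized query text
	MemoryUsed  int64  // bytes
}

// runAuxLoop forwards memory samples independently of the slow_log /
// perf_schema harvesters, so it can be started as its own goroutine.
func runAuxLoop(samples <-chan MemSample, sendToSSM func(MemSample) error) {
	for s := range samples {
		if err := sendToSSM(s); err != nil {
			// A real agent would buffer and retry; dropping here only
			// loses a single memory data point.
			continue
		}
	}
}
```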

What makes this harder is that processlist appears to be limited to 64K: https://jira.mariadb.org/browse/MDEV-24139 which means the query we get from there could be truncated, while the one in the slow log will be complete. So we may have to assume it is the same query if left(query, 65535) matches, the time is an approximate match (within a few seconds), and whatever other metadata we have (e.g. user, host, thread ID) also matches.
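
A sketch of that matching heuristic; the 3-second tolerance and the exact set of metadata compared are assumptions for illustration:

```go
// sameQuery reports whether a complete slow-log statement and a possibly
// truncated processlist statement look like the same query instance.
func sameQuery(slowQuery, procQuery string,
	slowStart, procStart int64, // unix seconds when each started
	slowUser, procUser, slowHost, procHost string,
	slowThread, procThread uint64) bool {

	const maxProclistLen = 65535 // processlist INFO cap (MDEV-24139)
	const timeTolerance = 3      // seconds; assumed value

	// Compare only the prefix the processlist could have held.
	prefix := slowQuery
	if len(prefix) > maxProclistLen {
		prefix = prefix[:maxProclistLen]
	}
	if prefix != procQuery {
		return false
	}
	if abs64(slowStart-procStart) > timeTolerance {
		return false
	}
	return slowUser == procUser && slowHost == procHost && slowThread == procThread
}

func abs64(x int64) int64 {
	if x < 0 {
		return -x
	}
	return x
}
```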

The idea is to track the min/max/avg memory_used for queries.
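
A minimal sketch of that per-query-class aggregation; these three values would map onto the proposed new columns in query_class_metrics:

```go
// MemStats accumulates memory_used observations for one query class.
type MemStats struct {
	Min, Max, Sum int64
	Count         int64
}

// Add folds one observed memory_used value into the running stats.
func (m *MemStats) Add(bytes int64) {
	if m.Count == 0 || bytes < m.Min {
		m.Min = bytes
	}
	if bytes > m.Max {
		m.Max = bytes
	}
	m.Sum += bytes
	m.Count++
}

// Avg returns the mean memory_used, or 0 if nothing was observed.
func (m *MemStats) Avg() int64 {
	if m.Count == 0 {
		return 0
	}
	return m.Sum / m.Count
}
```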

This is useful because some systems have to execute badly written queries that can use hundreds of megabytes of RAM, in part because they also require a huge tmp_table_size (otherwise they grind to a halt once their temporary tables spill to disk). Being able to sort queries by memory used would make it possible to find the worst offenders, do something about them, and prevent out-of-memory conditions.

We would also need an additional sortable column in the QAN web interface for the memory_used data.