Closed: gordan-bobic closed this issue 1 month ago
After DEBUG=1 is switched off, internal queries from ssm-server disappear from QAN. They simply don't show up. I would expect to at least see an error saying the server may not be configured for QAN, but there is nothing at all returned. QAN continues to work fine for external monitored servers.
Yup, currently once DEBUG=1 is switched off, all internal QAN data is deleted, and so are the internal mysql prometheus metrics. I thought that was expected. Do you think we should keep the internal QAN data? And the internal mysql metrics?
I think internal data expiry should NOT have any special treatment: if it exists, it should expire after the globally configured data retention period. Logging should be controlled by DEBUG=1, but expiry should be the same as everything else.
OK, then I'll preserve both the QAN data and the prometheus metric data after DEBUG=1 is switched off, until they expire. It might look confusing when you browse the web pages, though: it makes it look like DEBUG=1 is still on, but no new QAN/metric data is coming in.
I think it's more confusing the way it currently is - I might want to enable DEBUG=1 to gather some data, then disable it and analyze the data afterwards, so the server doesn't carry the overhead once I have enough data to analyze. It also means the code doesn't have to handle the special case of purging ssm-server data when DEBUG=1 isn't set.
Oh, I see, in that case, it makes sense
OK, turns out the massive QPS of the internal database is caused by the 'User Source' feature we added before, which tries to insert every User@Host that is parsed from the slow log of the monitored server. Currently those rows are stored in the table ssm.query_user_sources:
CREATE TABLE IF NOT EXISTS `ssm`.`query_user_sources` (
query_class_id INT UNSIGNED NOT NULL,
instance_id INT UNSIGNED NOT NULL,
ts TIMESTAMP(6) NOT NULL,
user VARCHAR(128) CHARSET 'utf8' NOT NULL,
host VARCHAR(255) CHARSET 'utf8' NOT NULL,
PRIMARY KEY (query_class_id, instance_id, ts, user, host)
);
I think we should change the ts from TIMESTAMP(6) to just TIMESTAMP, and add a count column to it:
CREATE TABLE IF NOT EXISTS `ssm`.`query_user_sources` (
query_class_id INT UNSIGNED NOT NULL,
instance_id INT UNSIGNED NOT NULL,
ts TIMESTAMP NOT NULL,
user VARCHAR(128) CHARSET 'utf8' NOT NULL,
host VARCHAR(255) CHARSET 'utf8' NOT NULL,
count INT UNSIGNED NOT NULL,
PRIMARY KEY (query_class_id, instance_id, ts, user, host)
);
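With a count column, the server side can fold repeated (user, host) rows into a single row per aggregation interval instead of firing one INSERT IGNORE per event. A minimal sketch of that upsert, using SQLite as a stand-in for MySQL (MySQL would use INSERT ... ON DUPLICATE KEY UPDATE count = count + VALUES(count); table and column names follow the schema above, everything else is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE query_user_sources (
        query_class_id INTEGER NOT NULL,
        instance_id    INTEGER NOT NULL,
        ts             TEXT    NOT NULL,  -- second-resolution TIMESTAMP stand-in
        user           TEXT    NOT NULL,
        host           TEXT    NOT NULL,
        count          INTEGER NOT NULL,
        PRIMARY KEY (query_class_id, instance_id, ts, user, host)
    )
""")

def upsert_user_source(row):
    # MySQL equivalent:
    #   INSERT INTO ssm.query_user_sources (...) VALUES (...)
    #   ON DUPLICATE KEY UPDATE count = count + VALUES(count);
    conn.execute("""
        INSERT INTO query_user_sources
            (query_class_id, instance_id, ts, user, host, count)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT (query_class_id, instance_id, ts, user, host)
        DO UPDATE SET count = count + excluded.count
    """, row)

# Two batches hitting the same (query_class, instance, second, user, host)
upsert_user_source((1, 1, "2024-01-01 00:00:00", "app", "10.0.0.5", 3))
upsert_user_source((1, 1, "2024-01-01 00:00:00", "app", "10.0.0.5", 2))

(total,) = conn.execute("SELECT count FROM query_user_sources").fetchone()
print(total)  # 5
```

The key point: a duplicate key now adds to the counter instead of being silently discarded, so no information is lost by aggregating.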
Which means the qan-agent on the monitored server should send aggregated user source data, instead of sending every row to the SSM server.
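On the agent side, that aggregation can be as simple as counting (user, host) pairs per reporting interval before sending. A hypothetical sketch (the real qan-agent is not Python; event shape and function name are illustrative):

```python
from collections import Counter

def aggregate_user_sources(events):
    """Collapse one reporting interval's worth of slow-log entries into
    (user, host) -> count, so the agent sends one row per distinct pair
    instead of one row per parsed User@Host line."""
    return Counter((e["user"], e["host"]) for e in events)

# e.g. four slow-log entries from the same interval
events = [
    {"user": "app",   "host": "10.0.0.5"},
    {"user": "app",   "host": "10.0.0.5"},
    {"user": "app",   "host": "10.0.0.6"},
    {"user": "batch", "host": "10.0.0.7"},
]
counts = aggregate_user_sources(events)
print(counts[("app", "10.0.0.5")])  # 2
```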
Yes, switching to TIMESTAMP should be fine. And yes, having a count per batch for each (user, host) sounds good; it would be consistent with how we save query counts (I think?). We also send query data aggregated in the same way, right?
Yeah, it's just user source data that is not aggregated.
Consistency is generally a good thing. Let's handle it the same way the query data is handled.
How come the extra load only seems to manifest with DEBUG=1? User data should be collected by the regular harvester on the client node, too.
Just had a thought - we already have query class id that allows us to use that int id instead of a larger value for referencing. How about we have a user table containing something like:
CREATE TABLE user_class (
id int unsigned NOT NULL AUTO_INCREMENT,
User varchar(80),
Host varchar(60),
PRIMARY KEY (id),
UNIQUE KEY (User, Host)
);
and then for each entry in query user sources we have a user_class_id instead of the user and host columns. It will likely save a significant amount of disk space and thus memory.
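The lookup side of such a dictionary table is a get-or-create: insert the pair if unseen, then read back its id. A sketch using SQLite as a stand-in for MySQL (MySQL would typically use INSERT IGNORE plus a SELECT, or INSERT ... ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id); the table follows the user_class definition above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_class (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        user TEXT NOT NULL,
        host TEXT NOT NULL,
        UNIQUE (user, host)
    )
""")

def user_class_id(user, host):
    # Get-or-create: the UNIQUE key makes the INSERT a no-op for
    # known pairs, then we read back the stable integer id.
    conn.execute(
        "INSERT OR IGNORE INTO user_class (user, host) VALUES (?, ?)",
        (user, host),
    )
    (uid,) = conn.execute(
        "SELECT id FROM user_class WHERE user = ? AND host = ?",
        (user, host),
    ).fetchone()
    return uid

a = user_class_id("app", "10.0.0.5")
b = user_class_id("app", "10.0.0.5")  # same pair -> same id
print(a == b)  # True
```

Every row in query_user_sources then carries one small integer instead of up to ~140 bytes of user/host text.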
I got the varchar sizes from the definition of the mysql.user table in MariaDB 10.3:
MariaDB [mysql]> desc user;
+------------------------+-----------------------------------+------+-----+----------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------------+-----------------------------------+------+-----+----------+-------+
| Host | char(60) | NO | PRI | | |
| User | char(80) | NO | PRI | | |
do later versions extend it to 128? We should have them of size equal to the maximum of what latest MariaDB and MySQL have, to make sure the data fits without getting truncated.
Let me know if you spot any other cases where this kind of "data dictionary" technique could be used to reduce the overall size of the data.
> How come the extra load only seems to manifest with DEBUG=1?
Were you monitoring it on the web page after DEBUG=1 was switched off? You can't trust those mysql prometheus metrics after DEBUG=1 is switched off; they are just stale data. I monitored the Queries status from SHOW GLOBAL STATUS, and I can still see the massive query volume.
> do later versions extend it to 128? We should have them of size equal to the maximum of what latest MariaDB and MySQL have, to make sure the data fits without getting truncated.
You are right, we should use those sizes. I forget where I got the 128 and 255 numbers from, but definitely not from mysql.user; will fix it.
And agreed on having a separate user_class table instead of storing every user/host; will change it that way.
Ah, of course, I see what you mean. User aggregation works the same way (same program) for both ssm-server internally and for regular servers being monitored via programs in the ssm-client package?
Yes, correct.
1) Some privileges are missing for internal processes:
ssm-mysqld_exporter-10000.log
2) After DEBUG=1 is switched off, internal queries from ssm-server disappear from QAN. They simply don't show up. I would expect to at least see an error saying the server may not be configured for QAN, but there is nothing at all returned. QAN continues to work fine for external monitored servers.
3) Internal database is seeing about 40% of the QPS of the server it monitors due to enormous volume of INSERT IGNORE statements, most of which bounce off the unique index constraints. This seems astronomically high.