shatteredsilicon / ssm-submodules

GNU Affero General Public License v3.0
0 stars 2 forks source link

DEBUG=1 problems #281

Closed gordan-bobic closed 1 month ago

gordan-bobic commented 1 month ago

1) Some privileges are missing for internal processes: ssm-mysqld_exporter-10000.log:

ts=2024-07-18T21:44:43.961Z caller=exporter.go:172 level=error msg="Error from scraper" scraper=info_schema.innodb_metrics target=/var/lib/mysql/mysql.sock err="Error 1227 (42000): Access denied; you need (at least one of) the PROCESS privilege(s) for this operation"
ts=2024-07-18T21:47:36.303Z caller=exporter.go:172 level=error msg="Error from scraper" scraper=slave_status target=/var/lib/mysql/mysql.sock err="Error 1227 (42000): Access denied; you need (at least one of) the SUPER, REPLICATION CLIENT privilege(s) for this operation"

2) After DEBUG=1 is switched off, internal queries from ssm-server disappear from QAN. They simply don't show up. I would expect to at least see an error saying the server may not be configured for QAN, but there is nothing at all returned. QAN continues to work fine for external monitored servers.

3) Internal database is seeing about 40% of the QPS of the server it monitors due to enormous volume of INSERT IGNORE statements, most of which bounce off the unique index constraints. This seems astronomically high.

oblitorum commented 1 month ago

After DEBUG=1 is switched off, internal queries from ssm-server disappear from QAN. They simply don't show up. I would expect to at least see an error saying the server may not be configured for QAN, but there is nothing at all returned. QAN continues to work fine for external monitored servers.

Yup, currently once DEBUG= 1 is switched off, all internal QAN data will be deleted, so will the internal mysql prometheus metrics. I thought that was expected. Do you think we should keep the internal QAN data? And the internal mysql metrics?

gordan-bobic commented 1 month ago

I think internal data expiry should NOT have any special treatment - if it exists, it should expire after the global configured data retention period. Logging should be controlled by DEBUG=1, but expiry should be same as everything else.

oblitorum commented 1 month ago

OK, then I'm going to make them preserved after DEBUG=1 is switched off, both QAN data and prometheus metric data, until they both get expired. But it might look confusing when you browse the web pages though, it makes you think that DEBUG=1 is still on, but no new QAN/metric data is coming up.

gordan-bobic commented 1 month ago

I think it's more confusing the way it currently is - I might want to enable DEBUG=1 to gather some data, then disable it and work on analyzing the data after it is disabled to not have the overheads of it on the server once I have enough data to analyze. It also means the code doesn't have to handle the special case of purging ssm-server data when DEBUG=1 isn't set.

oblitorum commented 1 month ago

Oh, I see, in that case, it makes sense

oblitorum commented 1 month ago

OK, turns out the massive QPS of the internal database is caused by the 'User Source' feature we added before, which tries to insert every User@Host that is parsed from the slow log of the monitored server.

Currently those rows are stored in table ssm.query_user_sources

CREATE TABLE IF NOT EXISTS `ssm`.`query_user_sources` (
  query_class_id  INT UNSIGNED NOT NULL,
  instance_id     INT UNSIGNED NOT NULL,
  ts              TIMESTAMP(6) NOT NULL,
  user            VARCHAR(128) CHARSET 'utf8' NOT NULL,
  host            VARCHAR(255) CHARSET 'utf8' NOT NULL,
  PRIMARY KEY (query_class_id, instance_id, ts, user, host)
);

I think we should change the ts from TIMESTAMP(6) to just TIMESTAMP, and add a count column to it:

CREATE TABLE IF NOT EXISTS `ssm`.`query_user_sources` (
  query_class_id  INT UNSIGNED NOT NULL,
  instance_id     INT UNSIGNED NOT NULL,
  ts              TIMESTAMP NOT NULL,
  user            VARCHAR(128) CHARSET 'utf8' NOT NULL,
  host            VARCHAR(255) CHARSET 'utf8' NOT NULL,
  count           INT UNSIGNED NOT NULL
  PRIMARY KEY (query_class_id, instance_id, ts, user, host)
);

Which means the qan-agent on the monitored server should send an aggregated user source data, instead of sending every row to the SSM server.

gordan-bobic commented 1 month ago

Yes, switching to timestamp should be fine. And yes, having a count for each batch for (user, host) sounds good, it would be consistent with how we save query counts (I think?). We also send query data aggregated in the same way, right?

oblitorum commented 1 month ago

Yeah, it's just user source data that is not aggregated.

gordan-bobic commented 1 month ago

Consistency is generally a good thing. Let's handle it the same way the query data is handled.

How come the extra load only seems to manifest with DEBUG=1? User data should be collected by the regular harvester on the client node, too.

gordan-bobic commented 1 month ago

Just had a thought - we already have query class id that allows us to use that int id instead of a larger value for referencing. How about we have a user table containing something like:

CREATE TABLE user_class
    id int unisgnedNOT NULL AUTO_INCREMENT,
    User varchar(80),
    Host varchar(60),
    PRIMARY KEY (id),
    UNIQUE KEY (User, Host)
);

and then for each entry in query user sources we have user_class_id instead user and host columns. It will likely save a significant amount of disk space and thus memory.

I got the varchars from definition of mysql.user table in MariaDB 10.3:

MariaDB [mysql]> desc user;
+------------------------+-----------------------------------+------+-----+----------+-------+
| Field                  | Type                              | Null | Key | Default  | Extra |
+------------------------+-----------------------------------+------+-----+----------+-------+
| Host                   | char(60)                          | NO   | PRI |          |       |
| User                   | char(80)                          | NO   | PRI |          |       |

do later versions extend it to 128? We should have them of size equal to the maximum of what latest MariaDB and MySQL have, to make sure the data fits without getting truncated.

Let me know if you spot any other cases where this kind of "data dictionary" technique could be used to reduce the overall size of the data.

oblitorum commented 1 month ago

How come the extra load only seems to manifest with DEBUG=1?

Were you monitoring it on the web page when DEBUG=1 is switched off ? You can't trust those mysql prometheus metrics after DEBUG=1 is switched off, those are just stale data. I monitored the Queries status from SHOW GLOBAL STATUS, I can still see the massive query amount.

do later versions extend it to 128? We should have them of size equal to the maximum of what latest MariaDB and MySQL have, to make sure the data fits without getting truncated.

You are right, we should use those sizes, I forgot where did I get that 128 and 255 number from, but definitely not from mysql.user , will fix it.

oblitorum commented 1 month ago

And agree to have another user_class table instead of storing every user/host, will change it that way

gordan-bobic commented 1 month ago

Ah, of course, I see what you mean. User aggregation works the same way (same program) for both ssm-server internally and for regular servers being monitored via programs in the ssm-client package?

oblitorum commented 1 month ago

Yes, correct.