Closed: Foxeronie closed this issue 1 year ago.
Hi,
restarting telegraf takes much longer: real 1m30.335s
I think this is related to the root cause, but not the source of the issue. There is probably some sort of connection hanging with one or more of your mysql servers and it takes time to disconnect when it tries to restart. It looks like some 90 second timeout is going off.
It makes no difference if I increase the interval for the mysql input.
What about setting a timeout in the dsn, does that provide more details?
After downgrading to 1.25.3 again, the problem is gone.
There were a number of changes to the plugin between 1.25.3 and 1.26.0. Mainly linting + secret store support. How long does this take to reproduce? Could we give you some test artifacts to try to bisect where the issue was introduced?
I think this is related to the root cause, but not the source of the issue. There is probably some sort of connection hanging with one or more of your mysql servers and it takes time to disconnect when it tries to restart. It looks like some 90 second timeout is going off.
You are right, a timeout was reached.
Mar 22 17:11:50 it-dads-glr-cmn-01.domain.de systemd[1]: Stopping Telegraf...
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: State 'stop-sigterm' timed out. Killing.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978564 (telegraf) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978570 (telegraf) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978572 (telegraf) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978573 (telegraf) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978574 (telegraf) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978575 (n/a) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978576 (telegraf) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978577 (telegraf) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978578 (n/a) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978579 (n/a) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2983980 (telegraf) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 3014838 (n/a) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 3014839 (telegraf) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Main process exited, code=killed, status=9/KILL
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Failed with result 'timeout'.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: Stopped Telegraf.
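For reference, the 90 seconds between "Stopping Telegraf..." and the SIGKILL above match systemd's default stop timeout rather than anything inside Telegraf. A quick way to confirm the value that applies to the unit (a sketch, assuming a standard systemd installation):

# Show the stop timeout systemd waits for after SIGTERM before escalating to SIGKILL
systemctl show telegraf.service --property=TimeoutStopUSec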
What about setting a timeout in the dsn, does that provide more details?
I added ?timeout=60s to every server, but no other output appears in the logs.
Trying 1s instead showed the same result.
The metric collection of telegraf itself is done in under 1 second per server.
2023-03-22T17:17:00.010460945+01:00 stdout F 20230322 16:17:00,it-dads-glr-cmn-01.domain.de,metricbeat,10.89.6.1,74586,0,CONNECT,,,0
2023-03-22T17:17:00.028325870+01:00 stdout F 20230322 16:17:00,it-dads-glr-cmn-01.domain.de,metricbeat,10.89.6.1,74586,2921231,READ,performance_schema,table_io_waits_summary_by_table,
2023-03-22T17:17:00.028611451+01:00 stdout F 20230322 16:17:00,it-dads-glr-cmn-01.domain.de,metricbeat,10.89.6.1,74586,2921232,READ,performance_schema,table_io_waits_summary_by_index_usage,
2023-03-22T17:17:00.029136401+01:00 stdout F 20230322 16:17:00,it-dads-glr-cmn-01.domain.de,metricbeat,10.89.6.1,74586,2921236,READ,performance_schema,table_lock_waits_summary_by_table,
2023-03-22T17:17:00.029425099+01:00 stdout F 20230322 16:17:00,it-dads-glr-cmn-01.domain.de,metricbeat,10.89.6.1,74586,2921237,READ,performance_schema,events_waits_summary_global_by_event_name,
2023-03-22T17:17:00.029514527+01:00 stdout F 20230322 16:17:00,it-dads-glr-cmn-01.domain.de,metricbeat,10.89.6.1,74586,0,DISCONNECT,,,0
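For reference, a minimal sketch of how such a DSN timeout looks in the inputs.mysql configuration, assuming the go-sql-driver DSN syntax used by the plugin (host, port and user below are placeholders, not the reporter's actual values):

[[inputs.mysql]]
  # The timeout parameter only bounds connection establishment (dial);
  # readTimeout/writeTimeout are separate DSN parameters, which may explain
  # why it produced no additional output for an already-hung connection.
  servers = ["monitor@tcp(127.0.0.1:3306)/?timeout=60s"]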
There were a number of changes to the plugin between 1.25.3 and 1.26.0. Mainly linting + secret store support. How long does this take to reproduce? Could we give you some test artifacts to try to bisect where the issue was introduced?
It takes a couple of hours (2-12) until the problems start, so I don't think I can trigger it manually. Sure, we can try some stuff.
One more clarification: does telegraf ever recover after the "Collection took longer than expected" message shows up, or do metrics from this plugin stop entirely?
In general, that message means that the collection took longer than 5 minutes; we skip the next collection interval to avoid conflicts, and then things should continue as expected at the next interval.
Nope, as soon as the problem starts, the mysql metric collection always fails until I restart the telegraf service.
Alright, both external dependencies were the same in v1.25.0 and v1.26.0.
I have found 3 commits related to the mysql plugin that are in master but not in the v1.25.0 release branch. Here is what I would ask: first, try the artifacts in https://github.com/influxdata/telegraf/pull/12929:
Thanks!
While #12929 reproduced the issue, #12928 has been running since Friday without problems.
More details about my MySQL configuration: all MariaDB servers are running as containers with Podman. It's a Galera cluster setup, so I have 3 servers with this setup (all of them have this issue).
podman version:
podman version 3.4.4
podman compose file:
---
version: '3'
networks:
  weboffice:
    driver: bridge
services:
  weboffice:
    env_file: /opt/galera/config/env/it-dads-weboffice.env
    logging:
      driver: k8s-file
      options:
        path: "/var/log/podman/it-dads-weboffice.log"
    image: bitnami/mariadb-galera:10.6.10
    hostname: ${HOST_NAME}
    volumes:
      - /opt/galera/data/it-dads/weboffice:/bitnami/mariadb/data
      - /opt/galera/config/cluster/it-dads/weboffice.cnf:/opt/bitnami/mariadb/conf/bitnami/service.cnf:ro
      - /opt/galera/config/certs/${HOST_NAME}/${HOST_NAME}.pem:/bitnami/mariadb/certs/cert.pem:ro
      - /opt/galera/config/certs/${HOST_NAME}/${HOST_NAME}.key:/bitnami/mariadb/certs/key.pem:ro
      - /opt/galera/config/certs/ca.pem:/bitnami/mariadb/certs/ca.pem:ro
      - /opt/galera/config/bootstrap/glr-cmn:/docker-entrypoint-startdb.d
      - /opt/galera/config/default/default.cnf:/opt/bitnami/mariadb/conf/bitnami/my_custom.cnf:ro
    ports:
      - 7306:3306
      - 7444:7444
      - 7567:7567
      - 7568:7568
    environment:
      # SSL settings
      MARIADB_ENABLE_TLS: yes
      MARIADB_TLS_CERT_FILE: /bitnami/mariadb/certs/cert.pem
      MARIADB_TLS_KEY_FILE: /bitnami/mariadb/certs/key.pem
      MARIADB_TLS_CA_FILE: /bitnami/mariadb/certs/ca.pem
      # Galera settings
      #MARIADB_GALERA_CLUSTER_BOOTSTRAP: yes
      MARIADB_GALERA_CLUSTER_ADDRESS: 'gcomm://192.168.217.64:7567,192.168.217.65:7567,192.168.217.66:7567'
      MARIADB_EXTRA_FLAGS: "--wsrep_provider_options=ist.recv_addr=${HOST_IP}:7568;ist.recv_bind=0.0.0.0:7568;base_port=7567 --wsrep-node-incoming-address=${HOST_IP}:7306 --wsrep-sst-receive-address=${HOST_IP}:7444"
    networks:
      - weboffice
    deploy:
      resources:
        limits:
          cpus: '3'
          memory: 2048M
        reservations:
          cpus: '0.5'
          memory: 2048M
While https://github.com/influxdata/telegraf/pull/12929 reproduced the issue, https://github.com/influxdata/telegraf/pull/12928 has been running since Friday without problems.
OK, so that points to the secret store PR https://github.com/influxdata/telegraf/pull/12591 as the source of the regression. I'll chat with @srebhan
Hello,
I'm experiencing the same issue as described above with v1.26.0: collection stops after 1-2 hours and then the log only shows:
[inputs.sqlserver] Collection took longer than expected; not complete after interval of 1m0s
No other errors appear in the logs.
@Foxeronie and @N3v3R3nD sorry for causing trouble in your installations! Can you guys please try the artifact in PR #12919 and see if this improves the situation!?
Hello @srebhan, I've tested the new artifact, but unfortunately it still stops collecting after a while:
I did notice, however, that my [[inputs.postgresql]] input stops in addition to [[inputs.sqlserver]], while others such as [[inputs.internal]] and [[inputs.cpu]], which monitor the host itself, keep running fine at the same time.
Small output from log:
2023-03-29T10:21:01Z W! [inputs.postgresql] Collection took longer than expected; not complete after interval of 1m0s
2023-03-29T10:21:01Z W! [inputs.sqlserver] Collection took longer than expected; not complete after interval of 1m0s
2023-03-29T09:48:37Z I! [inputs.sqlserver] Config: Effective Queries: []string{"SQLServerSchedulers", "SQLServerVolumeSpace", "SQLServerCpu", "SQLServerWaitStatsCategorized", "SQLServerDatabaseIO", "SQLServerMemoryClerks", "SQLServerRecentBackups", "SQLServerPerformanceCounters", "SQLServerProperties", "SQLServerRequests"}
2023-03-29T09:48:37Z I! [inputs.sqlserver] Config: database_type: SQLServer , query_version:0 , azuredb: false
2023-03-29T09:48:37Z I! [inputs.sqlserver] Config: Effective Queries: []string{"SQLServerRecentBackups", "SQLServerPerformanceCounters", "SQLServerWaitStatsCategorized", "SQLServerDatabaseIO", "SQLServerProperties", "SQLServerVolumeSpace", "SQLServerCpu", "SQLServerMemoryClerks", "SQLServerSchedulers", "SQLServerRequests"}
2023-03-29T09:48:37Z I! [inputs.sqlserver] Config: database_type: SQLServer , query_version:0 , azuredb: false
Hi @srebhan !
https://github.com/influxdata/telegraf/pull/12919 reproduces the issue too.
If you need more information to debug just ask.
Best regards, Patrick
Can you monitor the amount of locked memory? You can get it via grep VmLck /proc/<telegraf PID>/status
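A minimal sketch of such a periodic check as a cron entry, assuming a single telegraf process that pidof can resolve (file path and log destination are arbitrary):

# /etc/cron.d/telegraf-vmlck: log the locked memory of the telegraf process once per minute
* * * * * root echo "[$(date)] $(grep VmLck /proc/$(pidof telegraf)/status)" >> /var/log/telegraf-vmlck.log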
...
Probably the same root cause as #12980 and #12982. Hope you guys (@Foxeronie and @N3v3R3nD) can confirm it's the locked memory...
I set up a cronjob to log this value every minute.
[Wed Mar 29 09:10:01 PM CEST 2023]
[Wed Mar 29 09:11:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:12:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:13:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:14:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:15:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:16:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:17:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:18:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:19:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:20:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:21:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:22:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:23:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:24:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:25:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:26:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:27:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:28:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:29:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:30:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:31:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:32:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:33:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:34:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:35:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:36:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:37:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:38:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:39:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:40:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:41:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:42:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:43:02 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:44:02 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:45:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:46:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:47:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:48:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:49:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:50:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:51:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:52:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 09:53:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 09:54:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 09:55:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 09:56:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 09:57:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 09:58:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 09:59:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 10:00:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 10:01:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 10:02:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 10:03:01 PM CEST 2023] VmLck: 32 kB
The change from 12 kB to 32 kB matches the beginning of the issue (the Prometheus output has a 90s expiration_interval, which could account for the extra minute of valid data).
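For context, a sketch of the output option being referred to, using the 90s value mentioned above (not necessarily the reporter's full configuration):

[[outputs.prometheus_client]]
  # Metrics stay visible on /metrics for this long after their last update,
  # so a stalled input can still look healthy for roughly one extra scrape.
  expiration_interval = "90s"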
@N3v3R3nD and @Foxeronie can you please check the binary in PR #12988 once CI finished the tests and monitor the locked pages again!?!?
Note: This PR removes locking of the buffer returned by Get(). This is insecure and not suitable for production, as the secret might be paged out (to swap) and thus persisted on disk. Use with care and only for non-critical systems!
You are talking about the secrets that are used for the mysql connection, right?
Could it be a problem for the secret store (logic) that I don't use passwords? My mysql user can connect without a password, since it is a read-only user with limited rights on localhost only.
I will test your PR as soon it is ready.
@Foxeronie yes, I'm talking about the secrets for the mysql connection. Please note that every DSN is a secret now (even though it might not contain anything sensitive). Telegraf does not make any assumptions about the content other than it being a byte sequence, so not having a password should not be a problem.
Btw, artifacts can be found here: https://app.circleci.com/pipelines/github/influxdata/telegraf/15623/workflows/9390b4f8-26f3-4fc4-9540-e137277566f5/jobs/245254/artifacts
Ah okay, thanks for the explanation.
The artifact produced the same result; metric collection stopped around 04:13 PM.
[...]
[Fri Mar 31 04:09:01 PM CEST 2023] VmLck: 12 kB
[Fri Mar 31 04:10:01 PM CEST 2023] VmLck: 12 kB
[Fri Mar 31 04:11:01 PM CEST 2023] VmLck: 12 kB
[Fri Mar 31 04:12:01 PM CEST 2023] VmLck: 12 kB
[Fri Mar 31 04:13:01 PM CEST 2023] VmLck: 32 kB
[Fri Mar 31 04:14:01 PM CEST 2023] VmLck: 32 kB
[Fri Mar 31 04:15:01 PM CEST 2023] VmLck: 32 kB
[Fri Mar 31 04:16:01 PM CEST 2023] VmLck: 32 kB
[Fri Mar 31 04:17:01 PM CEST 2023] VmLck: 32 kB
[...]
@Foxeronie with the binary in #12993 you should see a warning if the limit for locked memory is too low. Can you please check your logs!?
Furthermore, if you use an inputs.sqlserver instance, you also need #13002, which has a fix for this plugin.
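For reference, a way to check the locked-memory limit that applies to the running telegraf process (a sketch using standard procfs and util-linux tools):

# Effective memlock limit of the running process
grep "Max locked memory" /proc/$(pidof telegraf)/limits
# Alternative view via util-linux
prlimit --memlock --pid "$(pidof telegraf)"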
There are no new warnings in the logs. Unfortunately, our config management software had disabled debug mode; I hope the new message is not logged at debug level. I have enabled debug mode again and will give you an update if anything shows up.
2023-03-31T12:01:44Z I! Starting Telegraf 1.27.0-2be4d234
2023-03-31T12:01:44Z I! Available plugins: 235 inputs, 9 aggregators, 27 processors, 22 parsers, 57 outputs, 2 secret-stores
2023-03-31T12:01:44Z I! Loaded inputs: cpu disk diskio kernel mem mysql net processes swap system
2023-03-31T12:01:44Z I! Loaded aggregators:
2023-03-31T12:01:44Z I! Loaded processors:
2023-03-31T12:01:44Z I! Loaded secretstores:
2023-03-31T12:01:44Z I! Loaded outputs: graphite prometheus_client
2023-03-31T12:01:44Z I! Tags enabled: host=it-dads-glr-cmn-01
2023-03-31T12:01:44Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"it-dads-glr-cmn-01", Flush Interval:10s
2023-03-31T12:01:44Z I! [outputs.prometheus_client] Listening on http://[::]:9273/metrics
2023-03-31T20:11:00Z W! [inputs.mysql] Collection took longer than expected; not complete after interval of 1m0s
2023-03-31T20:12:00Z W! [inputs.mysql] Collection took longer than expected; not complete after interval of 1m0s
2023-03-31T20:13:00Z W! [inputs.mysql] Collection took longer than expected; not complete after interval of 1m0s
Furthermore, if you use an inputs.sqlserver instance, you also need https://github.com/influxdata/telegraf/pull/13002, which has a fix for this plugin.
I only have inputs.mysql active.
Hello,
I've been running the artifact from https://github.com/influxdata/telegraf/pull/13002 since Friday. I can confirm the issue seems to be fixed; I have not had any gaps in my polling since then.
@Foxeronie can you please check again with https://github.com/influxdata/telegraf/pull/13002 which should have all fixes included...
@Foxeronie this is not the correct version... It should read 1.27.0~947190c8 or something like this.
@Foxeronie this is not the correct version... It should read 1.27.0~947190c8 or something like this.
I was writing my post while you were already asking me to test the other artifact. :) That's why I deleted my comment.
:~# telegraf version
Telegraf 1.27.0-947190c8 (git: pull/13002@947190c8)
Tests already started. 👍
Released the fixes with 1.26.1. Will close this for now, @Foxeronie if you are still having trouble please reopen!
Relevant telegraf.conf
Logs from Telegraf
System info
1.26.0, Ubuntu 22.04
Docker
No response
Steps to reproduce
Expected behavior
mysql metric collection should work continuously, as it did with 1.25.3
Actual behavior
After a couple of hours, telegraf can't collect mysql metrics anymore.
Other inputs are still working.
Additional info
Hello,
It makes no difference if I increase the interval for the mysql input.
With 1.25.3 the interval was one minute and there was no problem with it.
After upgrading to 1.26.0 and increasing the interval to five minutes, the problem still exists.
2023-03-21T09:45:00Z D! [inputs.mysql] Previous collection has not completed; scheduled collection skipped
2023-03-21T09:45:00Z W! [inputs.mysql] Collection took longer than expected; not complete after interval of 5m0s
As soon as the problem starts, restarting telegraf takes much longer.
Problem exists
Normal restart
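The restart timings above were presumably captured along these lines (a sketch; the exact invocation is not part of the report):

# With the hang present, this approaches systemd's ~1m30s stop timeout
time systemctl restart telegraf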
After downgrading to 1.25.3 again, the problem is gone.
Best regards, Patrick