Closed: Foxeronie closed this issue 1 year ago.
Hi,
restarting telegraf takes much longer: real 1m30.335s
I think this is related to the root cause, but not the source of the issue. There is probably some sort of connection hanging with one or more of your mysql servers and it takes time to disconnect when it tries to restart. It looks like some 90 second timeout is going off.
It makes no difference if I increase the interval for the mysql input.
What about setting a timeout in the dsn, does that provide more details?
After downgrading to 1.25.3 again, the problem is gone.
There were a number of changes to the plugin between 1.25.3 and 1.26.0. Mainly linting + secret store support. How long does this take to reproduce? Could we give you some test artifacts to try to bisect where the issue was introduced?
I think this is related to the root cause, but not the source of the issue. There is probably some sort of connection hanging with one or more of your mysql servers and it takes time to disconnect when it tries to restart. It looks like some 90 second timeout is going off.
You are right, a timeout was reached.
Mar 22 17:11:50 it-dads-glr-cmn-01.domain.de systemd[1]: Stopping Telegraf...
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: State 'stop-sigterm' timed out. Killing.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978564 (telegraf) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978570 (telegraf) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978572 (telegraf) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978573 (telegraf) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978574 (telegraf) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978575 (n/a) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978576 (telegraf) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978577 (telegraf) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978578 (n/a) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2978579 (n/a) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 2983980 (telegraf) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 3014838 (n/a) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Killing process 3014839 (telegraf) with signal SIGKILL.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Main process exited, code=killed, status=9/KILL
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: telegraf.service: Failed with result 'timeout'.
Mar 22 17:13:20 it-dads-glr-cmn-01.domain.de systemd[1]: Stopped Telegraf.
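For reference, the 90 seconds between "Stopping Telegraf..." and the SIGKILL above match systemd's default stop timeout rather than anything inside Telegraf. A quick way to confirm the value that applies to the unit (a sketch, assuming a standard systemd installation):

# Show the stop timeout systemd waits for after SIGTERM before escalating to SIGKILL
systemctl show telegraf.service --property=TimeoutStopUSec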
What about setting a timeout in the dsn, does that provide more details?
I added ?timeout=60s to every server, but no other output appears in the logs.
Trying 1s instead showed the same result.
The metric collection of telegraf itself is done in under 1 second per server.
2023-03-22T17:17:00.010460945+01:00 stdout F 20230322 16:17:00,it-dads-glr-cmn-01.domain.de,metricbeat,10.89.6.1,74586,0,CONNECT,,,0
2023-03-22T17:17:00.028325870+01:00 stdout F 20230322 16:17:00,it-dads-glr-cmn-01.domain.de,metricbeat,10.89.6.1,74586,2921231,READ,performance_schema,table_io_waits_summary_by_table,
2023-03-22T17:17:00.028611451+01:00 stdout F 20230322 16:17:00,it-dads-glr-cmn-01.domain.de,metricbeat,10.89.6.1,74586,2921232,READ,performance_schema,table_io_waits_summary_by_index_usage,
2023-03-22T17:17:00.029136401+01:00 stdout F 20230322 16:17:00,it-dads-glr-cmn-01.domain.de,metricbeat,10.89.6.1,74586,2921236,READ,performance_schema,table_lock_waits_summary_by_table,
2023-03-22T17:17:00.029425099+01:00 stdout F 20230322 16:17:00,it-dads-glr-cmn-01.domain.de,metricbeat,10.89.6.1,74586,2921237,READ,performance_schema,events_waits_summary_global_by_event_name,
2023-03-22T17:17:00.029514527+01:00 stdout F 20230322 16:17:00,it-dads-glr-cmn-01.domain.de,metricbeat,10.89.6.1,74586,0,DISCONNECT,,,0
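For reference, a minimal sketch of how such a DSN timeout looks in the inputs.mysql configuration, assuming the go-sql-driver DSN syntax used by the plugin (host, port and user below are placeholders, not the reporter's actual values):

[[inputs.mysql]]
  # The timeout parameter only bounds connection establishment (dial);
  # readTimeout/writeTimeout are separate DSN parameters, which may explain
  # why it produced no additional output for an already-hung connection.
  servers = ["monitor@tcp(127.0.0.1:3306)/?timeout=60s"]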
There were a number of changes to the plugin between 1.25.3 and 1.26.0. Mainly linting + secret store support. How long does this take to reproduce? Could we give you some test artifacts to try to bisect where the issue was introduced?
It takes a couple of hours (2-12) until the problems start, so I don't think I can trigger it manually. Sure, we can try some stuff.
One more clarification: does telegraf ever recover after the "Collection took longer than expected" message shows up, or do metrics from this plugin stop entirely?
In general, that message means that the collection took longer than 5 minutes; we skip the next collection interval to avoid conflicts, and then things should continue as expected at the next interval.
Nope, as soon as the problem starts, the mysql metric collection always fails until I restart the telegraf service.
Alright, both external dependencies were the same in v1.25.0 and v1.26.0.
I have found 3 commits related to the mysql plugin that are in master but not in the v1.25.0 release branch. Here is what I would ask: first, try the artifacts in https://github.com/influxdata/telegraf/pull/12929:
Thanks!
While #12929 reproduced the issue, #12928 has been running since Friday without problems.
More details about my MySQL configuration: all MariaDB servers are running as containers with Podman. It's a Galera cluster setup, so I have 3 servers with this setup (all of them have this issue).
podman version:
podman version 3.4.4
podman compose file:
---
version: '3'
networks:
  weboffice:
    driver: bridge
services:
  weboffice:
    env_file: /opt/galera/config/env/it-dads-weboffice.env
    logging:
      driver: k8s-file
      options:
        path: "/var/log/podman/it-dads-weboffice.log"
    image: bitnami/mariadb-galera:10.6.10
    hostname: ${HOST_NAME}
    volumes:
      - /opt/galera/data/it-dads/weboffice:/bitnami/mariadb/data
      - /opt/galera/config/cluster/it-dads/weboffice.cnf:/opt/bitnami/mariadb/conf/bitnami/service.cnf:ro
      - /opt/galera/config/certs/${HOST_NAME}/${HOST_NAME}.pem:/bitnami/mariadb/certs/cert.pem:ro
      - /opt/galera/config/certs/${HOST_NAME}/${HOST_NAME}.key:/bitnami/mariadb/certs/key.pem:ro
      - /opt/galera/config/certs/ca.pem:/bitnami/mariadb/certs/ca.pem:ro
      - /opt/galera/config/bootstrap/glr-cmn:/docker-entrypoint-startdb.d
      - /opt/galera/config/default/default.cnf:/opt/bitnami/mariadb/conf/bitnami/my_custom.cnf:ro
    ports:
      - 7306:3306
      - 7444:7444
      - 7567:7567
      - 7568:7568
    environment:
      # SSL settings
      MARIADB_ENABLE_TLS: yes
      MARIADB_TLS_CERT_FILE: /bitnami/mariadb/certs/cert.pem
      MARIADB_TLS_KEY_FILE: /bitnami/mariadb/certs/key.pem
      MARIADB_TLS_CA_FILE: /bitnami/mariadb/certs/ca.pem
      # Galera settings
      #MARIADB_GALERA_CLUSTER_BOOTSTRAP: yes
      MARIADB_GALERA_CLUSTER_ADDRESS: 'gcomm://192.168.217.64:7567,192.168.217.65:7567,192.168.217.66:7567'
      MARIADB_EXTRA_FLAGS: "--wsrep_provider_options=ist.recv_addr=${HOST_IP}:7568;ist.recv_bind=0.0.0.0:7568;base_port=7567 --wsrep-node-incoming-address=${HOST_IP}:7306 --wsrep-sst-receive-address=${HOST_IP}:7444"
    networks:
      - weboffice
    deploy:
      resources:
        limits:
          cpus: '3'
          memory: 2048M
        reservations:
          cpus: '0.5'
          memory: 2048M
While https://github.com/influxdata/telegraf/pull/12929 reproduced the issue, https://github.com/influxdata/telegraf/pull/12928 has been running since Friday without problems.
OK, so that points to the secret store PR https://github.com/influxdata/telegraf/pull/12591 as the source of the regression. I'll chat with @srebhan
Hello,
I'm experiencing the same issue as described above with v1.26.0: collection stops after 1-2 hours and then the log only shows:
[inputs.sqlserver] Collection took longer than expected; not complete after interval of 1m0s
No other errors appear in the logs.
@Foxeronie and @N3v3R3nD sorry for causing trouble in your installations! Can you guys please try the artifact in PR #12919 and see if this improves the situation!?
Hello @srebhan, I've tested the new artifact, but unfortunately it still stops collecting after a while:
I did notice, however, that my [[inputs.postgresql]] input stops in addition to [[inputs.sqlserver]], while others such as [[inputs.internal]] and [[inputs.cpu]], which monitor the host itself, keep running fine at the same time.
Small output from log:
2023-03-29T10:21:01Z W! [inputs.postgresql] Collection took longer than expected; not complete after interval of 1m0s
2023-03-29T10:21:01Z W! [inputs.sqlserver] Collection took longer than expected; not complete after interval of 1m0s
2023-03-29T09:48:37Z I! [inputs.sqlserver] Config: Effective Queries: []string{"SQLServerSchedulers", "SQLServerVolumeSpace", "SQLServerCpu", "SQLServerWaitStatsCategorized", "SQLServerDatabaseIO", "SQLServerMemoryClerks", "SQLServerRecentBackups", "SQLServerPerformanceCounters", "SQLServerProperties", "SQLServerRequests"}
2023-03-29T09:48:37Z I! [inputs.sqlserver] Config: database_type: SQLServer , query_version:0 , azuredb: false
2023-03-29T09:48:37Z I! [inputs.sqlserver] Config: Effective Queries: []string{"SQLServerRecentBackups", "SQLServerPerformanceCounters", "SQLServerWaitStatsCategorized", "SQLServerDatabaseIO", "SQLServerProperties", "SQLServerVolumeSpace", "SQLServerCpu", "SQLServerMemoryClerks", "SQLServerSchedulers", "SQLServerRequests"}
2023-03-29T09:48:37Z I! [inputs.sqlserver] Config: database_type: SQLServer , query_version:0 , azuredb: false
Hi @srebhan !
https://github.com/influxdata/telegraf/pull/12919 reproduces the issue too.
If you need more information to debug just ask.
Best regards, Patrick
Can you monitor the amount of locked memory? You can get it via grep VmLck /proc/<telegraf PID>/status
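A minimal sketch of such a periodic check as a cron entry, assuming a single telegraf process that pidof can resolve (file path and log destination are arbitrary):

# /etc/cron.d/telegraf-vmlck: log the locked memory of the telegraf process once per minute
* * * * * root echo "[$(date)] $(grep VmLck /proc/$(pidof telegraf)/status)" >> /var/log/telegraf-vmlck.log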
...
Probably the same root cause as #12980 and #12982. Hope you guys (@Foxeronie and @N3v3R3nD) can confirm it's the locked memory...
I set up a cronjob to log this value every minute.
[Wed Mar 29 09:10:01 PM CEST 2023]
[Wed Mar 29 09:11:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:12:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:13:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:14:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:15:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:16:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:17:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:18:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:19:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:20:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:21:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:22:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:23:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:24:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:25:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:26:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:27:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:28:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:29:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:30:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:31:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:32:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:33:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:34:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:35:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:36:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:37:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:38:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:39:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:40:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:41:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:42:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:43:02 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:44:02 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:45:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:46:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:47:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:48:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:49:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:50:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:51:01 PM CEST 2023] VmLck: 12 kB
[Wed Mar 29 09:52:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 09:53:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 09:54:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 09:55:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 09:56:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 09:57:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 09:58:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 09:59:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 10:00:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 10:01:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 10:02:01 PM CEST 2023] VmLck: 32 kB
[Wed Mar 29 10:03:01 PM CEST 2023] VmLck: 32 kB
The change from 12 kB to 32 kB matches the beginning of the issue (the Prometheus output has a 90s expiration_interval, which could account for the extra minute of valid data).
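For context, a sketch of the output option being referred to, using the 90s value mentioned above (not necessarily the reporter's full configuration):

[[outputs.prometheus_client]]
  # Metrics stay visible on /metrics for this long after their last update,
  # so a stalled input can still look healthy for roughly one extra scrape.
  expiration_interval = "90s"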
@N3v3R3nD and @Foxeronie can you please check the binary in PR #12988 once CI finished the tests and monitor the locked pages again!?!?
Note: This PR removes locking of the buffer returned by Get(). This is insecure and not suitable for production, as the secret might be paged out (to swap) and thus persisted on disk. Use with care and only for non-critical systems!
You are talking about the secrets that are used for the mysql connection, right?
Could it be a problem for the secret store (logic) that I don't use passwords? My mysql user can connect without a password, since it is a read-only user with limited rights on localhost only.
I will test your PR as soon it is ready.
@Foxeronie yes, I'm talking about the secrets for the mysql connection. Please note that every DSN is a secret now (even though it might not contain anything sensitive). Telegraf does not make any assumptions about the content other than it being a byte sequence, so not having a password should not be a problem.
Btw, artifacts can be found here: https://app.circleci.com/pipelines/github/influxdata/telegraf/15623/workflows/9390b4f8-26f3-4fc4-9540-e137277566f5/jobs/245254/artifacts
Ah okay, thanks for the explanation.
The artifact produced the same result; metric collection stopped around 04:13 PM.
[...]
[Fri Mar 31 04:09:01 PM CEST 2023] VmLck: 12 kB
[Fri Mar 31 04:10:01 PM CEST 2023] VmLck: 12 kB
[Fri Mar 31 04:11:01 PM CEST 2023] VmLck: 12 kB
[Fri Mar 31 04:12:01 PM CEST 2023] VmLck: 12 kB
[Fri Mar 31 04:13:01 PM CEST 2023] VmLck: 32 kB
[Fri Mar 31 04:14:01 PM CEST 2023] VmLck: 32 kB
[Fri Mar 31 04:15:01 PM CEST 2023] VmLck: 32 kB
[Fri Mar 31 04:16:01 PM CEST 2023] VmLck: 32 kB
[Fri Mar 31 04:17:01 PM CEST 2023] VmLck: 32 kB
[...]
@Foxeronie with the binary in #12993 you should see a warning if the limit for locked memory is too low. Can you please check your logs!?
Furthermore, if you use an inputs.sqlserver instance, you also need #13002, which has a fix for this plugin.
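For reference, a way to check the locked-memory limit that applies to the running telegraf process (a sketch using standard procfs and util-linux tools):

# Effective memlock limit of the running process
grep "Max locked memory" /proc/$(pidof telegraf)/limits
# Alternative view via util-linux
prlimit --memlock --pid "$(pidof telegraf)"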
There are no new warnings in the logs. Unfortunately, our config management software had disabled debug mode; I hope the new message is not logged at debug level. I have enabled debug mode again and will give you an update if anything shows up.
2023-03-31T12:01:44Z I! Starting Telegraf 1.27.0-2be4d234
2023-03-31T12:01:44Z I! Available plugins: 235 inputs, 9 aggregators, 27 processors, 22 parsers, 57 outputs, 2 secret-stores
2023-03-31T12:01:44Z I! Loaded inputs: cpu disk diskio kernel mem mysql net processes swap system
2023-03-31T12:01:44Z I! Loaded aggregators:
2023-03-31T12:01:44Z I! Loaded processors:
2023-03-31T12:01:44Z I! Loaded secretstores:
2023-03-31T12:01:44Z I! Loaded outputs: graphite prometheus_client
2023-03-31T12:01:44Z I! Tags enabled: host=it-dads-glr-cmn-01
2023-03-31T12:01:44Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"it-dads-glr-cmn-01", Flush Interval:10s
2023-03-31T12:01:44Z I! [outputs.prometheus_client] Listening on http://[::]:9273/metrics
2023-03-31T20:11:00Z W! [inputs.mysql] Collection took longer than expected; not complete after interval of 1m0s
2023-03-31T20:12:00Z W! [inputs.mysql] Collection took longer than expected; not complete after interval of 1m0s
2023-03-31T20:13:00Z W! [inputs.mysql] Collection took longer than expected; not complete after interval of 1m0s
Furthermore, if you use an inputs.sqlserver instance, you also need https://github.com/influxdata/telegraf/pull/13002, which has a fix for this plugin.
I only have inputs.mysql active.
Hello,
I've been running the artifact from https://github.com/influxdata/telegraf/pull/13002 since Friday. I can confirm the issue seems to be fixed; I have not had any gaps in my polling since then.
@Foxeronie can you please check again with https://github.com/influxdata/telegraf/pull/13002 which should have all fixes included...
@Foxeronie this is not the correct version... It should read 1.27.0~947190c8 or something like this.
@Foxeronie this is not the correct version... It should read 1.27.0~947190c8 or something like this.
I was writing my post while you were already asking me to test the other artifact. :) That's why I deleted my comment.
:~# telegraf version
Telegraf 1.27.0-947190c8 (git: pull/13002@947190c8)
Tests already started. 👍
Released the fixes with 1.26.1. Will close this for now, @Foxeronie if you are still having trouble please reopen!
Relevant telegraf.conf
Logs from Telegraf
System info
1.26.0, Ubuntu 22.04
Docker
No response
Steps to reproduce
Expected behavior
mysql metric collection should work continuously, as it did with 1.25.3
Actual behavior
After a couple of hours, telegraf can't collect mysql metrics anymore.
Other inputs are still working.
Additional info
Hello,
It makes no difference if I increase the interval for the mysql input.
With 1.25.3 the interval was one minute and there was no problem with it.
After upgrading to 1.26.0 and increasing the interval to five minutes, the problem still exists.
2023-03-21T09:45:00Z D! [inputs.mysql] Previous collection has not completed; scheduled collection skipped
2023-03-21T09:45:00Z W! [inputs.mysql] Collection took longer than expected; not complete after interval of 5m0s
As soon as the problem starts, restarting telegraf takes much longer.
Problem exists
Normal restart
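The restart timings above were presumably captured along these lines (a sketch; the exact invocation is not part of the report):

# With the hang present, this approaches systemd's ~1m30s stop timeout
time systemctl restart telegraf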
After downgrading to 1.25.3 again, the problem is gone.
Best regards, Patrick