influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Telegraf does not start automatically after update #12610

Closed. KevinAAl closed this issue 1 year ago

KevinAAl commented 1 year ago

Relevant telegraf.conf

root@xxxxxxxxxx [xxxxxxx]:~ # cat /etc/telegraf/telegraf.conf
[tags]

# Configuration for telegraf agent
[agent]
  debug = false
  flush_buffer_when_full = true
  flush_interval = "15s"
  flush_jitter = "0s"
  hostname = "xxxxxxxxxxxxx"
  interval = "15s"
  round_interval = true

Logs from Telegraf

Feb  3 06:04:34 xxxxx influxd-systemd-start.sh[2481367]: [httpd] 127.0.0.1 - telegraf [03/Feb/2023:06:04:34 +0100] "POST /write?db=telegraf HTTP/1.1 " 204 0 "-" "Telegraf/1.25.0 Go/1.19.4" 4044a6a5-a380-11ed-a935-0050569fd1af 4593
Feb  3 06:04:42 xxxxx systemd[1]: Stopping Telegraf...
Feb  3 06:04:42 xxxxx telegraf[272975]: 2023-02-03T05:04:42Z I! [agent] Hang on, flushing any cached metrics before shutdown
Feb  3 06:04:42 xxxxx telegraf[272975]: 2023-02-03T05:04:42Z I! [agent] Stopping running outputs
Feb  3 06:04:42 xxxxx systemd[1]: telegraf.service: Succeeded.
Feb  3 06:04:42 xxxxx systemd[1]: Stopped Telegraf.
Feb  3 06:04:42 xxxxx systemd[1]: telegraf.service: Consumed 1h 38min 16.933s CPU time.
Feb  3 06:04:43 xxxxx systemd[1]: Reloading.
Feb  3 06:04:43 xxxxx systemd[1]: Reloading.
Feb  3 06:04:44 xxxxx systemd[1]: apt-daily-upgrade.service: Succeeded.
Feb  3 06:04:44 xxxxx systemd[1]: Finished Daily apt upgrade and clean activities.
Feb  3 06:04:44 xxxxx systemd[1]: apt-daily-upgrade.service: Consumed 48.773s CPU time.

System info

Debian 11.6 with SystemD / Telegraf 1.25.1 (git: HEAD@e1a0d74e)

Docker

I don't use Docker.

Steps to reproduce

  1. Install Telegraf 1.25.0-1 (telegraf package) from the repos.influxdata.com repository
  2. Start the Telegraf service
  3. Update Telegraf from 1.25.0-1 to 1.25.1-1 from the same repository, either manually or with unattended-upgrade

Expected behavior

Telegraf restarts automatically after update

Actual behavior

Telegraf does not start automatically after update

Additional info

As a workaround, I only have to start Telegraf manually after the update:

service telegraf start
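
A guarded version of this manual start could be scripted across several servers. The following is a minimal sketch only, not something proposed by the Telegraf maintainers: the function name ensure_started and the SYSTEMCTL override variable are hypothetical, and it assumes a systemd host where the caller has permission to start units.

```shell
#!/bin/sh
# Hypothetical helper: start a unit only when it is enabled but not running.
# SYSTEMCTL is an assumed override hook so the logic can be dry-run with a stub.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

ensure_started() {
    unit="$1"
    # Respect the administrator's choice: never start a disabled unit.
    if ! "$SYSTEMCTL" is-enabled --quiet "$unit"; then
        echo "skipped: $unit is not enabled"
        return 0
    fi
    # Nothing to do if the unit survived the upgrade.
    if "$SYSTEMCTL" is-active --quiet "$unit"; then
        echo "ok: $unit already running"
        return 0
    fi
    # Enabled but inactive: this is the state left behind by the upgrade.
    "$SYSTEMCTL" start "$unit" && echo "started: $unit"
}
```

After an upgrade, it would be called as: ensure_started telegraf.service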

No error after start:

root@xxxxxxxxxxx [xxxxxxxxxxx]:~ # service telegraf status
● telegraf.service - Telegraf
     Loaded: loaded (/lib/systemd/system/telegraf.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2023-02-02 14:16:13 CET; 19h ago
       Docs: https://github.com/influxdata/telegraf
   Main PID: 7906 (telegraf)
      Tasks: 24 (limit: 76965)
     Memory: 33.2M
        CPU: 1min 32.241s
     CGroup: /system.slice/telegraf.service
             └─7906 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d

févr. 02 14:16:13 xxxxxx telegraf[7906]: 2023-02-02T13:16:13Z I! Available plugins: 228 inputs, 9 aggregators, 26 processors, 21 parsers, 57 outputs, 2 secret-stores
févr. 02 14:16:13 xxxxxx telegraf[7906]: 2023-02-02T13:16:13Z I! Loaded inputs: cpu disk diskio mem net netstat postfix sensors swap system
févr. 02 14:16:13 xxxxxx telegraf[7906]: 2023-02-02T13:16:13Z I! Loaded aggregators:
févr. 02 14:16:13 xxxxxx telegraf[7906]: 2023-02-02T13:16:13Z I! Loaded processors:
févr. 02 14:16:13 xxxxxx telegraf[7906]: 2023-02-02T13:16:13Z I! Loaded secretstores:
févr. 02 14:16:13 xxxxxx telegraf[7906]: 2023-02-02T13:16:13Z I! Loaded outputs: influxdb
févr. 02 14:16:13 xxxxxx telegraf[7906]: 2023-02-02T13:16:13Z I! Tags enabled: host=xxxxxxxxxxxxxxxxx
févr. 02 14:16:13 xxxxxx telegraf[7906]: 2023-02-02T13:16:13Z W! Deprecated outputs: 0 and 1 options
févr. 02 14:16:13 xxxxxx telegraf[7906]: 2023-02-02T13:16:13Z I! [agent] Config: Interval:15s, Quiet:false, Hostname:"xxxxxxxxxxxxxxxx", Flush Interval:15s
févr. 02 14:16:13 xxxxxx systemd[1]: Started Telegraf.

Log from unattended-upgrade (/var/log/apt/history.log):

Start-Date: 2023-02-03 06:04:42
Commandline: /usr/bin/unattended-upgrade
Upgrade: telegraf:amd64 (1.25.0-1, 1.25.1-1)
End-Date: 2023-02-03 06:04:44

I think the problem comes from a change to the postinst script in the telegraf package. Previously, the postinst restarted Telegraf as follows:

if [[ "$(readlink /proc/1/exe)" == */systemd ]]; then
    install_systemd /lib/systemd/system/telegraf.service
    deb-systemd-invoke restart telegraf.service || echo "WARNING: systemd not running."
else
    # Assuming SysVinit
    install_init
    # Run update-rc.d or fallback to chkconfig if not available
    if which update-rc.d &>/dev/null; then
        install_update_rcd
    else
        install_chkconfig
    fi
    invoke-rc.d telegraf restart
fi

In the latest package, the script is as follows:

if [ -d /run/systemd/system ]; then
    install_systemd /lib/systemd/system/telegraf.service
    # if and only if the service was already running then restart
    deb-systemd-invoke try-restart telegraf.service >/dev/null || true
else
    # Assuming SysVinit
    install_init
    # Run update-rc.d or fallback to chkconfig if not available
    if which update-rc.d &>/dev/null; then
        install_update_rcd
    else
        install_chkconfig
    fi
    invoke-rc.d telegraf restart
fi

The command deb-systemd-invoke restart telegraf.service restarts Telegraf, or starts it if it is stopped. The command deb-systemd-invoke try-restart telegraf.service only restarts Telegraf if it is already running; it does not start a stopped service. Because Telegraf is stopped automatically at the start of the update, it is never started again.
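
The difference between the two verbs can be modeled without touching systemd. This is a toy sketch, not the packaging code: the state variable and the sim_* and simulate_upgrade functions are hypothetical names that only imitate the behavior described above (stop first, then either restart or try-restart).

```shell
#!/bin/sh
# Toy model only; no real systemd or deb-systemd-invoke calls are made.
# state holds the simulated service state: "running" or "stopped".
state="running"

sim_stop() { state="stopped"; }

# "restart": the service ends up running whether or not it was running before.
sim_restart() { state="running"; }

# "try-restart": only a running service is restarted; a stopped one stays stopped.
sim_try_restart() {
    if [ "$state" = "running" ]; then
        state="running"
    fi
}

# Model of the upgrade in this report: the service is stopped first,
# then the postinst applies the given verb. Prints the final state.
simulate_upgrade() {
    state="running"
    sim_stop
    case "$1" in
        restart)     sim_restart ;;
        try_restart) sim_try_restart ;;
        *) echo "unknown verb: $1" >&2; return 1 ;;
    esac
    echo "$state"
}
```

Calling simulate_upgrade restart prints running, while simulate_upgrade try_restart prints stopped, which matches the behavior seen after the 1.25.0-1 to 1.25.1-1 upgrade.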

Currently we have to start Telegraf manually on all of our servers after each update, and we lose metrics in the meantime. Can you restore the original behavior?

Thanks

powersj commented 1 year ago

Telegraf does not start automatically after update

First, some background: the Telegraf deb no longer starts the service on initial install. The provided config does not work out of the box and was never valid. It was odd to start the service automatically when nothing would work, only for systemd to report a failed unit before the user had had an opportunity to configure it. The change also aligns the behavior with the rpm packaging, which does not automatically start the service either.

That change, made in v1.25.0, meant that when you upgraded from v1.25.0 (and only this version), the service would be stopped and not started again. As noted, this was the wrong behavior. A second change, which no longer stops the service on upgrade and only restarts it if it was already running, has gone in and will apply to upgrades from v1.25.1 to the next version. You can verify this by upgrading a system running v1.25.1 to a nightly build.

fatalbyte commented 1 year ago

I'm also having this same issue since upgrading from 1.24.3-1 to 1.25.1-1.

telegraf-tiger[bot] commented 1 year ago

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Page. Thank you!