manala / ansible-roles

Manala ansible roles
https://galaxy.ansible.com/manala/
MIT License

[manala.telegraf] service unable to start during initial provisioning #650

Open lisuml opened 1 year ago

lisuml commented 1 year ago

manala.roles version: 3.2.0

During initial provisioning of a node with the manala.telegraf role attached, the service fails to start:

TASK [manala.roles.telegraf : Configs > Templates present] ****************************************************************************************************************************************************************************************************
changed: [d-test.euc1.XXX.lan] => (item={'state': 'present', 'template': 'configs/_default.j2', 'file': '/etc/telegraf/telegraf.d/os.conf', 'config': '[[inputs.cpu]]\n  totalcpu = true\n[[inputs.disk]]\n  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]\n[[inputs.diskio]]\n[[inputs.kernel]]\n[[inputs.mem]]\n[[inputs.net]]\n[[inputs.netstat]]\n[[inputs.processes]]\n[[inputs.system]]\n'})
changed: [d-test.euc1.XXX.lan] => (item={'state': 'present', 'template': 'configs/_default.j2', 'file': '/etc/telegraf/telegraf.d/output.conf', 'config': '[[outputs.influxdb]]\n  urls = [ "udp://metrix.euc1.XXX.lan:8089" ]\n  udp_payload = "1024B"\n'})

TASK [manala.roles.telegraf : Configs > Files absent] *********************************************************************************************************************************************************************************************************

TASK [manala.roles.telegraf : Services > Services] ************************************************************************************************************************************************************************************************************
failed: [d-test.euc1.XXX.lan] (item=telegraf) => {"ansible_loop_var": "item", "changed": false, "item": "telegraf", "msg": "Unable to start service telegraf: Job for telegraf.service failed because the control process exited with error code.\nSee \"systemctl status telegraf.service\" and \"journalctl -xe\" for details.\n"}

As you can see, the configs are defined properly, but they do not seem to be ready when the service starts. The error I see in the systemd journal:

Jan 17 13:40:15 d-test.euc1.XXX.lan telegraf[8968]: 2023-01-17T13:40:15Z E! [telegraf] Error running agent: no outputs found, did you provide a valid config file?
Jan 17 13:40:15 d-test.euc1.XXX.lan systemd[1]: telegraf.service: Main process exited, code=exited, status=1/FAILURE

On the second provisioning attempt, the error is gone and the service starts normally.

lisuml commented 1 year ago

After more investigation, it seems the issue only occurs with telegraf 1.25.0 (the most recent version at the moment).

The cause is that the official Debian packages provided by InfluxData automatically try to start the telegraf systemd service at installation time. A working outputs configuration is expected to be part of the config file at that point, but it is not there yet.

This looks like a bug in telegraf itself and/or in the official telegraf Debian packages. I'm going to file a GitHub issue on the official telegraf repository.
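To check whether the unit is already in a failed state right after the package is installed (before the role writes the extra configs), a diagnostic along these lines could be run against a fresh node. This is only a sketch: the task names and the `telegraf_state` variable are made up for illustration, and it assumes the package has just been installed and that the host uses systemd.

```yaml
# Diagnostic sketch, not part of the manala.telegraf role.
- name: Query the telegraf unit state reported by systemd
  ansible.builtin.command: systemctl is-active telegraf
  register: telegraf_state   # hypothetical variable name
  changed_when: false        # read-only check
  failed_when: false         # is-active returns non-zero when the unit is not active

- name: Show the reported state
  ansible.builtin.debug:
    msg: "telegraf unit state: {{ telegraf_state.stdout }}"
```

If this reports `failed` immediately after installation, that would be consistent with the package starting the service before any outputs are configured.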

For me, the workaround was simply to pin a lower telegraf version in the Ansible playbook:

    manala_telegraf_install_packages_default:
      - telegraf=1.24.4-1

nervo commented 1 year ago

@lisuml we ran into the same issue on v1.25.0 and fixed our tests like this: https://github.com/manala/ansible-roles/pull/642

Would you provide all the values passed to the role?

btw, use manala_telegraf_install_packages instead of manala_telegraf_install_packages_default :)

lisuml commented 1 year ago

@nervo: thanks for the followup!

Would you provide all the values passed to the role?

These are my ansible variables:

    manala_apt_preferences:
      - influxdb@influxdata
    manala_telegraf_install_packages:
      - telegraf=1.24.4-1
    manala_telegraf_config_template: config/telegraf/base/telegraf.conf.j2
    manala_telegraf_config:
      global_tags:
        environment: "{{ env }}"
    manala_telegraf_configs:
      - file: os.conf
        config: |
          [[inputs.cpu]]
            totalcpu = true
          [[inputs.disk]]
            ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
          [[inputs.diskio]]
          [[inputs.kernel]]
          [[inputs.mem]]
          [[inputs.net]]
          [[inputs.netstat]]
          [[inputs.processes]]
          [[inputs.system]]
      - file: output.conf
        config: |
          [[outputs.influxdb]]
            urls = [ "udp://metrix.euc1.XXX.lan:8089" ]
            udp_payload = "1024B"

use manala_telegraf_install_packages instead of manala_telegraf_install_packages_default

Roger that.

FYI: I created an issue in the telegraf GitHub repo: https://github.com/influxdata/telegraf/issues/12514

nervo commented 1 year ago

Ok, so let's wait for the next telegraf version :)

(btw, you should also use an explicit telegraf apt preference)

    manala_apt_preferences:
      - telegraf@influxdata

lisuml commented 1 year ago

(btw, you should also use an explicit telegraf apt preference)

My bad. Thanks for pointing this out!
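For anyone landing here, a sketch of the relevant variables with both suggestions applied. The version pin is the workaround from earlier in this thread. Whether `telegraf@influxdata` replaces `influxdb@influxdata` or is listed next to it is an assumption on my side and depends on what else is installed from that repository.

```yaml
manala_apt_preferences:
  # Explicit telegraf preference, as suggested above.
  # Assumption: keep an influxdb@influxdata entry as well if influxdb itself
  # is installed from the same repository.
  - telegraf@influxdata

manala_telegraf_install_packages:
  # Pinned below 1.25.0 until the upstream issue is resolved.
  - telegraf=1.24.4-1
```

The remaining variables (manala_telegraf_config, manala_telegraf_configs) stay as posted earlier.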