influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

After shutting down the server and starting it again, Telegraf run via systemd cannot read `/etc/telegraf/telegraf.d` unless the root user is used #14130

Closed 4strodev closed 1 year ago

4strodev commented 1 year ago

Relevant telegraf.conf

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 1000

  ## Maximum number of unwritten metrics per output.  Increasing this value
  ## allows for longer periods of output downtime without dropping metrics at the
  ## cost of higher maximum memory usage.
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s.
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  precision = ""

  ## Log at debug level.
  # debug = false
  ## Log only error level messages.
  # quiet = false

  ## Log target controls the destination for logs and can be one of "file",
  ## "stderr" or, on Windows, "eventlog".  When set to "file", the output file
  ## is determined by the "logfile" setting.
  # logtarget = "file"

  ## Name of the file to be logged to when using the "file" logtarget.  If set to
  ## the empty string then logs are written to stderr.
  # logfile = ""

  ## The logfile will be rotated after the time interval specified.  When set
  ## to 0 no time based rotation is performed.  Logs are rotated only when
  ## written to, if there is no log activity rotation may be delayed.
  # logfile_rotation_interval = "0d"

  ## The logfile will be rotated when it becomes larger than the specified
  ## size.  When set to 0 no size based rotation is performed.
  # logfile_rotation_max_size = "0MB"

  ## Maximum number of rotated archives to keep, any older logs are deleted.
  ## If set to -1, no archives are removed.
  # logfile_rotation_max_archives = 5

  ## Pick a timezone to use when logging or type 'local' for local time.
  ## Example: America/Chicago
  # log_with_timezone = ""

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = false
[[outputs.influxdb_v2]]
  ## The URLs of the InfluxDB cluster nodes.
  ##
  ## Multiple URLs can be specified for a single cluster, only ONE of the
  ## urls will be written to each interval.
  ##   ex: urls = ["https://us-west-2-1.aws.cloud2.influxdata.com"]
  urls = ["http://192.168.56.102:8086"]

  ## Token for authentication.
  token = "<Hardcoded token>"

  ## Organization is the name of the organization you wish to write to; must exist.
  organization = "zertifier"

  ## Destination bucket to write into.
  bucket = "main"

  ## The value of this tag will be used to determine the bucket.  If this
  ## tag is not set the 'bucket' option is used as the default.
  # bucket_tag = ""

  ## If true, the bucket tag will not be added to the metric.
  # exclude_bucket_tag = false

  ## Timeout for HTTP messages.
  # timeout = "5s"

  ## Additional HTTP headers
  # http_headers = {"X-Special-Header" = "Special-Value"}

  ## HTTP Proxy override. If unset, the standard proxy environment
  ## variables are consulted to determine which proxy, if any, should be used.
  # http_proxy = "http://corporate.proxy:3128"

  ## HTTP User-Agent
  # user_agent = "telegraf"

  ## Content-Encoding for write request body, can be set to "gzip" to
  ## compress body or "identity" to apply no encoding.
  # content_encoding = "gzip"

  ## Enable or disable uint support for writing uints to InfluxDB 2.0.
  # influx_uint_support = false

  ## Optional TLS Config for use on HTTP connections.
  # tls_ca = "/etc/telegraf/ca.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false
[[inputs.syslog]]
  ## Protocol, address and port to host the syslog receiver.
  ## If no host is specified, then localhost is used.
  ## If no port is specified, 6514 is used (RFC5425#section-4.1).
  ##   ex: server = "tcp://localhost:6514"
  ##       server = "udp://:6514"
  ##       server = "unix:///var/run/telegraf-syslog.sock"
  server = "tcp://:6514"

  ## TLS Config
  # tls_allowed_cacerts = ["/etc/telegraf/ca.pem"]
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"

  ## Period between keep alive probes.
  ## 0 disables keep alive probes.
  ## Defaults to the OS configuration.
  ## Only applies to stream sockets (e.g. TCP).
  # keep_alive_period = "5m"

  ## Maximum number of concurrent connections (default = 0).
  ## 0 means unlimited.
  ## Only applies to stream sockets (e.g. TCP).
  # max_connections = 1024

  ## Read timeout is the maximum time allowed for reading a single message (default = 5s).
  ## 0 means unlimited.
  # read_timeout = "5s"

  ## The framing technique with which it is expected that messages are transported (default = "octet-counting").
  ## Whether the messages come using the octet-counting (RFC5425#section-4.3.1, RFC6587#section-3.4.1),
  ## or the non-transparent framing technique (RFC6587#section-3.4.2).
  ## Must be one of "octet-counting", "non-transparent".
  # framing = "octet-counting"

  ## The trailer to be expected in case of non-transparent framing (default = "LF").
  ## Must be one of "LF", or "NUL".
  # trailer = "LF"

  ## Whether to parse in best effort mode or not (default = false).
  ## By default best effort parsing is off.
  # best_effort = false

  ## The RFC standard to use for message parsing
  ## By default RFC5424 is used. RFC3164 only supports UDP transport (no streaming support)
  ## Must be one of "RFC5424", or "RFC3164".
  # syslog_standard = "RFC5424"

  ## Character to prepend to SD-PARAMs (default = "_").
  ## A syslog message can contain multiple parameters and multiple identifiers within structured data section.
  ## Eg., [id1 name1="val1" name2="val2"][id2 name1="val1" nameA="valA"]
  ## For each combination a field is created.
  ## Its name is created concatenating identifier, sdparam_separator, and parameter name.
  # sdparam_separator = "_"

Logs from Telegraf

Oct 17 12:32:50 localhost.localdomain systemd[1]: telegraf.service: Scheduled restart job, restart counter is at 5.
Oct 17 12:32:50 localhost.localdomain systemd[1]: Stopped Telegraf.
Oct 17 12:32:50 localhost.localdomain systemd[1]: telegraf.service: Start request repeated too quickly.
Oct 17 12:32:50 localhost.localdomain systemd[1]: telegraf.service: Failed with result 'exit-code'.
Oct 17 12:32:50 localhost.localdomain systemd[1]: Failed to start Telegraf.
Oct 17 12:38:50 localhost.localdomain systemd[1]: Starting Telegraf...
Oct 17 12:38:50 localhost.localdomain telegraf[1550]: 2023-10-17T10:38:50Z W! Telegraf is not permitted to read /etc/telegraf/telegraf.d
Oct 17 12:38:50 localhost.localdomain telegraf[1550]: 2023-10-17T10:38:50Z I! Loading config: /etc/telegraf/telegraf.conf

System info

1.28.2, Rocky Linux 9.2

Docker

No response

Steps to reproduce

  1. Install Telegraf following the documentation.
  2. Set up an output to InfluxDB.
  3. Set up inputs for syslog following this link.
  4. Set up a Grafana dashboard.
  5. Once everything is working, shut down the machine and start it again.
  6. Now Telegraf cannot read the /etc/telegraf/telegraf.d directory unless root is set as the user that runs the program in the systemd service file.

Expected behavior

The expected behavior is that Telegraf can read the directory, as it could right after it was installed.

Actual behavior

After I shut down the machine and start it again, Telegraf can only read the config directory if it runs as root.

Additional info

I tried granting all permissions on this directory and its subdirectories, but the same thing happens. When I run su telegraf, the system does not log me in as that user. The directory can only be read when the root user is used (I did not try other users, only root). Note that in this case I hard-coded the InfluxDB token in the config file.

powersj commented 1 year ago

> telegraf only can read config directory if root is used.

You also did not provide the permissions you have on these files and folders. By default, the Telegraf service runs as the telegraf user.
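One way to collect the permission details asked for here is to print the mode and ownership of the config directory and of every parent directory, since a non-root user needs search permission on each directory in the chain. The sketch below is illustrative only: `show_path_perms` and `CONFIG_DIR` are hypothetical names, and it assumes GNU `stat` (as shipped with Rocky Linux).

```shell
#!/bin/sh
# Print mode, octal mode, owner:group, and name for a path and each parent
# directory up to "/". Assumes GNU stat (the -c format option).
show_path_perms() {
    path=$1
    while :; do
        stat -c '%A %a %U:%G %n' "$path"
        # Stop at the filesystem root, or at "." for relative paths.
        case $path in
            /|.) break ;;
        esac
        path=$(dirname "$path")
    done
}

# CONFIG_DIR defaults to the directory named in the warning from the logs.
CONFIG_DIR=${CONFIG_DIR:-/etc/telegraf/telegraf.d}
if [ -e "$CONFIG_DIR" ]; then
    show_path_perms "$CONFIG_DIR"
else
    echo "$CONFIG_DIR does not exist on this machine"
fi
```

Pasting this output into the issue shows at a glance whether the directory or one of its parents is missing the read or search bit for users other than root.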

> I tried to put all permissions on this directory and subdirectories but the same happens

You should not have to change anything. I am unable to reproduce this. Here is my install of Telegraf:

[root@r9 ~]# cat /etc/os-release 
NAME="Rocky Linux"
VERSION="9.2 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.2"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Rocky Linux 9.2 (Blue Onyx)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:9::baseos"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2032-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"
ROCKY_SUPPORT_PRODUCT_VERSION="9.2"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.2"
[root@r9 ~]# cat <<EOF | sudo tee /etc/yum.repos.d/influxdata.repo
[influxdata]
name = InfluxData Repository - Stable
baseurl = https://repos.influxdata.com/stable/\$basearch/main
enabled = 1
gpgcheck = 1
gpgkey = https://repos.influxdata.com/influxdata-archive_compat.key
EOF
[influxdata]
name = InfluxData Repository - Stable
baseurl = https://repos.influxdata.com/stable/$basearch/main
enabled = 1
gpgcheck = 1
gpgkey = https://repos.influxdata.com/influxdata-archive_compat.key
[root@r9 ~]# sudo yum install telegraf
InfluxData Repository - Stable                                                  158 kB/s |  54 kB     00:00    
Rocky Linux 9 - BaseOS                                                          2.3 MB/s | 1.9 MB     00:00    
Rocky Linux 9 - AppStream                                                       2.1 MB/s | 7.1 MB     00:03    
Rocky Linux 9 - Extras                                                           19 kB/s |  11 kB     00:00    
Dependencies resolved.
================================================================================================================
 Package                   Architecture            Version                    Repository                   Size
================================================================================================================
Installing:
 telegraf                  x86_64                  1.28.2-1                   influxdata                   52 M

Transaction Summary
================================================================================================================
Install  1 Package

Total download size: 52 M
Installed size: 194 M
Is this ok [y/N]: y
Downloading Packages:
telegraf-1.28.2-1.x86_64.rpm                                                     54 MB/s |  52 MB     00:00    
----------------------------------------------------------------------------------------------------------------
Total                                                                            54 MB/s |  52 MB     00:00     
InfluxData Repository - Stable                                                   16 kB/s | 1.6 kB     00:00    
Importing GPG key 0x7DF8B07E:
 Userid     : "InfluxData Package Signing Key <support@influxdata.com>"
 Fingerprint: 9D53 9D90 D332 8DC7 D6C8 D3B9 D8FF 8E1F 7DF8 B07E
 From       : https://repos.influxdata.com/influxdata-archive_compat.key
Is this ok [y/N]: y
Key imported successfully
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                                        1/1 
  Running scriptlet: telegraf-1.28.2-1.x86_64                                                               1/1 
  Installing       : telegraf-1.28.2-1.x86_64                                                               1/1 
  Running scriptlet: telegraf-1.28.2-1.x86_64                                                               1/1 
Created symlink /etc/systemd/system/multi-user.target.wants/telegraf.service → /usr/lib/systemd/system/telegraf.service.

  Verifying        : telegraf-1.28.2-1.x86_64                                                               1/1 

Installed:
  telegraf-1.28.2-1.x86_64                                                                                      

Complete!
[root@r9 ~]# ls -la /etc/telegraf/
total 492
drwxr-xr-x 1 root root     46 Oct 17 12:53 .
drwxr-xr-x 1 root root   2302 Oct 17 12:53 ..
-rw-r--r-- 1 root root 497246 Oct  2 19:03 telegraf.conf
drwxr-xr-x 1 root root     14 Oct 17 12:53 telegraf.d

Set up the config:

[root@r9 ~]# echo -e "[[inputs.mem]]\n[[outputs.file]]" > /etc/telegraf/telegraf.conf
[root@r9 ~]# cat /etc/telegraf/telegraf.conf 
[[inputs.mem]]
[[outputs.file]]
[root@r9 ~]# echo "[[inputs.cpu]]" > /etc/telegraf/telegraf.d/inputs.conf
[root@r9 ~]# cat /etc/telegraf/telegraf.d/inputs.conf 
[[inputs.cpu]]

Start the service:

[root@r9 ~]# systemctl start telegraf
[root@r9 ~]# journalctl --no-pager --unit telegraf
Oct 17 12:55:41 r9 systemd[1]: Starting Telegraf...
Oct 17 12:55:42 r9 telegraf[612]: 2023-10-17T12:55:42Z I! Loading config: /etc/telegraf/telegraf.conf
Oct 17 12:55:42 r9 telegraf[612]: 2023-10-17T12:55:42Z I! Loading config: /etc/telegraf/telegraf.d/inputs.conf
Oct 17 12:55:42 r9 telegraf[612]: 2023-10-17T12:55:42Z I! Starting Telegraf 1.28.2 brought to you by InfluxData the makers of InfluxDB
Oct 17 12:55:42 r9 telegraf[612]: 2023-10-17T12:55:42Z I! Available plugins: 240 inputs, 9 aggregators, 29 processors, 24 parsers, 59 outputs, 5 secret-stores
Oct 17 12:55:42 r9 telegraf[612]: 2023-10-17T12:55:42Z I! Loaded inputs: cpu mem
Oct 17 12:55:42 r9 telegraf[612]: 2023-10-17T12:55:42Z I! Loaded aggregators:
Oct 17 12:55:42 r9 telegraf[612]: 2023-10-17T12:55:42Z I! Loaded processors:
Oct 17 12:55:42 r9 telegraf[612]: 2023-10-17T12:55:42Z I! Loaded secretstores:
Oct 17 12:55:42 r9 telegraf[612]: 2023-10-17T12:55:42Z I! Loaded outputs: file
Oct 17 12:55:42 r9 telegraf[612]: 2023-10-17T12:55:42Z I! Tags enabled: host=r9
Oct 17 12:55:42 r9 telegraf[612]: 2023-10-17T12:55:42Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"r9", Flush Interval:10s
Oct 17 12:55:42 r9 systemd[1]: Started Telegraf.
Oct 17 12:55:52 r9 telegraf[612]: mem,host=r9 write_back=0i,buffered=0i,commit_limit=37962133504i,page_tables=47833088i,shared=8503296i,swap_free=0i,vmalloc_chunk=0i,vmalloc_total=35184372087808i,huge_page_size=2097152i,low_free=0i,low_total=0i,sreclaimable=0i,available=67260375040i,free=66888650752i,slab=0i,write_back_tmp=0i,committed_as=16049537024i,used=73965568i,dirty=0i,huge_pages_free=0i,sunreclaim=0i,swap_cached=0i,vmalloc_used=190316544i,used_percent=0.1098482101883272,active=157577216i,cached=371724288i,high_free=0i,huge_pages_total=0i,total=67334340608i,inactive=270213120i,swap_total=0i,available_percent=99.89015178981167,high_total=0i,mapped=0i 1697547350000000000
Oct 17 12:56:02 r9 telegraf[612]: mem,host=r9 used_percent=0.10914257321956843,available_percent=99.89085742678043,cached=371724288i,huge_pages_total=0i,sunreclaim=0i,swap_free=0i,swap_total=0i,write_back_tmp=0i,active=157663232i,buffered=0i,write_back=0i,used=73490432i,free=66889125888i,shared=8503296i,slab=0i,low_total=0i,vmalloc_used=190332928i,total=67334340608i,inactive=270213120i,page_tables=47906816i,available=67260850176i,huge_pages_free=0i,vmalloc_chunk=0i,vmalloc_total=35184372087808i,committed_as=16085991424i,dirty=0i,mapped=0i,sreclaimable=0i,commit_limit=37962133504i,high_free=0i,high_total=0i,huge_page_size=2097152i,low_free=0i,swap_cached=0i 1697547360000000000
Oct 17 12:56:02 r9 telegraf[612]: cpu,cpu=cpu0,host=r9 usage_idle=99.79979979979998,usage_nice=0,usage_iowait=0,usage_guest_nice=0,usage_user=0.10010010010010677,usage_system=0,usage_irq=0,usage_softirq=0.10010010010010009,usage_steal=0,usage_guest=0 1697547360000000000

Then do a restart:

[root@r9 ~]# uptime
 12:59:32 up 0 min,  0 users,  load average: 3.29, 3.18, 2.53
[root@r9 ~]# systemctl start telegraf
[root@r9 ~]# systemctl status telegraf
● telegraf.service - Telegraf
     Loaded: loaded (/usr/lib/systemd/system/telegraf.service; enabled; preset: disabled)
    Drop-In: /run/systemd/system/service.d
             └─zzz-lxc-service.conf
     Active: active (running) since Tue 2023-10-17 12:59:22 UTC; 26s ago
       Docs: https://github.com/influxdata/telegraf
   Main PID: 165 (telegraf)
      Tasks: 17 (limit: 410848)
     Memory: 36.6M
        CPU: 58ms
     CGroup: /system.slice/telegraf.service
             └─165 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d

Oct 17 12:59:42 r9 telegraf[165]: cpu,cpu=cpu23,host=r9 usage_idle=100,usage_guest=0,usage_guest_nice=0,usage_user=0,usage_system=0,usage_nice=0,usage_iowait=0,usage_irq=0,usage_softirq=0,usage_steal=0 1697547580000000000
Oct 17 12:59:42 r9 telegraf[165]: cpu,cpu=cpu24,host=r9 usage_nice=0,usage_iowait=0,usage_softirq=0,usage_guest=0,usage_user=0,usage_idle=99.89979959919849,usage_irq=0.1002004008016054,usage_steal=0,usage_guest_nice=0,usage_system=0 1697547580000000000

4strodev commented 1 year ago

> You did not provide what permissions you have on these files and folders as well. By default, the Telegraf service runs as the telegraf user.

It doesn't matter; I even tried setting full permissions (777) and got the same error message. I was trying to set up a syslog monitor, and I set up two VMs using VirtualBox. The problem appeared after I shut down the VMs: when I later came back to test the monitor, Telegraf did not work.

Also, after trying to change permissions, daemon settings, etc., I tried to log in as telegraf using the su command, and it was not possible. Maybe the shutdown somehow corrupted the user data or something else.

For the moment I changed the user to root and it works. But why did this happen?

powersj commented 1 year ago

> I tried to log in as telegraf using su command, and it was not possible.

That is because the telegraf user is not a normal user. It is a system user.
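As a rough illustration of why `su telegraf` fails for a system user, the sketch below looks up the account's login shell in the passwd database. It assumes that system accounts are usually created with a non-login shell such as `/sbin/nologin` or `/bin/false`; the thread does not state which one this package uses. `is_login_shell`, `login_shell`, and `SERVICE_USER` are hypothetical names.

```shell
#!/bin/sh
# Return success unless the shell path is one of the common non-login shells.
is_login_shell() {
    case $1 in
        */nologin|*/false) return 1 ;;
        *) return 0 ;;
    esac
}

# Print field 7 (the login shell) of the passwd entry for an account.
login_shell() {
    getent passwd "$1" | cut -d: -f7
}

SERVICE_USER=${SERVICE_USER:-telegraf}
if ! getent passwd "$SERVICE_USER" >/dev/null; then
    echo "account $SERVICE_USER not found"
elif is_login_shell "$(login_shell "$SERVICE_USER")"; then
    echo "$SERVICE_USER has a login shell: $(login_shell "$SERVICE_USER")"
else
    echo "$SERVICE_USER has a non-login shell, so a plain 'su $SERVICE_USER' is expected to fail"
fi

# To test directory access as that user anyway (run as root):
#   sudo -u telegraf ls -la /etc/telegraf/telegraf.d
#   su -s /bin/sh -c 'ls -la /etc/telegraf/telegraf.d' telegraf
```

The commented commands run a single command as the service user without needing an interactive login, which gives a direct check of whether that user can list the directory.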

> For the moment I changed the user to root and it works

Running as root is certainly less than ideal.
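To confirm which account the service is configured to run as, and to spot a leftover `User=root` edit, the `User=` setting can be read from the unit file. This is a minimal sketch: `unit_user` and `UNIT_FILE` are hypothetical names, and the default path is the one shown in the `systemctl status` output earlier in this thread.

```shell
#!/bin/sh
# Print the value of the last User= assignment in a systemd unit file
# (for single-value settings, a later assignment overrides an earlier one).
unit_user() {
    sed -n 's/^[[:space:]]*User[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1
}

UNIT_FILE=${UNIT_FILE:-/usr/lib/systemd/system/telegraf.service}
if [ -r "$UNIT_FILE" ]; then
    echo "User in $UNIT_FILE: $(unit_user "$UNIT_FILE")"
else
    echo "$UNIT_FILE is not readable on this machine"
fi

# On a live system, this reports the value systemd actually applies,
# including any drop-in overrides:
#   systemctl show -p User telegraf
```

If the value is root, restoring the packaged setting and then running `systemctl daemon-reload` returns the service to the unprivileged account.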

You have not provided any details about what is wrong with the install or how I can reproduce it. I am going to close this, but if you can demonstrate a way to get into your situation and show a real issue, please re-open and provide those details.

Thanks