influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

CPU load due to Telegraf agent #743

Closed · Dual-Boot closed this issue 8 years ago

Dual-Boot commented 8 years ago

Hello,

I use Telegraf as the collector agent on my servers, and since the latest stable version I have been seeing strange behaviour: CPU load spikes caused by the Telegraf agent.

Regards,

sparrc commented 8 years ago

Some more details would be helpful:

  1. Config file
  2. Telegraf version
  3. Operating system(s)
  4. Is it different on previous versions?
Dual-Boot commented 8 years ago

Hi,

Sorry, I should have specified those. 1 - Config file:

# Telegraf configuration

# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.

# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.

# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.

# Global tags can be specified here in key="value" format.
[tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"

# Configuration for telegraf agent
[agent]
  # Default data collection interval for all plugins
  interval = "10s"
  # Rounds collection interval to 'interval'
  # ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  # Default data flushing interval for all outputs. You should not set this below
  # interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "10s"
  # Jitter the flush interval by a random amount. This is primarily to avoid
  # large write spikes for users running a large number of telegraf instances.
  # ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  # Run telegraf in debug mode
  debug = false
  # Override default hostname, if empty use os.Hostname()
  hostname = ""

###############################################################################
#                                  OUTPUTS                                    #
###############################################################################

# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  # The full HTTP or UDP endpoint URL for your InfluxDB instance.
  # Multiple urls can be specified but it is assumed that they are part of the same
  # cluster, this means that only ONE of the urls will be written to each interval.
  # urls = ["udp://localhost:8089"] # UDP endpoint example
  #urls = ["http://localhost:8086"] # required
  urls = ["http://srv-influxdb01.mydomain.local:8086"] # required
  # The target database for metrics (telegraf will create it if not exists)
  #database = "telegraf" # required
  database = "db_telegraf01" # required
  # Precision of writes, valid values are n, u, ms, s, m, and h
  # note: using second precision greatly helps InfluxDB compression
  precision = "s"

  # Connection timeout (for the connection with InfluxDB), formatted as a string.
  # If not provided, will default to 0 (no timeout)
  # timeout = "5s"
  # username = "telegraf"
  # password = "metricsmetricsmetricsmetrics"
  # Set the user agent for HTTP POSTs (can be useful for log differentiation)
  user_agent = "telegraf"
  # Set UDP payload size, defaults to InfluxDB UDP Client default (512 bytes)
  # udp_payload = 512

###############################################################################
#                                  INPUTS                                     #
###############################################################################

# Read metrics about cpu usage
[[inputs.cpu]]
  # Whether to report per-cpu stats or not
  percpu = true
  # Whether to report total system cpu stats or not
  totalcpu = true
  # Comment this line if you want the raw CPU time metrics
  drop = ["time_*"]

# Read metrics about disk usage by mount point
[[inputs.disk]]
  # By default, telegraf gather stats for all mountpoints.
  # Setting mountpoints will restrict the stats to the specified mountpoints.
  # mount_points=["/"]

# Read metrics about disk IO by device
[[inputs.diskio]]
  # By default, telegraf will gather stats for all devices including
  # disk partitions.
  # Setting devices will restrict the stats to the specified devices.
  # devices = ["sda", "sdb"]
  # Uncomment the following line if you do not need disk serial numbers.
  # skip_serial_number = true

# Read metrics about memory usage
[[inputs.mem]]
  # no configuration

# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration

# Read metrics about system load & uptime
[[inputs.system]]
  # no configuration

# Read metrics about TCP status such as established, time wait etc and UDP sockets counts.
[[inputs.netstat]]
  # no configuration

# Read Nginx's basic status information (ngx_http_stub_status_module)
[[inputs.nginx]]
  # An array of Nginx stub_status URI to gather stats.
    urls = ["http://localhost:82/status"]

# Read Apache status information (mod_status)
[[inputs.apache]]
  # An array of Apache status URI to gather stats.
  urls = ["http://localhost/server-status?auto"]

# Monitor process cpu and memory usage
[[inputs.procstat]]
  # Must specify one of: pid_file, exe, or pattern
  # PID file to monitor process
  pid_file = "/var/run/nginx.pid"
  # executable name (ie, pgrep <exe>)
  # exe = "nginx"
  # pattern as argument for pgrep (ie, pgrep -f <pattern>)
  # pattern = "nginx"

  # Field name prefix
  prefix = ""

[[inputs.procstat]]
  pid_file = "/var/run/apache2/apache2.pid"

[[inputs.procstat]]
  pid_file = "/var/run/telegraf/telegraf.pid"
###############################################################################
#                              SERVICE INPUTS                                 #
###############################################################################

2 - Telegraf version:

aptitude show telegraf
Package: telegraf
New: yes
State: installed
Automatically installed: no
Version: 0.10.3-1
Priority: extra
Section: default
Maintainer: support@influxdb.com
Architecture: amd64
Uncompressed Size: 27.7 M
Conflicts: telegraf
Description: Plugin-driven server agent for reporting metrics into InfluxDB.

Homepage: https://github.com/influxdata/telegraf

3 - OS: Ubuntu 14.04.4 LTS
4 - I do not remember.

Dual-Boot commented 8 years ago

One last detail: go version reports go1.2.1 linux/amd64

sparrc commented 8 years ago

Could be from garbage collection, could you run with GOGC=off set?

do:

$ vim /etc/init.d/telegraf +141

and replace

nohup $daemon -pidfile $pidfile -config $config -config-directory $confdir $TELEGRAF_OPTS >>$STDOUT 2>>$STDERR &

with:

nohup env GOGC=off $daemon -pidfile $pidfile -config $config -config-directory $confdir $TELEGRAF_OPTS >>$STDOUT 2>>$STDERR &
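
If that makes no difference, one way to double-check whether the Go garbage collector is involved at all is a one-off foreground run with GC tracing enabled; GODEBUG=gctrace=1 makes the Go runtime print a summary line to stderr for every collection. A sketch, assuming the stock paths from the .deb package:

$ sudo service telegraf stop
$ GODEBUG=gctrace=1 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d 2> /tmp/gctrace.log

If the CPU spikes line up with bursts of trace lines in /tmp/gctrace.log, GC is a plausible culprit; if the trace stays quiet during a spike, look elsewhere.
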
sparrc commented 8 years ago

BTW, are you building from source or using the .deb packages?

Dual-Boot commented 8 years ago

I will try to modify the init file. Telegraf is installed from the .deb package, from the official influxdata repo.

Dual-Boot commented 8 years ago

I have made the modification; I will report back soon. Thanks.

Dual-Boot commented 8 years ago

Nothing changed. I still have the CPU load.

Dual-Boot commented 8 years ago

Hi,

For information: with the latest update I still have the problem on one machine, which had not been rebooted after the new kernel version and Telegraf version were installed:

aptitude show telegraf
Package: telegraf
New: yes
State: installed
Automatically installed: no
Version: 0.10.4.1-1
Priority: extra
Section: default
Maintainer: support@influxdb.com
Architecture: amd64
Uncompressed Size: 29.1 M
Conflicts: telegraf
Description: Plugin-driven server agent for reporting metrics into InfluxDB.

Homepage: https://github.com/influxdata/telegraf

Here is a picture of the server's CPU load behaviour after the reboot: (screenshot: selection_034)

Regards,

Dual-Boot commented 8 years ago

And now, after rebooting with the new kernel on the second server, srv-master01 (with just two main applications, Bind and a Puppet master with mod_passenger enabled, and of course telegraf ;-) ). The reboot took place around 21:30:

sparrc commented 8 years ago

Thanks for the info @Dual-Boot. So what is the conclusion? It appears you are still getting CPU spikes, but they are less severe?

Dual-Boot commented 8 years ago

For the moment it seems that the latest package performs better in terms of CPU usage, and I am no longer sure it was linked to the kernel either.

For the moment none of my machines have raised an alert. I want to wait a few days to confirm the improvement.

Regards,

Dual-Boot commented 8 years ago

So I'm back ;)

Well, since my upgrade and reboot it seems to be better:

titilambert commented 8 years ago

@Dual-Boot Could you confirm that v0.10.4 fixed the bug?

Dual-Boot commented 8 years ago

For the moment I think performance is better, with a decrease in the intensity of the CPU spikes, but they seem to be starting again. (screenshot: selection_039)

And compare with the load on the same machine over the same period: (screenshot: selection_040)

I think I am going to set up a dedicated server to monitor only Telegraf's behaviour.

Dual-Boot commented 8 years ago

Here is another server sample with the same pattern of CPU spikes: (screenshot: selection_041)

Dual-Boot commented 8 years ago

Some news: procstat for Telegraf, plus CPU load and CPU usage, for one server (DNS + Puppet master, only called on demand): (screenshots: selection_042, selection_043, selection_044)

Performance is better, but I still think the Telegraf agent must be improved. I am staying tuned in case I can help investigate further.

Regards,

Eulerizeit commented 8 years ago

I was running telegraf on a 5-node cluster with a Greenplum database running a 100 GB TPC-DS workload in the background. I was seeing the high CPU utilization across nodes but was only saving the output on one of the nodes.

Next steps are to hook up the different collectors to Kafka. I can certainly include Telegraf in that testing and I'll be able to forward whatever metrics you're interested in.

Dual-Boot commented 8 years ago

Hello,

I updated just yesterday, and since then it's worse. The version is 0.12.0-1. Here is the graph of load + procstat for telegraf: (screenshot: selection_045) And below, three days of recorded CPU load spanning the telegraf package upgrade: (screenshot: selection_046)

Regards,

sparrc commented 8 years ago

@Dual-Boot Before, when your CPU load dropped, there were no changes that would have reduced CPU load, and now likewise there are no changes that should have increased it. I can't help feeling this data is a bit random... Are the same results seen consistently across multiple servers? Have you tried service restarts to see what the effect is?

Dual-Boot commented 8 years ago

Look here, I managed to capture the load with htop: (screenshot: selection_047)

Dual-Boot commented 8 years ago

@sparrc Yes, I have the same result on every server running the Telegraf agent. My Shinken notifications are stressed ;-) Restarting the service or rebooting the server does not change anything.

Dual-Boot commented 8 years ago

Hi,

I ran several tests and after a while noticed something: the CPU spikes appear just after this kind of line in the log:

2016/04/20 00:21:20 Wrote 19 metrics to output influxdb in 119.515572ms
2016/04/20 00:21:20 Gathered metrics, (10s interval), from 14 inputs in 761.306633ms

In addition, I tried enabling debug mode too, but with no success:

  ## Run telegraf in debug mode
  debug = true

Regards,

Dual-Boot commented 8 years ago

To be more precise: this kind of line is followed by a CPU spike:

2016/04/20 00:31:00 Gathered metrics, (10s interval), from 14 inputs in 491.592139ms

sparrc commented 8 years ago

@Dual-Boot a couple things to try:

  1. upgrade to version 0.12.1, which no longer uses lsof in the netstat plugin
  2. add a "collection jitter" to your [agent] configuration. You can see what this looks like in the latest config file here: https://github.com/influxdata/telegraf/blob/master/etc/telegraf.conf
Dual-Boot commented 8 years ago

Well, well, I think I have found the problem. Setting round_interval to false seems to fix it. I made the change on one server and there are no more CPU spikes on it. I have pushed this setting to every server and will report back tomorrow.

Good night.

EDIT: I did not see your message ;-)

sparrc commented 8 years ago

hmm, that's surprising to me, are you running multiple telegraf instances on the same host?

FYI, I think this might be another coincidence. round_interval=true literally does nothing more than tell the telegraf agent to sleep until the next 10s interval boundary when it starts up.
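
For reference, a minimal Go sketch (not Telegraf's actual code) of that start-up alignment; all it buys you is that collections land on round wall-clock boundaries:

package main

import (
	"fmt"
	"time"
)

// sleepUntilAligned blocks until the next multiple of interval on the
// wall clock, e.g. :00, :10, :20 for a 10s interval.
func sleepUntilAligned(interval time.Duration) {
	now := time.Now()
	next := now.Truncate(interval).Add(interval)
	time.Sleep(next.Sub(now))
}

func main() {
	sleepUntilAligned(10 * time.Second)
	fmt.Println("first collection at", time.Now().Format("15:04:05"))
}

The flip side of that alignment is that every input then wakes up at exactly the same moment, which is what collection_jitter exists to spread out.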

Dual-Boot commented 8 years ago

Hello,

Since yesterday it has been OK.

are you running multiple telegraf instances on the same host?

=>

ps aux | grep telegraf
telegraf 16508  0.3  1.9 203472 30144 ?        Sl   00:37   2:28 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d

But lsof | grep "/bin/telegraf" shows:

telegraf  16508         telegraf  txt       REG              253,3  32639042     523410 /usr/bin/telegraf
telegraf  16508 16509   telegraf  txt       REG              253,3  32639042     523410 /usr/bin/telegraf
telegraf  16508 16510   telegraf  txt       REG              253,3  32639042     523410 /usr/bin/telegraf
telegraf  16508 16511   telegraf  txt       REG              253,3  32639042     523410 /usr/bin/telegraf
telegraf  16508 16512   telegraf  txt       REG              253,3  32639042     523410 /usr/bin/telegraf
telegraf  16508 16513   telegraf  txt       REG              253,3  32639042     523410 /usr/bin/telegraf
telegraf  16508 16514   telegraf  txt       REG              253,3  32639042     523410 /usr/bin/telegraf
telegraf  16508 16537   telegraf  txt       REG              253,3  32639042     523410 /usr/bin/telegraf
telegraf  16508 16538   telegraf  txt       REG              253,3  32639042     523410 /usr/bin/telegraf
telegraf  16508 16543   telegraf  txt       REG              253,3  32639042     523410 /usr/bin/telegraf
telegraf  16508 16707   telegraf  txt       REG              253,3  32639042     523410 /usr/bin/telegraf
telegraf  16508 24345   telegraf  txt       REG              253,3  32639042     523410 /usr/bin/telegraf
telegraf  16508 31835   telegraf  txt       REG              253,3  32639042     523410 /usr/bin/telegraf

So, one process and 12 threads.

What do you think of this?

OrangeTux commented 8 years ago

I'm having the same issue on multiple systems. It seems that the amplitude of the load spikes is related to the number of input plugins. Below you can see the load on a system; at 21:55 I restarted Telegraf with fewer input plugins configured. Both configurations are below. (screenshot: load)

$ uname -a
Linux buildroot 4.0.8 #6 Tue Mar 8 16:25:31 UTC 2016 armv5tejl GNU/Linux

I built Telegraf at commit 36b9e2e077fe45.

# Telegraf Configuration
#
# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.
#
# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.
#
# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.
#
# Environment variables can be used anywhere in this config file, simply prepend
# them with $. For strings the variable must be within quotes (ie, "$STR_VAR"),
# for numbers and booleans they should be plain (ie, $INT_VAR, $BOOL_VAR)

# Global tags can be specified here in key="value" format.
[global_tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"
  ## Environment variables can be used as tags, and throughout the config file
  # user = "$USER"
  project = "xxxxx"

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at
  ## most metric_batch_size metrics.
  metric_batch_size = 1000
  ## For failed writes, telegraf will cache metric_buffer_limit metrics for each
  ## output, and will flush this buffer on a successful write. Oldest metrics
  ## are dropped first when this buffer fills.
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## Run telegraf in debug mode
  debug = false
  ## Run telegraf in quiet mode
  quiet = false
  ## Override default hostname, if empty use os.Hostname()
  hostname = "kas"
  ## If set to true, do no set the "host" tag in the telegraf agent.
  omit_hostname = false

###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################

# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  ## The full HTTP or UDP endpoint URL for your InfluxDB instance.
  ## Multiple urls can be specified as part of the same cluster,
  ## this means that only ONE of the urls will be written to each interval.
  # urls = ["udp://localhost:8089"] # UDP endpoint example
  urls = ["http://x.x.x.x:8086"]
  ## The target database for metrics (telegraf will create it if not exists).
  database = "rio" # required
  ## Retention policy to write to.
  retention_policy = "default"
  ## Precision of writes, valid values are "ns", "us" (or "µs"), "ms", "s", "m", "h".
  ## note: using "s" precision greatly improves InfluxDB compression.
  precision = "s"

  ## Write timeout (for the InfluxDB client), formatted as a string.
  ## If not provided, will default to 5s. 0s means no timeout (not recommended).
  timeout = "5s"
  username = "rio"
  password = "**********"
  ## Set the user agent for HTTP POSTs (can be useful for log differentiation)
  # user_agent = "telegraf"
  ## Set UDP payload size, defaults to InfluxDB UDP Client default (512 bytes)
  # udp_payload = 512

  ## Optional SSL Config
  # ssl_ca = "/etc/telegraf/ca.pem"
  # ssl_cert = "/etc/telegraf/cert.pem"
  # ssl_key = "/etc/telegraf/key.pem"
  ## Use SSL but skip chain & host verification
  # insecure_skip_verify = false

###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## Comment this line if you want the raw CPU time metrics
  fielddrop = ["time_*"]

# Read metrics about disk usage by mount point
[[inputs.disk]]
  ## By default, telegraf gather stats for all mountpoints.
  ## Setting mountpoints will restrict the stats to the specified mountpoints.
  mount_points = ["/", "/tmp"]

  ## Ignore some mountpoints by filesystem type. For example (dev)tmpfs (usually
  ## present on /run, /var/run, /dev/shm or /dev).
  #ignore_fs = ["tmpfs", "devtmpfs"]

# Read metrics about disk IO by device
[[inputs.diskio]]
  ## By default, telegraf will gather stats for all devices including
  ## disk partitions.
  ## Setting devices will restrict the stats to the specified devices.
  # devices = ["sda", "sdb"]
  ## Uncomment the following line if you do not need disk serial numbers.
  # skip_serial_number = true

# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  # no configuration

# Read metrics about memory usage
[[inputs.mem]]
  # no configuration

# Get the number of processes and group them by status
[[inputs.processes]]
  # no configuration

# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration

# Read metrics about system load & uptime
[[inputs.system]]
  # no configuration

# Read stats about given file(s)
[[inputs.filestat]]
  ## Files to gather stats about.
  ## These accept standard unix glob matching rules, but with the addition of
  ## ** as a "super asterisk". ie:
  ##   "/var/log/**.log"  -> recursively find all .log files in /var/log
  ##   "/var/log/*/*.log" -> find all .log files with a parent dir in /var/log
  ##   "/var/log/apache.log" -> just tail the apache log file
  ##
  ## See https://github.com/gobwas/glob for more examples
  ## 
  files = ["/tmp/supervisor/**.log"]
  ## If true, read the entire file and calculate an md5 checksum.
  md5 = false

# Read metrics about network interface usage
[[inputs.net]]
  ## By default, telegraf gathers stats from any up interface (excluding loopback)
  ## Setting interfaces will tell it to gather these explicit interfaces,
  ## regardless of status.
  ##
  # interfaces = ["eth0"]

# Read TCP metrics such as established, time wait and sockets counts.
[[inputs.netstat]]
  # no configuration

# # Read Nginx's basic status information (ngx_http_stub_status_module)
# [[inputs.nginx]]
#   ## An array of Nginx stub_status URI to gather stats.
#   urls = ["http://localhost/status"]

# Monitor process cpu and memory usage
[[inputs.procstat]]
   exe = "dynamo"

  ## Field name prefix
  prefix = ""
  ## comment this out if you want raw cpu_time stats
  fielddrop = ["cpu_time_*"]

# Monitor process cpu and memory usage
[[inputs.procstat]]
   exe = "telegraf"

  ## Field name prefix
  prefix = ""
  ## comment this out if you want raw cpu_time stats
  fielddrop = ["cpu_time_*"]

# Monitor process cpu and memory usage
[[inputs.procstat]]
   exe = "turbine"

  ## Field name prefix
  prefix = ""
  ## comment this out if you want raw cpu_time stats
  fielddrop = ["cpu_time_*"]

# Monitor process cpu and memory usage
[[inputs.procstat]]
   exe = "json_rpc_server"

  ## Field name prefix
  prefix = ""
  ## comment this out if you want raw cpu_time stats
  # fielddrop = ["cpu_time_*"]

# Read metrics from one or many redis servers
[[inputs.redis]]
  ## specify servers via a url matching:
  ##  [protocol://][:password]@address[:port]
  ##  e.g.
  ##    tcp://localhost:6379
  ##    tcp://:password@192.168.99.100
  ##
  ## If no servers are specified, then localhost is used as the host.
  ## If no port is specified, 6379 is used
  servers = ["tcp://localhost:6379"]

And the configuration after 21:55:

# Telegraf Configuration
#
# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.
#
# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.
#
# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.
#
# Environment variables can be used anywhere in this config file, simply prepend
# them with $. For strings the variable must be within quotes (ie, "$STR_VAR"),
# for numbers and booleans they should be plain (ie, $INT_VAR, $BOOL_VAR)

# Global tags can be specified here in key="value" format.
[global_tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"
  ## Environment variables can be used as tags, and throughout the config file
  # user = "$USER"
  project = "xxxx"

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at
  ## most metric_batch_size metrics.
  metric_batch_size = 1000
  ## For failed writes, telegraf will cache metric_buffer_limit metrics for each
  ## output, and will flush this buffer on a successful write. Oldest metrics
  ## are dropped first when this buffer fills.
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## Run telegraf in debug mode
  debug = false
  ## Run telegraf in quiet mode
  quiet = true
  ## Override default hostname, if empty use os.Hostname()
  hostname = "kas"
  ## If set to true, do no set the "host" tag in the telegraf agent.
  omit_hostname = false

###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################

# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  ## The full HTTP or UDP endpoint URL for your InfluxDB instance.
  ## Multiple urls can be specified as part of the same cluster,
  ## this means that only ONE of the urls will be written to each interval.
  # urls = ["udp://localhost:8089"] # UDP endpoint example
  urls = ["http://xxxx.nl:8086"]
  ## The target database for metrics (telegraf will create it if not exists).
  database = "rio" # required
  ## Retention policy to write to.
  retention_policy = "default"
  ## Precision of writes, valid values are "ns", "us" (or "µs"), "ms", "s", "m", "h".
  ## note: using "s" precision greatly improves InfluxDB compression.
  precision = "s"

  ## Write timeout (for the InfluxDB client), formatted as a string.
  ## If not provided, will default to 5s. 0s means no timeout (not recommended).
  timeout = "5s"
  username = "rio"
  password = "xxxx"
  ## Set the user agent for HTTP POSTs (can be useful for log differentiation)
  # user_agent = "telegraf"
  ## Set UDP payload size, defaults to InfluxDB UDP Client default (512 bytes)
  # udp_payload = 512

  ## Optional SSL Config
  # ssl_ca = "/etc/telegraf/ca.pem"
  # ssl_cert = "/etc/telegraf/cert.pem"
  # ssl_key = "/etc/telegraf/key.pem"
  ## Use SSL but skip chain & host verification
  # insecure_skip_verify = false

###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

[[inputs.system]]

# Read metrics about network interface usage
[[inputs.net]]
  ## By default, telegraf gathers stats from any up interface (excluding loopback)
  ## Setting interfaces will tell it to gather these explicit interfaces,
  ## regardless of status.
  ##
  # interfaces = ["eth0"]
OrangeTux commented 8 years ago

Running Telegraf with GOGC=off and with round_interval=false both failed to solve the problem. I also ran Telegraf with GODEBUG=gctrace=1 yesterday, and this didn't reveal any weird behaviour.

sparrc commented 8 years ago

@OrangeTux could you try running with collection_jitter?

OrangeTux commented 8 years ago

@sparrc This seemed to solve the spikes. Thank you.

Dual-Boot commented 8 years ago

Hi,

How do I run with collection_jitter? Here is my [agent] config:

[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  # round_interval = true
  round_interval = false

  ## Telegraf will send metrics to outputs in batches of at
  ## most metric_batch_size metrics.
  metric_batch_size = 1000
  ## For failed writes, telegraf will cache metric_buffer_limit metrics for each
  ## output, and will flush this buffer on a successful write. Oldest metrics
  ## are dropped first when this buffer fills.
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## Run telegraf in debug mode
  debug = false
  ## Run telegraf in quiet mode
  quiet = false
  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do no set the "host" tag in the telegraf agent.
  omit_hostname = false

And for information, turning round_interval to false solved my problem.

Regards,

sparrc commented 8 years ago

Try setting round_interval=true and set jitter to something like collection_jitter="3s"
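
Spelled out, the [agent] section would then look something like this (a sketch combining the values suggested above with the 10s interval already in use):

[agent]
  interval = "10s"
  ## collect on :00, :10, :20, ... again
  round_interval = true
  ## each input sleeps a random 0-3s before gathering, so the inputs
  ## stop hitting procfs at exactly the same instant
  collection_jitter = "3s"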

sparrc commented 8 years ago

any updates on this @Dual-Boot? It sounds like setting collection_jitter might help your issue. It could be that querying procfs at the same time from all plugins is stressing your system.

Dual-Boot commented 8 years ago

Sorry for the delay. I have just done what you advised.

I'll be back ;-)

Dual-Boot commented 8 years ago

I pushed the modification to every server half an hour ago, and it works.

sparrc commented 8 years ago

OK, closing as resolved by using the collection_jitter config option

nhannguyen commented 7 years ago

Hi, I have the same issue on almost all my servers. I've already tried setting collection_jitter = "5s" and round_interval = true. The spikes still appear, but with lower peaks. On the server with the highest load, the spikes don't change at all.

Here's my config

[agent]
  collection_jitter = "5s"
  flush_interval = "10s"
  flush_jitter = "5s"
  interval = "10s"
  round_interval = true
[tags]

And the screenshots

(screenshots: 2016-12-27 09:45:04 and 2016-12-27 09:39:40)

I installed the latest Telegraf from the influxdata repo.

sparrc commented 7 years ago

@nhannguyen 0.1-0.2 is a very small load average... What level are you expecting?

Can you provide your full configuration? It would help to know which plugins you are using.

akaDJon commented 6 years ago

Is the problem solved?

melroy89 commented 6 years ago

@akaDJon I guess not. I also see a lot of CPU and memory usage (relatively speaking). Maybe I'm polling too often and/or have too many telegraf inputs active.