influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

The impact of hard disk performance on Telegraf #7343

Closed: donghyun-ji closed this issue 8 years ago

donghyun-ji commented 8 years ago

Directions

I am carrying out a test of how much data can be inserted into InfluxDB per second using Telegraf.

Bug report

System Info

1) OS: Ubuntu 14.04
2) InfluxDB: v1.0
3) Telegraf: v1.0
4) Telegraf input: udp_listener
5) UDP application: Python
6) Storage: Hard Disk Drive (HDD)

The Telegraf input was set to udp_listener, and I ran a Python program that sends data over UDP.

Afterwards, I counted the points that were inserted into InfluxDB. Some of the data was lost (10,000 points were sent, but only 9,866 arrived). On the other hand, no loss occurs when the same test is run on a solid-state drive (SSD).
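
For reference, the sending side of this kind of test can be approximated with the sketch below. This is an assumed reconstruction, not the original script: it emits line-protocol points at Telegraf's udp_listener address (":8092", as configured below), and the measurement name "udp_test" and field "value" are hypothetical.

```python
# Minimal sketch of a UDP load generator for this test (assumed, not the
# original script). It sends InfluxDB line-protocol points to Telegraf's
# udp_listener input.
import socket
import time

TELEGRAF_ADDR = ("127.0.0.1", 8092)  # assumed host; port matches service_address = ":8092"

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for i in range(10000):
    # "udp_test" and "value" are hypothetical measurement/field names
    point = "udp_test value={}i {}".format(i, int(time.time() * 1e9))
    sock.sendto(point.encode("utf-8"), TELEGRAF_ADDR)
sock.close()
```

The received count can then be checked on the InfluxDB side with a query such as `SELECT COUNT(value) FROM udp_test` and compared against the 10,000 points that were sent.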

Why is the data being lost? Is hard disk performance related to the data loss?

Telegraf agent settings

[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at
  ## most metric_batch_size metrics.
  metric_batch_size = 1000
  ## For failed writes, telegraf will cache metric_buffer_limit metrics for each
  ## output, and will flush this buffer on a successful write. Oldest metrics
  ## are dropped first when this buffer fills.
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default, precision will be set to the same timestamp order as the
  ## collection interval, with the maximum being 1s.
  ## Precision will NOT be used for service inputs, such as logparser and statsd.
  ## Valid values are "ns", "us" (or "µs"), "ms", "s".
  precision = ""
  ## Run telegraf in debug mode
  debug = false
  ## Run telegraf in quiet mode
  quiet = false
  ## Override default hostname, if empty use os.Hostname()

Telegraf input settings

# Generic UDP listener
[[inputs.udp_listener]]
  ## Address and port to host UDP listener on
  service_address = ":8092"

  ## Number of UDP messages allowed to queue up. Once filled, the
  ## UDP listener will start dropping packets.
  allowed_pending_messages = 10000

  ## Data format to consume.
  ## Each data format has its own unique set of configuration options; read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
  data_format = "influx"

phemmer commented 8 years ago

Telegraf has its own issue tracker over at https://github.com/influxdata/telegraf/issues. The issue here sounds like socket buffer size: if the UDP socket buffer fills up before the app drains it, packets will be dropped. The allowed_pending_messages setting controls a different buffer inside the application, one that comes after the OS socket buffer, and Telegraf does not have a way of controlling the socket buffer itself. If reliability is a concern, you probably want to use a reliable protocol like TCP instead. While changes could be made to telegraf to decrease the likelihood of packets being dropped, it'll never be completely preventable as long as you are still using UDP.
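
To make the distinction concrete, here is a small illustrative sketch of the OS-level socket buffer being described; this is not Telegraf's udp_listener code, and the buffer size and port are assumptions.

```python
# Illustrative sketch of the kernel receive buffer (SO_RCVBUF), which sits in
# front of any application-side queue such as allowed_pending_messages. Once
# this buffer fills, the kernel silently drops further datagrams.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Ask for a larger kernel receive buffer (8 MiB here is an arbitrary example).
# On Linux the effective size is capped by the net.core.rmem_max sysctl.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 8 * 1024 * 1024)
sock.bind(("0.0.0.0", 8092))

print("effective receive buffer:",
      sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
```

Kernel-level UDP drops are counted as RcvbufErrors in /proc/net/snmp (also visible via `netstat -su`), which is one way to confirm whether this buffer, rather than Telegraf itself, is where points are being lost.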

jsternberg commented 8 years ago

I think I agree with @phemmer here. This question may be better suited for the mailing list, though. Since you're using UDP, you should not expect any guarantee that all points will reach their destination. If you need reliability, use TCP.
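
As a sketch of what the TCP route could look like, line-protocol points can be written directly to InfluxDB's HTTP /write endpoint, which runs over TCP and returns a status code for every batch. The database name "mydb" and the measurement/field names below are assumptions.

```python
# Hedged sketch: write points over HTTP (TCP) instead of fire-and-forget UDP,
# so the sender gets an acknowledgement (or an error) for each batch.
import time
import urllib.request

base_ts = int(time.time() * 1e9)
lines = "\n".join(
    "udp_test value={}i {}".format(i, base_ts + i) for i in range(10000)
)
req = urllib.request.Request(
    "http://localhost:8086/write?db=mydb&precision=ns",
    data=lines.encode("utf-8"),  # POST body in line protocol
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # InfluxDB 1.x returns 204 No Content on success
```

A failed write surfaces as a non-2xx response (urllib raises an HTTPError), which is exactly the feedback a UDP sender never gets.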