influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Invalid measurements found on Chronograf hours after setting up #3157

Closed sebastianarena closed 7 years ago

sebastianarena commented 7 years ago

Hello team.

I set up a TICK stack a little over a week ago, with a single server tracking everything so I could play with the platform. I loved it and decided to install Telegraf on each of the servers I manage. Each server was configured to collect the basic Telegraf system metrics plus the mongodb and redis inputs.

After leaving that running over the weekend, I came back to find a lot of invalid measurements in Chronograf.

[Screenshots from 2017-08-22: Chronograf listing the fragmented, invalid measurement names]

These are clearly fragments of the 127.0.0.1:27017 measurements reported by the mongodb input, cut up into pieces for some reason.

I wiped the telegraf database and started from scratch. After several hours, the problem happened again.

From what I can tell, measurements are being "cut off" at some point, and from then on a lot of invalid data is generated that InfluxDB cannot parse properly but somehow stores anyway.

I found log entries stating that the data could not be parsed correctly; they show truncated data of some sort (see below).

Is there anything I should change in the configuration to prevent this from happening? Or is this a bug?

Thanks!

2017-08-23T02:01:40Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse 'objects=1185290i,indexes=51i,type="db_stat",data_size=616619670i,index_size=87994368i,collections=14i,avg_obj_size=520.2268390014258,ok=1i,storage_size=207269888i,num_extents=0i 1503453700000000000': invalid field format dropped=0]
2017-08-23T02:03:20Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse '29224960i,indexes=37i,ok=1i,index_size=24461312i 1503453800000000000': invalid field format dropped=0]
2017-08-23T02:03:50Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse 'e_size=29224960i,num_extents=0i,index_size=24461312i,collections=9i,indexes=37i,type="db_stat" 1503453830000000000': invalid field format dropped=0]
2017-08-23T02:04:30Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse '=520.2268702174151,storage_size=207269888i,num_extents=0i,type="db_stat",data_size=616619707i,indexes=51i,ok=1i 1503453870000000000': invalid field format dropped=0]
2017-08-23T02:09:50Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse 'orage_size=29224960i,num_extents=0i,index_size=24461312i,collections=9i,ok=1i 1503454190000000000': invalid field format dropped=0]
2017-08-23T02:10:20Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse 'tat" 1503454220000000000': invalid field format dropped=0]
2017-08-23T02:13:10Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse 'ize=36864i,index_size=229376i,objects=2i,ok=1i 1503454390000000000': invalid field format dropped=0]
2017-08-23T02:16:10Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse '87733908,num_extents=0i,data_size=54886210i,storage_size=29224960i,indexes=37i,ok=1i,index_size=24461312i,collections=9i,objects=201254i 1503454570000000000': invalid field format dropped=0]
goller commented 7 years ago

Hi @sebastianarena, thanks for writing in!

Yikes, there must be some sort of line protocol issue when Telegraf sends data to InfluxDB.
Do you mind sending your telegraf.conf?

Also, I'd like to see a bit of output from the Telegraf plugins. To get some data out, would you run this two or three times and send the results:

telegraf --config telegraf.conf -test
danielnelson commented 7 years ago

There is a bug that looks similar to this which has not yet been isolated. It seems to occur when writing through nginx with HTTP/2: https://github.com/influxdata/telegraf/issues/2854

Any chance you are also using HTTP2 and nginx?

sebastianarena commented 7 years ago

Hi! YES! I'm using nginx with HTTP/2 enabled by default! Let me try removing HTTP/2 and check whether it solves the problem.
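
For reference, the change on my side is just dropping the http2 flag from the TLS listener that fronts InfluxDB; roughly like this (the server name and upstream address here are placeholders, not my real config):

# Before: HTTP/2 negotiated on the listener in front of InfluxDB
server {
    listen 443 ssl http2;
    server_name influx.example.com;          # placeholder
    location / {
        proxy_pass http://127.0.0.1:8086;    # InfluxDB's default HTTP port
    }
}

# After: http2 removed, so Telegraf talks plain HTTPS (HTTP/1.1)
server {
    listen 443 ssl;
    server_name influx.example.com;          # placeholder
    location / {
        proxy_pass http://127.0.0.1:8086;
    }
}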

sebastianarena commented 7 years ago

Just did a couple of tests. It seems to have helped, but not all the way. At least once a minute, one server reports an HTTP 400 error, and an invalid measurement gets created right after. For the record, have a look:

2017-08-23T21:03:30Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse '39287i,interrupts=341323616i,context_switches=560217853i,boot_time=1502293697i 1503522210000000000': invalid field format dropped=0]

[Screenshot from 2017-08-23: the invalid measurement created after the HTTP 400 error]

Here's my telegraf.conf:

[global_tags]
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = ""
  hostname = ""
  omit_hostname = false
[[outputs.influxdb]]
  urls = ["<redacted>"] # required
  database = "telegraf" # required
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"
  username = "<redacted>"
  password = "<redacted>"
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs"]
[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.internal]]
  collect_memstats = true
[[inputs.mongodb]]
  servers = ["127.0.0.1:27017"]
  gather_perdb_stats = true
[[inputs.redis]]
  servers = ["tcp://localhost:6379"]

Here's sample output, as @goller requested, from the server that seems to show the most problems in the logs:

2017/08/23 21:08:42 I! Using config file: /etc/telegraf/telegraf.conf
* Plugin: inputs.mongodb, Collection 1
2017-08-23T21:08:42Z W! [inputs.mongodb] Using "mongodb://127.0.0.1:27017" as connection URL; please update your configuration to use an URL
* Plugin: inputs.mongodb, Collection 2
> mongodb,hostname=127.0.0.1:27017,host=redacted net_out_bytes=24025i,deletes_per_sec=0i,vsize_megabytes=1119i,percent_cache_used=5.3,getmores_per_sec=0i,active_writes=0i,queued_reads=0i,ttl_deletes_per_sec=0i,active_reads=0i,commands_per_sec=7i,updates_per_sec=0i,ttl_passes_per_sec=0i,resident_megabytes=193i,flushes_per_sec=0i,queries_per_sec=0i,queued_writes=0i,open_connections=8i,inserts_per_sec=0i,jumbo_chunks=0i,percent_cache_dirty=0,net_in_bytes=472i 1503522523000000000
> mongodb_db_stats,host=redacted,db_name=admin,hostname=127.0.0.1:27017 ok=1i,objects=1i,type="db_stat",num_extents=0i,avg_obj_size=59,data_size=59i,indexes=2i,index_size=32768i,storage_size=16384i,collections=1i 1503522523000000000
> mongodb_db_stats,hostname=127.0.0.1:27017,host=redacted,db_name=redacted collections=7i,objects=3295i,data_size=29828269i,num_extents=0i,ok=1i,index_size=569344i,type="db_stat",avg_obj_size=9052.585432473445,storage_size=20119552i,indexes=28i 1503522523000000000
> mongodb_db_stats,hostname=127.0.0.1:27017,host=redacted,db_name=local type="db_stat",collections=1i,storage_size=16384i,ok=1i,index_size=16384i,objects=1i,avg_obj_size=1657,data_size=1657i,num_extents=0i,indexes=1i 1503522523000000000
* Plugin: inputs.cpu, Collection 1
* Plugin: inputs.cpu, Collection 2
> cpu,host=redacted,cpu=cpu0 usage_softirq=0,usage_user=0,usage_system=0,usage_nice=0,usage_iowait=0,usage_irq=0,usage_idle=100,usage_steal=0,usage_guest=0,usage_guest_nice=0 1503522524000000000
> cpu,cpu=cpu1,host=redacted usage_irq=0,usage_guest_nice=0,usage_nice=0,usage_system=0,usage_idle=100,usage_iowait=0,usage_softirq=0,usage_steal=0,usage_guest=0,usage_user=0 1503522524000000000
> cpu,cpu=cpu-total,host=redacted usage_system=0,usage_idle=100,usage_irq=0,usage_steal=0,usage_guest=0,usage_user=0,usage_nice=0,usage_iowait=0,usage_softirq=0,usage_guest_nice=0 1503522524000000000
* Plugin: inputs.kernel, Collection 1
> kernel,host=redacted interrupts=341422671i,context_switches=560370475i,boot_time=1502293697i,processes_forked=1039626i 1503522524000000000
* Plugin: inputs.mem, Collection 1
> mem,host=redacted buffered=83603456i,inactive=1447612416i,used_percent=31.371947721004585,available=2841600000i,free=704352256i,cached=2127491072i,available_percent=68.62805227899541,total=4140580864i,used=1298980864i,active=1696927744i 1503522524000000000
* Plugin: inputs.processes, Collection 1
> processes,host=redacted sleeping=260i,total=260i,unknown=0i,total_threads=540i,running=0i,zombies=0i,stopped=0i,dead=0i,paging=0i,blocked=0i 1503522524000000000
* Plugin: inputs.swap, Collection 1
> swap,host=redacted used=0i,free=536866816i,used_percent=0,total=536866816i 1503522524000000000
> swap,host=redacted in=0i,out=0i 1503522524000000000
* Plugin: inputs.system, Collection 1
> system,host=redacted load1=0.01,load5=0.04,load15=0.05,n_users=1i,n_cpus=2i 1503522524000000000
> system,host=redacted uptime=1228826i,uptime_format="14 days,  5:20" 1503522524000000000
* Plugin: inputs.internal, Collection 1
> internal_memstats,host=redacted sys_bytes=12232952i,mallocs=57067i,heap_sys_bytes=9011200i,alloc_bytes=3825240i,heap_objects=13631i,num_gc=3i,heap_alloc_bytes=3825240i,heap_idle_bytes=3883008i,heap_in_use_bytes=5128192i,heap_released_bytes=0i,frees=43436i,pointer_lookups=1182i,total_alloc_bytes=9563888i 1503522524000000000
> internal_gather,input=internal,host=redacted metrics_gathered=1i 1503522524000000000
> internal_agent,host=redacted metrics_dropped=0i,metrics_gathered=15i,gather_errors=0i,metrics_written=0i 1503522524000000000
> internal_write,output=influxdb,host=redacted buffer_limit=10000i,write_time_ns=0i,metrics_written=0i,metrics_filtered=0i,buffer_size=0i 1503522524000000000
> internal_gather,input=processes,host=redacted metrics_gathered=1i 1503522524000000000
> internal_gather,host=redacted,input=swap metrics_gathered=2i 1503522524000000000
> internal_gather,input=system,host=redacted metrics_gathered=2i 1503522524000000000
> internal_gather,input=disk,host=redacted metrics_gathered=0i 1503522524000000000
> internal_gather,input=diskio,host=redacted metrics_gathered=0i 1503522524000000000
> internal_gather,host=redacted,input=mongodb metrics_gathered=4i 1503522524000000000
> internal_gather,input=cpu,host=redacted metrics_gathered=3i 1503522524000000000
> internal_gather,input=kernel,host=redacted metrics_gathered=1i 1503522524000000000
> internal_gather,input=mem,host=redacted metrics_gathered=1i 1503522524000000000
> internal_gather,input=redis,host=redacted metrics_gathered=0i 1503522524000000000
* Plugin: inputs.redis, Collection 1
> redis_keyspace,host=redacted,server=localhost,port=6379,replication_role=master,database=db0 keys=91i,expires=88i,avg_ttl=57713995i 1503522524000000000
> redis,server=localhost,port=6379,replication_role=master,host=redacted uptime=1227401i,rdb_current_bgsave_time_sec=-1i,rdb_last_save_time=1502295122i,sync_full=0i,keyspace_misses=2271984i,loading=0i,rdb_changes_since_last_save=4549087i,aof_pending_rewrite=0i,used_cpu_user=722.27,aof_rewrite_buffer_length=0i,aof_pending_bio_fsync=0i,evicted_keys=0i,keyspace_hits=2270404i,master_repl_offset=0i,maxmemory_policy="noeviction",aof_enabled=1i,aof_rewrite_in_progress=0i,sync_partial_ok=0i,keyspace_hitrate=0.49982608266841144,used_memory=827424i,rdb_last_bgsave_time_sec=-1i,aof_current_size=41783033i,expired_keys=3601i,client_biggest_input_buf=0i,mem_fragmentation_ratio=6.35,rejected_connections=0i,repl_backlog_active=0i,used_memory_peak=1650944i,aof_rewrite_scheduled=0i,pubsub_channels=1i,repl_backlog_size=1048576i,aof_current_rewrite_time_sec=-1i,instantaneous_output_kbps=0.03,connected_slaves=0i,repl_backlog_histlen=0i,used_cpu_sys=1093.44,client_longest_output_list=0i,aof_buffer_length=0i,total_commands_processed=11419912i,repl_backlog_first_byte_offset=0i,rdb_bgsave_in_progress=0i,aof_last_write_status="ok",aof_delayed_fsync=0i,total_net_output_bytes=156369890i,sync_partial_err=0i,clients=14i,used_memory_rss=5255168i,used_cpu_user_children=0,total_connections_received=51764i,instantaneous_ops_per_sec=12i,instantaneous_input_kbps=0.83,used_memory_lua=39936i,maxmemory=0i,rdb_last_bgsave_status="ok",pubsub_patterns=0i,latest_fork_usec=204i,blocked_clients=9i,total_system_memory=4140580864i,aof_last_bgrewrite_status="ok",aof_base_size=21516i,migrate_cached_sockets=0i,used_cpu_sys_children=0,lru_clock=10350299i,rdb_last_save_time_elapsed=1227401i,aof_last_rewrite_time_sec=0i,total_net_input_bytes=770295625i,cluster_enabled=0i 1503522524000000000
* Plugin: inputs.disk, Collection 1
> disk,device=root,fstype=ext4,host=redacted,path=/ used=2692468736i,used_percent=5.649412439311064,inodes_total=2954880i,inodes_free=2870082i,inodes_used=84798i,total=50226184192i,free=44966801408i 1503522524000000000
* Plugin: inputs.diskio, Collection 1
> diskio,name=sdb,host=redacted read_bytes=2256896i,write_bytes=0i,read_time=63i,write_time=0i,io_time=50i,writes=0i,iops_in_progress=0i,reads=88i 1503522524000000000
> diskio,name=sda,host=redacted write_time=1548376i,io_time=1486360i,reads=12704i,read_bytes=354658304i,write_bytes=96178057216i,read_time=18120i,writes=4443201i,iops_in_progress=0i 1503522524000000000
danielnelson commented 7 years ago

On the failing host, can you try to isolate the error to a single plugin?
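
If it helps, one way to do that (assuming a 1.x build) is to run the agent with a single input enabled via the input filter flag and watch whether the 400s still appear, e.g. with only mongodb:

telegraf --config telegraf.conf --input-filter mongodb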

danielnelson commented 7 years ago

It is possible that #3142 is related.

danielnelson commented 7 years ago

@sebastianarena #3142 is fixed. I think it is a long shot, but can you test with 1.4.0-rc2 to see if there is any improvement?

sebastianarena commented 7 years ago

Hello all. Here's what I'm trying to do at the moment.

Regarding the host that causes trouble: it was all of them. I have all our system logs centralized, and I can clearly see that every server got an HTTP 400 from InfluxDB at some point, which leads to the invalid measurements.

Regarding the exact plugin causing the issue: I only have the basic inputs, plus the mongodb and redis ones. I suspected those might be the trouble, but decided to try something else instead.

I shut down Telegraf on all the remote servers and left just the TICK stack on one server, plus one more server remotely pushing measurements. So far, an hour in, everything looks OK.

That seems to indicate the problem is not with Telegraf but with InfluxDB itself, which gets overloaded when multiple servers send their data at the same time. That worries me even more.

I'll leave this test running for the time being and add hosts one at a time until it breaks again. Any ideas? Maybe switch to UDP instead of HTTP?

danielnelson commented 7 years ago

UDP should be good for performance but is obviously less reliable. In 1.4 we are also adding gzip support, which should be an improvement if you are sending over a slow connection.
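
If you want to experiment with UDP, the change is roughly the following sketch; the port and database name are only examples, and InfluxDB needs a matching UDP listener enabled on its side:

# telegraf.conf: point the InfluxDB output at a UDP endpoint instead of HTTP(S)
[[outputs.influxdb]]
  urls = ["udp://<your-influx-host>:8089"]   # example port
  database = "telegraf"

# influxdb.conf: enable the corresponding UDP listener
[[udp]]
  enabled = true
  bind-address = ":8089"
  database = "telegraf"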

From what I have read, nginx only does HTTP/2 on the client side while the proxy side is HTTP/1.1, so it seems to me this must be a bug in either Telegraf or nginx, since I haven't had any reports of this happening in HTTP/1.1 mode.
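
In other words, the upstream leg of a typical nginx proxy in front of InfluxDB looks roughly like this (addresses are placeholders); even with http2 on the listen directive, the proxy_pass connection to InfluxDB stays on HTTP/1.x:

location / {
    proxy_http_version 1.1;              # nginx defaults to HTTP/1.0 upstream
    proxy_pass http://127.0.0.1:8086;    # InfluxDB's default HTTP port
}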

sebastianarena commented 7 years ago

I don't think nginx is involved, because with a single host everything works fine. The problem starts when I add more hosts and more measurements.

In the end, I spent the last couple of hours playing with the interval, the flush interval, and the jitters. I slowed down the collection of inputs, and now everything seems to be working fine.
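
Roughly, this is the kind of change I made in the [agent] section; the exact values here are only an illustration of the direction, not a recommendation:

[agent]
  interval = "30s"           # was 10s: collect less often
  collection_jitter = "5s"   # was 0s: stagger collection across hosts
  flush_interval = "30s"     # was 10s: flush less often
  flush_jitter = "5s"        # was 0s: stagger writes so hosts don't flush at once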

Something is wrong somewhere, and possibly gzip would be an awesome addition to speed things up.

For now, I think I'm ok. THANKS

danielnelson commented 7 years ago

Let's merge this issue with #2854.