influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Service stop after logrotate #685

Closed · wgrcunha closed 8 years ago

wgrcunha commented 8 years ago

Hello,

Sometimes I see the agent running but not sending metrics or even logging. When this happens the logfile is empty, and I believe it is caused by log rotation. I added this to my logrotate config:

postrotate
        service telegraf restart > /dev/null
endscript

And now it's working fine. Is anyone else seeing the same problem?
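
For reference, a rough sketch of a full /etc/logrotate.d/telegraf stanza with that postrotate hook (the path matches the Debian package default; the rotation settings are only illustrative):

    /var/log/telegraf/telegraf.log {
        daily
        rotate 7
        missingok
        notifempty
        compress
        postrotate
            service telegraf restart > /dev/null
        endscript
    }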

(Using the Debian package.)

Thanks!

sparrc commented 8 years ago

certainly sounds like a bug, but I'd like to know if others are seeing it as well

wgrcunha commented 8 years ago

@sparrc Would it be possible to switch to syslog instead of logging to a file? Then we could control rotation, logging, etc. with syslog tools.

sparrc commented 8 years ago

sure, but I won't have time to work on this for a while. I think you could also use systemd, which logs to the journal instead of the file.
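
If you go the systemd route, the logs can then be followed with journalctl (assuming the unit is named telegraf):

    journalctl -u telegraf -f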

wgrcunha commented 8 years ago

@sparrc After doing more tests, even when using systemd the agent stops sending metrics:

Sample log: the last write from the agent is at 15:43:01, and it only starts reporting again when I restart the daemon.

2016/02/19 15:42:21 Wrote 0 metrics to output influxdb in 1.265238955s
2016/02/19 15:42:31 Wrote 0 metrics to output influxdb in 1.214569861s
2016/02/19 15:42:41 Wrote 0 metrics to output influxdb in 1.254252425s
2016/02/19 15:42:51 Wrote 0 metrics to output influxdb in 1.006064584s
2016/02/19 15:43:00 Gathered metrics, (1m0s interval), from 9 inputs in 127.702047ms
2016/02/19 15:43:01 Wrote 64 metrics to output influxdb in 1.119452438s
2016/02/19 15:44:00 Gathered metrics, (1m0s interval), from 9 inputs in 123.996917ms
2016/02/19 15:45:00 Gathered metrics, (1m0s interval), from 9 inputs in 144.462953ms
2016/02/19 15:46:00 Gathered metrics, (1m0s interval), from 9 inputs in 131.375083ms
2016/02/19 15:47:00 Gathered metrics, (1m0s interval), from 9 inputs in 122.787756ms
2016/02/19 15:48:00 Gathered metrics, (1m0s interval), from 9 inputs in 122.867636ms
2016/02/19 15:49:00 Gathered metrics, (1m0s interval), from 9 inputs in 123.180944ms
2016/02/19 15:50:00 Gathered metrics, (1m0s interval), from 9 inputs in 123.750488ms
2016/02/19 15:51:00 Gathered metrics, (1m0s interval), from 9 inputs in 136.962833ms
2016/02/19 15:52:00 Gathered metrics, (1m0s interval), from 9 inputs in 123.83557ms
2016/02/19 15:53:00 Gathered metrics, (1m0s interval), from 9 inputs in 152.097537ms
2016/02/19 15:54:00 Gathered metrics, (1m0s interval), from 9 inputs in 150.002685ms
2016/02/19 15:55:00 Gathered metrics, (1m0s interval), from 9 inputs in 125.264287ms
2016/02/19 15:56:00 Gathered metrics, (1m0s interval), from 9 inputs in 133.351036ms
2016/02/19 15:57:00 Gathered metrics, (1m0s interval), from 9 inputs in 120.823581ms
2016/02/19 15:58:00 Gathered metrics, (1m0s interval), from 9 inputs in 144.377225ms
2016/02/19 16:45:15 Starting Telegraf (version 0.10.2)
2016/02/19 16:45:15 Loaded outputs: influxdb
2016/02/19 16:45:15 Loaded inputs: swap system net netstat cpu disk diskio mem exec
2016/02/19 16:45:15 Tags enabled: cfdescriptions=* host=* team=* tool=telegraf
2016/02/19 16:45:15 Agent Config: Interval:1m0s, Debug:false, Quiet:false, Hostname:"*", Flush Interval:10s
2016/02/19 16:46:00 Gathered metrics, (1m0s interval), from 9 inputs in 175.587834ms
2016/02/19 16:46:10 Wrote 64 metrics to output influxdb in 66.437421ms
2016/02/19 16:46:20 Wrote 0 metrics to output influxdb in 20.614915ms
2016/02/19 16:46:30 Wrote 0 metrics to output influxdb in 22.260462ms
2016/02/19 16:46:40 Wrote 0 metrics to output influxdb in 18.838846ms
2016/02/19 16:46:50 Wrote 0 metrics to output influxdb in 20.438875ms
2016/02/19 16:47:00 Gathered metrics, (1m0s interval), from 9 inputs in 140.146721ms

This is version 0.10.2; I will try updating to 0.10.3 today to see. I have about 1000 hosts reporting, and this happens on random servers every day.

sparrc commented 8 years ago

@wgrcunha So you are seeing this issue even after adding the logrotate change? It looks like your instance is still logging but something is hung in the write path?

wgrcunha commented 8 years ago

@sparrc Yes. And using systemd too

sparrc commented 8 years ago

I'll try to reproduce; can you give your OS details?

If you are able to ctrl-\ the process when this is happening, that would help a lot.
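
For reference, ctrl-\ sends SIGQUIT, which makes the Go runtime dump all goroutine stacks to stderr (the log file in this case). For a daemonized process, something like this should be equivalent (the pid lookup is just a sketch):

    kill -QUIT $(pidof telegraf)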

wgrcunha commented 8 years ago

/etc/debian_version -> 7.8
uname -a -> Linux fring0013 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u1 x86_64 GNU/Linux

I tried running it as root, but got the same problem:

root     21881  0.3  0.0 199708 12984 ?        Sl   06:25   2:17 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d

Log

2016/02/19 15:42:01 Wrote 60 metrics to output influxdb in 1.333597222s
2016/02/19 15:42:03 Gathered metrics, (1m0s interval), from 8 inputs in 3.921914358s
2016/02/19 15:42:10 Wrote 2 metrics to output influxdb in 116.105263ms
2016/02/19 15:42:20 Wrote 0 metrics to output influxdb in 41.026027ms
2016/02/19 15:42:30 Wrote 0 metrics to output influxdb in 58.730717ms
2016/02/19 15:42:40 Wrote 0 metrics to output influxdb in 76.754311ms
2016/02/19 15:42:50 Wrote 0 metrics to output influxdb in 58.222168ms
2016/02/19 15:43:01 Wrote 60 metrics to output influxdb in 1.101402649s
2016/02/19 15:43:02 Gathered metrics, (1m0s interval), from 8 inputs in 2.556971171s
2016/02/19 15:44:03 Gathered metrics, (1m0s interval), from 8 inputs in 3.829782121s
2016/02/19 15:45:04 Gathered metrics, (1m0s interval), from 8 inputs in 4.069923161s
2016/02/19 15:46:02 Gathered metrics, (1m0s interval), from 8 inputs in 2.657605177s
2016/02/19 15:47:03 Gathered metrics, (1m0s interval), from 8 inputs in 3.876193623s
2016/02/19 15:48:02 Gathered metrics, (1m0s interval), from 8 inputs in 2.772723071s
2016/02/19 15:49:04 Gathered metrics, (1m0s interval), from 8 inputs in 4.275398812s
2016/02/19 15:50:02 Gathered metrics, (1m0s interval), from 8 inputs in 2.990015137s
2016/02/19 15:51:04 Gathered metrics, (1m0s interval), from 8 inputs in 4.622869253s
2016/02/19 15:52:03 Gathered metrics, (1m0s interval), from 8 inputs in 3.633749382s
2016/02/19 15:53:04 Gathered metrics, (1m0s interval), from 8 inputs in 4.105144697s
2016/02/19 15:54:03 Gathered metrics, (1m0s interval), from 8 inputs in 3.150430417s
2016/02/19 15:55:03 Gathered metrics, (1m0s interval), from 8 inputs in 3.628746145s
2016/02/19 15:56:03 Gathered metrics, (1m0s interval), from 8 inputs in 3.755732228s
2016/02/19 15:57:04 Gathered metrics, (1m0s interval), from 8 inputs in 4.275485434s
2016/02/19 15:58:03 Gathered metrics, (1m0s interval), from 8 inputs in 3.352019822s
2016/02/19 15:59:03 Gathered metrics, (1m0s interval), from 8 inputs in 3.810145085s

lsof

COMMAND    PID USER   FD   TYPE     DEVICE SIZE/OFF    NODE NAME
telegraf 21881 root  cwd    DIR      202,1     4096       2 /
telegraf 21881 root  rtd    DIR      202,1     4096       2 /
telegraf 21881 root  txt    REG      202,1 30978672 1712258 /usr/bin/telegraf
telegraf 21881 root  mem    REG      202,1  1599504  137234 /lib/x86_64-linux-gnu/libc-2.13.so
telegraf 21881 root  mem    REG      202,1   131107  137249 /lib/x86_64-linux-gnu/libpthread-2.13.so
telegraf 21881 root  mem    REG      202,1   136936  137246 /lib/x86_64-linux-gnu/ld-2.13.so
telegraf 21881 root    0r   CHR        1,3      0t0    1028 /dev/null
telegraf 21881 root    1w   CHR        1,3      0t0    1028 /dev/null
telegraf 21881 root    2w   REG      202,1   292540 1179684 /var/log/telegraf/telegraf.log
telegraf 21881 root    3r   CHR        1,9      0t0    1033 /dev/urandom
telegraf 21881 root    4u  IPv4 1724666747      0t0     TCP ********:60653->**********:https (ESTABLISHED)
telegraf 21881 root    6u  0000        0,9        0    3270 anon_inode

Reptyr

[-] Timed out waiting for child stop.
[+] Allocated scratch page: 7fee0ed9d000
[+] Opened the new tty in the child: 5
[+] Target is not a session leader, attempting to setsid.
[+] Forked a child: 23568
[+] Did setsid()
[+] Set the controlling tty
^\SIGQUIT: quit
PC=0x460cd1 m=0

goroutine 0 [idle]:
runtime.futex(0x18a20b0, 0x0, 0x0, 0x0, 0x0, 0x18a0fc0, 0x0, 0x0, 0x410b84, 0x18a20b0, ...)
    /usr/local/go/src/runtime/sys_linux_amd64.s:288 +0x21
runtime.futexsleep(0x18a20b0, 0x0, 0xffffffffffffffff)
    /usr/local/go/src/runtime/os1_linux.go:39 +0x53
runtime.notesleep(0x18a20b0)
    /usr/local/go/src/runtime/lock_futex.go:142 +0xa4
runtime.stopm()
    /usr/local/go/src/runtime/proc1.go:1136 +0x112
runtime.exitsyscall0(0xc820969680)
    /usr/local/go/src/runtime/proc1.go:2146 +0x16c
runtime.mcall(0x7fff56bb0f70)
    /usr/local/go/src/runtime/asm_amd64.s:204 +0x5b

goroutine 1 [semacquire, 94 minutes]:
sync.runtime_Semacquire(0xc8205c0afc)
    /usr/local/go/src/runtime/sema.go:43 +0x26
sync.(*WaitGroup).Wait(0xc8205c0af0)
    /usr/local/go/src/sync/waitgroup.go:126 +0xb4
github.com/influxdata/telegraf.(*Agent).gatherParallel(0xc82002a108, 0xc8205b2420, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:148 +0x145
github.com/influxdata/telegraf.(*Agent).Run(0xc82002a108, 0xc8205b2120, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:372 +0x8e6
main.main()
    /home/telebuild/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:234 +0x1ca8

goroutine 17 [syscall, 667 minutes, locked to thread]:
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1721 +0x1

goroutine 5 [syscall, 667 minutes]:
os/signal.loop()
    /usr/local/go/src/os/signal/signal_unix.go:22 +0x18
created by os/signal.init.1
    /usr/local/go/src/os/signal/signal_unix.go:28 +0x37

goroutine 11688 [chan send, 94 minutes]:
github.com/influxdata/telegraf.(*accumulator).AddFields(0xc820162600, 0x11121c8, 0x3, 0x0, 0xc8205487e0, 0xc8208ba5a0, 0x1, 0x1)
    /home/telebuild/go/src/github.com/influxdata/telegraf/accumulator.go:163 +0xcc8
github.com/influxdata/telegraf/plugins/inputs/system.(*CPUStats).Gather(0xc820159c80, 0x7fee0cda3748, 0xc820162600, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/plugins/inputs/system/cpu.go:99 +0x14aa
github.com/influxdata/telegraf.(*Agent).gatherParallel.func1(0xc8205c0af0, 0xc8205b2420, 0xc82002a108, 0x0, 0xc820159ce0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:137 +0x490
created by github.com/influxdata/telegraf.(*Agent).gatherParallel
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:141 +0x442

goroutine 35 [select, 667 minutes, locked to thread]:
runtime.gopark(0x13416f0, 0xc820511f28, 0x1117150, 0x6, 0xc820479018, 0x2)
    /usr/local/go/src/runtime/proc.go:185 +0x163
runtime.selectgoImpl(0xc820511f28, 0x0, 0x18)
    /usr/local/go/src/runtime/select.go:392 +0xa64
runtime.selectgo(0xc820511f28)
    /usr/local/go/src/runtime/select.go:212 +0x12
runtime.ensureSigM.func1()
    /usr/local/go/src/runtime/signal1_unix.go:227 +0x353
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1721 +0x1

goroutine 36 [chan receive, 667 minutes]:
main.main.func1(0xc8205b2180, 0xc8205b2120, 0xc820504fc0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:206 +0x47
created by main.main
    /home/telebuild/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:216 +0x1575

goroutine 37 [semacquire, 109 minutes]:
sync.runtime_Semacquire(0xc82000db0c)
    /usr/local/go/src/runtime/sema.go:43 +0x26
sync.(*WaitGroup).Wait(0xc82000db00)
    /usr/local/go/src/sync/waitgroup.go:126 +0xb4
github.com/influxdata/telegraf.(*Agent).flush(0xc82002a108)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:263 +0xe9
github.com/influxdata/telegraf.(*Agent).flusher(0xc82002a108, 0xc8205b2120, 0xc8205b2420, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:281 +0x23f
github.com/influxdata/telegraf.(*Agent).Run.func1(0xc820236320, 0xc82002a108, 0xc8205b2120, 0xc8205b2420)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:337 +0x7f
created by github.com/influxdata/telegraf.(*Agent).Run
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:341 +0x4b3

goroutine 11219 [select, 109 minutes]:
net/http.(*persistConn).writeLoop(0xc82026e160)
    /usr/local/go/src/net/http/transport.go:1009 +0x40c
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:686 +0xc9d

goroutine 11685 [chan send, 94 minutes]:
github.com/influxdata/telegraf.(*accumulator).AddFields(0xc8205e2380, 0x1117bf0, 0x6, 0x0, 0xc8207f9a20, 0x0, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/accumulator.go:163 +0xcc8
github.com/influxdata/telegraf/plugins/inputs/system.(*SystemStats).Gather(0x18c7c58, 0x7fee0cda3748, 0xc8205e2380, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/plugins/inputs/system/system.go:40 +0x416
github.com/influxdata/telegraf.(*Agent).gatherParallel.func1(0xc8205c0af0, 0xc8205b2420, 0xc82002a108, 0x0, 0xc820159b60)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:137 +0x490
created by github.com/influxdata/telegraf.(*Agent).gatherParallel
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:141 +0x442

goroutine 11690 [chan send, 94 minutes]:
github.com/influxdata/telegraf.(*accumulator).AddFields(0xc820640980, 0x1112708, 0x6, 0x0, 0xc8205d04b0, 0x0, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/accumulator.go:163 +0xcc8
github.com/influxdata/telegraf/plugins/inputs/system.(*DiskIOStats).Gather(0xc820159d70, 0x7fee0cda3748, 0xc820640980, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/plugins/inputs/system/disk.go:132 +0x9c7
github.com/influxdata/telegraf.(*Agent).gatherParallel.func1(0xc8205c0af0, 0xc8205b2420, 0xc82002a108, 0x0, 0xc820159dd0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:137 +0x490
created by github.com/influxdata/telegraf.(*Agent).gatherParallel
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:141 +0x442

goroutine 11366 [select, 109 minutes]:
net/http.(*persistConn).roundTrip(0xc82026e160, 0xc82000dcf0, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/transport.go:1164 +0xb0d
net/http.(*Transport).RoundTrip(0xc82008a5a0, 0xc820348460, 0xa8, 0x0, 0x0)
    /usr/local/go/src/net/http/transport.go:235 +0x530
net/http.send(0xc820348460, 0x7fee0cd9b4d8, 0xc82008a5a0, 0xc8205584b0, 0x0, 0x0)
    /usr/local/go/src/net/http/client.go:220 +0x52c
net/http.(*Client).send(0xc820159ec0, 0xc820348460, 0x4e, 0x0, 0x0)
    /usr/local/go/src/net/http/client.go:143 +0x15a
net/http.(*Client).doFollowingRedirects(0xc820159ec0, 0xc820348460, 0x13412a0, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/client.go:380 +0xbc0
net/http.(*Client).Do(0xc820159ec0, 0xc820348460, 0x28, 0x0, 0x0)
    /usr/local/go/src/net/http/client.go:178 +0x188
github.com/influxdata/influxdb/client/v2.(*client).Write(0xc8201a6180, 0x7fee0cda3778, 0xc8200b80c0, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/influxdb/client/v2/client.go:382 +0x8e5
github.com/influxdata/telegraf/plugins/outputs/influxdb.(*InfluxDB).Write(0xc820153180, 0xc82000daf0, 0x2, 0x2, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/plugins/outputs/influxdb/influxdb.go:148 +0x3aa
github.com/influxdata/telegraf/internal/models.(*RunningOutput).Write(0xc820014300, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/internal/models/running_output.go:60 +0xb0
github.com/influxdata/telegraf.(*Agent).flush.func1(0xc82000db00, 0xc820014300)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:255 +0x7a
created by github.com/influxdata/telegraf.(*Agent).flush
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:260 +0xc0

goroutine 11689 [chan send, 94 minutes]:
github.com/influxdata/telegraf.(*accumulator).AddFields(0xc82053d280, 0x1112700, 0x4, 0x0, 0xc8202489f0, 0x0, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/accumulator.go:163 +0xcc8
github.com/influxdata/telegraf/plugins/inputs/system.(*DiskStats).Gather(0xc8201a6080, 0x7fee0cda3748, 0xc82053d280, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/plugins/inputs/system/disk.go:63 +0x8e8
github.com/influxdata/telegraf.(*Agent).gatherParallel.func1(0xc8205c0af0, 0xc8205b2420, 0xc82002a108, 0x0, 0xc820159d40)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:137 +0x490
created by github.com/influxdata/telegraf.(*Agent).gatherParallel
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:141 +0x442

goroutine 11687 [chan send, 92 minutes]:
github.com/influxdata/telegraf.(*accumulator).AddFields(0xc8201a62c0, 0x1115308, 0x7, 0x0, 0xc8203691d0, 0x0, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/accumulator.go:163 +0xcc8
github.com/influxdata/telegraf/plugins/inputs/system.(*NetStats).Gather(0xc820434550, 0x7fee0cda3748, 0xc8201a62c0, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/plugins/inputs/system/netstat.go:61 +0x1063
github.com/influxdata/telegraf.(*Agent).gatherParallel.func1(0xc8205c0af0, 0xc8205b2420, 0xc82002a108, 0x0, 0xc820159c50)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:137 +0x490
created by github.com/influxdata/telegraf.(*Agent).gatherParallel
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:141 +0x442

goroutine 11218 [IO wait, 110 minutes]:
net.runtime_pollWait(0x7fee09cd1f28, 0x72, 0xc82000c160)
    /usr/local/go/src/runtime/netpoll.go:157 +0x60
net.(*pollDesc).Wait(0xc8202ee530, 0x72, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
net.(*pollDesc).WaitRead(0xc8202ee530, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:78 +0x36
net.(*netFD).Read(0xc8202ee4d0, 0xc820479000, 0x1000, 0x1000, 0x0, 0x7fee0cd97050, 0xc82000c160)
    /usr/local/go/src/net/fd_unix.go:232 +0x23a
net.(*conn).Read(0xc82011e028, 0xc820479000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:172 +0xe4
crypto/tls.(*block).readFromUntil(0xc82042c780, 0x7fee0cda1d18, 0xc82011e028, 0x5, 0x0, 0x0)
    /usr/local/go/src/crypto/tls/conn.go:455 +0xcc
crypto/tls.(*Conn).readRecord(0xc8200d1b80, 0x1341717, 0x0, 0x0)
    /usr/local/go/src/crypto/tls/conn.go:540 +0x2d1
crypto/tls.(*Conn).Read(0xc8200d1b80, 0xc820013000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/crypto/tls/conn.go:901 +0x167
net/http.noteEOFReader.Read(0x7fee0cda3598, 0xc8200d1b80, 0xc82026e1b8, 0xc820013000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/transport.go:1370 +0x67
net/http.(*noteEOFReader).Read(0xc8205fac00, 0xc820013000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    <autogenerated>:126 +0xd0
bufio.(*Reader).fill(0xc8205b3560)
    /usr/local/go/src/bufio/bufio.go:97 +0x1e9
bufio.(*Reader).Peek(0xc8205b3560, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/bufio/bufio.go:132 +0xcc
net/http.(*persistConn).readLoop(0xc82026e160)
    /usr/local/go/src/net/http/transport.go:876 +0xf7
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:685 +0xc78

rax    0xca
rbx    0x0
rcx    0xffffffffffffffff
rdx    0x0
rdi    0x18a20b0
rsi    0x0
rbp    0x1
rsp    0x7fff56bb0e80
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x9
r13    0x133ba7c
r14    0x30
r15    0x8
rip    0x460cd1
rflags 0x286
cs     0x33
fs     0x0
gs     0x0
sparrc commented 8 years ago

@wgrcunha what is your telegraf version?

wgrcunha commented 8 years ago

@sparrc 0.10.2

wgrcunha commented 8 years ago

@sparrc my bad, this one is 0.10.1

sparrc commented 8 years ago

got it, that makes more sense, thanks for the great debug output BTW

sparrc commented 8 years ago

Looks like I haven't set a timeout for the InfluxDB writes, which are hanging; that is definitely a bug.

As to the cause of the write hang, it could be that the db is overstressed with http write requests.

@wgrcunha if you have 1000s of hosts, you probably want to be using the flush_jitter config option; are you doing this? Until I get the timeout fixed in a release, I think you have two options:

  1. Use UDP on InfluxDB and Telegraf
  2. Use the flush_jitter telegraf config option to spread your HTTP writes over a larger span of time (if you are not using this, then all of your hosts are trying to write to influx at exactly the same time); see the sketch below.
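
A minimal telegraf.conf sketch of that second option (the jitter value is only an illustration; for option 1, the influxdb output would instead be pointed at a udp:// URL, with InfluxDB's UDP listener enabled):

    [agent]
      interval = "1m"
      flush_interval = "10s"
      ## Random jitter added to each flush so hosts don't all write at once
      flush_jitter = "10s"
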
wgrcunha commented 8 years ago

flush_jitter is currently 0s; I will change it and watch. I'll send feedback in a few days.

Thanks!

sparrc commented 8 years ago

That was a little inaccurate: there is actually a timeout parameter for the influxdb output, so you should set that to something reasonable (say "5s" or "10s"?).

The current default config of having no timeout was a pretty dumb default on my part; I will change it to a reasonable default.
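
Something along these lines in the output section (a sketch; the URL and database are placeholders for your own settings):

    [[outputs.influxdb]]
      urls = ["http://localhost:8086"]
      database = "telegraf"
      ## Abort writes that take longer than this instead of hanging forever
      timeout = "5s"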

sparrc commented 8 years ago

Just to be clear, I would recommend doing both:

  1. Set flush_jitter to at least 10s in your case (because you have 1000s of hosts)
  2. Set timeout under outputs.influxdb to "5s" so that hangs can't occur.

wgrcunha commented 8 years ago

Changing it now; I can give feedback in a few days.

tyvm for the fast response :)

wgrcunha commented 8 years ago

@sparrc The feedback is very positive: everything is OK now using that configuration :)