gliderlabs / logspout

Log routing for Docker container logs
MIT License

Is there a health check of any kind on logspout? #207

Closed: MaxFlanders closed this issue 7 years ago

MaxFlanders commented 8 years ago

I have logspout set to publish its logs to a port, and typically I can view its output by curling that port; however, I am currently getting no response. Restarting the container will typically solve this problem, but I'm trying to find a root cause. Is there a way to health check an existing logspout container to see why it is not responding before I restart it?
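For context, the kind of external check I can do by hand looks roughly like this; the port and timeout are assumptions specific to my setup, and nothing like this is built into logspout itself:

```shell
# Sketch of an external health probe; the port (8000) and the timeout are
# assumptions, not logspout defaults. Caveat: a healthy logspout whose
# containers happen to be quiet also yields no bytes, so this only works
# if something logs within the probe window.
probe_logspout() {
  host_port="$1"
  # Read the log stream briefly and count whether any bytes arrived.
  bytes=$(curl --silent --max-time 5 "http://${host_port}/logs" | head -c 1 | wc -c)
  [ "${bytes}" -gt 0 ]
}

if probe_logspout "localhost:8000"; then
  echo "logspout is streaming"
else
  echo "no output from logspout; a restart may be needed"
fi
```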

josegonzalez commented 8 years ago

Is there no response from docker logs CID?

MaxFlanders commented 8 years ago

There was a response from docker logs in the cases that I saw.

rosskukulinski commented 8 years ago

Also now watching this issue. I'm seeing similar behavior. Available and happy to help debug if you can suggest a place for me to start.

I noticed I'm using an old version; I'll update to the latest release tomorrow and see what happens.

chrishiestand commented 8 years ago

I've been having (probably) the same problem for about two weeks; I've been collecting data, otherwise I would have submitted the issue sooner. I'm using gliderlabs/logspout as the image for a Kubernetes v1.2.4 DaemonSet, and it runs fine for some number of days before logs cease. There are no error messages; logspout just seems to stop working (though the container and processes stay running). I've got plenty of space left on papertrail, and if I restart the containers/pods, logging resumes normally, so I think we can rule out papertrail as the source of the problem.

I will try to leave the broken pod in its current broken state so I can collect more information from it or the node; please send me any requests for info you have.

When a container breaks, the most noticeable issue is that logs cease to be sent to papertrail. The container stays running. If I port-forward to the pod so I can connect to the container's HTTP server, my curl command just waits and never outputs anything:

kubectl port-forward logspout-papertrail-rsjok 9002:80

curl localhost:9002/logs # curl stays running, no output

The above commands work fine and output logs when I connect to a working logspout container.

I can log into the kubernetes node and strace the logspout pid:

root@gke-cluster-default-pool-4efa2519-rpot:~# strace -p 27341 -ff
Process 27341 attached with 7 threads - interrupt to quit
[pid 27348] futex(0xc82000d310, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 27351] epoll_wait(4,  <unfinished ...>
[pid 26309] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 27352] futex(0xc8200d8590, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 27347] futex(0xc82000ce90, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 27341] futex(0xabbf90, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 27346] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 26309] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 26309] clock_gettime(CLOCK_MONOTONIC, {1525060, 176301571}) = 0
[pid 26309] futex(0xabb660, FUTEX_WAKE, 1) = 1
[pid 27346] <... restart_syscall resumed> ) = 0
[pid 26309] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 27346] sched_yield( <unfinished ...>
[pid 26309] <... clock_gettime resumed> {1525060, 176491143}) = 0
[pid 27346] <... sched_yield resumed> ) = 0
[pid 26309] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 27346] futex(0xabb5a0, FUTEX_WAKE, 1 <unfinished ...>
[pid 26309] <... clock_gettime resumed> {1468265339, 321153236}) = 0
[pid 27346] <... futex resumed> )       = 0
[pid 26309] futex(0xc8200d8590, FUTEX_WAKE, 1 <unfinished ...>
[pid 27346] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 26309] <... futex resumed> )       = 1
[pid 27352] <... futex resumed> )       = 0
[pid 27346] <... clock_gettime resumed> {1525060, 176702791}) = 0
[pid 26309] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 27352] select(0, NULL, NULL, NULL, {0, 100} <unfinished ...>
[pid 27346] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 26309] <... clock_gettime resumed> {1525060, 176861054}) = 0
[pid 27346] <... clock_gettime resumed> {1468265339, 321421026}) = 0
[pid 26309] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 27346] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
...

logspout just seems to be polling and not noticing any log events.

chrishiestand commented 8 years ago

The docker image ID of my current broken container is d36a70e7d3e0b189815821631e38065f638b4ce35c632fae74998ea1af2ab67a

josegonzalez commented 8 years ago

That's interesting. Do you notice any other issues around the time it stops shipping data?

We could/should probably add statsd support so we can pinpoint how much data is being shipped and when things stop. I'd happily review such a PR and merge it for our next release.

chrishiestand commented 8 years ago

@josegonzalez No other issues around the time. This cluster is a low-traffic cluster so it should be pretty easy to tell if something is spiking or otherwise not working correctly.

chrishiestand commented 8 years ago

In other words, I think this is an issue of basic functionality breaking, not a resource related edge case.

josegonzalez commented 8 years ago

Are you using latest master or something else? We may have fixed your problem in https://github.com/gliderlabs/logspout/pull/191.

chrishiestand commented 8 years ago

I've been using :latest and not :master. I'll switch and see if that fixes the problem, thanks.

chrishiestand commented 8 years ago

@josegonzalez So I've switched to the docker master tag (image id: e3a7c3c77162f1fa920194aeb890e4f5ea073b66d9202e2b5db258eae6bef3ab) and I'm still having problems, though they might be different ones. The end result is similar: logs are not getting from the containers to papertrail.

The behavior I'm seeing is that only a small number of containers' logs get through. As I'm writing this, only one container's logs, from one host, are getting into papertrail. If I connect directly via HTTP to the logspout container, I see the same behavior. If I use kubectl to read the docker logs, I can read them as expected; logging is still happening, but logspout doesn't seem to know about it.

I have recently returned to papertrail at my new job and I'm surprised to be having these issues. I never had a single problem with the now deprecated progrium/logspout container.

chrishiestand commented 8 years ago

I switched back to progrium/logspout and the problems went away. I'll stop contributing to this thread since it's not directly related to the original topic. Sorry for hijacking it.

MaxFlanders commented 8 years ago

I am still seeing it intermittently, unfortunately. Today I saw a new error, though: I tried to curl to follow the log output of a container and got this response:

curl: (18) transfer closed with outstanding read data remaining

which was new. Restarting the logspout container fixed it, though.

Due to these issues, we are unfortunately considering an alternate approach: using logback to write logs directly to Kafka. :/

josegonzalez commented 8 years ago

@chrishiestand interesting. gliderlabs/logspout is essentially progrium/logspout, except further along in development. I'm not personally sure when things "broke" in such a way that we'd no longer be handling those cases, but unfortunately test coverage isn't great. There are a few PRs that should help ameliorate the issue, but they are still in a state of flux.

If someone would like to fund the work to investigate/fix any such issues, I'd be happy to talk, but otherwise this will have to take a backseat to other, paid work that the maintainers have.

edude03 commented 8 years ago

I'm using logspout on Marathon, and having this would alleviate #215 for me (especially combined with #201). Anyone interested in a bounty?

ebr commented 8 years ago

After struggling with suddenly missing logs for the past several weeks, I'm quite sure at least one of the causes is log rotation when using the json-file driver. We switched to the journald logging driver for docker, and the issue went away.

@edude03 give the brodsky/logspout:latest fork a try until #201 is merged - it's essentially a build of #201, which I just rebased.

BertHartm commented 7 years ago

I'm hitting this issue when using logspout on kubernetes. I switched to UDP instead of TCP, and the issue seems to have been mitigated somewhat: it's gone from seeming to stop within 24 hours to having been up for a few days. I'm wondering if it's volume-related, since I don't normally hit this on quieter machines, but the kubernetes setup I'm using is much more active.

I had been using the syslog+tls:// scheme in the URI passed as an argument, and changed it to syslog://.
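Concretely, the change amounts to swapping the URI scheme in the logspout argument; the host and port below are placeholders, and this assumes (per the logspout README, if I recall its defaults correctly) that the syslog adapter takes its transport from the scheme, with plain syslog:// defaulting to UDP:

```shell
# The URI scheme selects the syslog transport; host/port are placeholders.
TLS_URI="syslog+tls://logs.example.com:514"  # TCP with TLS, where the stalls occurred
UDP_URI="syslog://logs.example.com:514"      # plain syslog, UDP by default

# Typical invocation (echoed here rather than executed):
echo docker run -v /var/run/docker.sock:/var/run/docker.sock \
  gliderlabs/logspout "${UDP_URI}"
```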

michaelshobbs commented 7 years ago

This issue has meandered a bit, and several distinct problems are discussed here. I'm going to close it; if you're still having a specific issue, please open a dedicated issue for it.

Note: there is a known issue where log flows halt when using docker's --log-opt max-size= option.
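For anyone searching later, the option in question is docker's json-file rotation setting; the sizes and image below are examples only:

```shell
# Containers started with json-file rotation like this have been associated
# with logspout log flows halting when the log file rotates. Values and the
# nginx image are examples only, not a recommendation.
LOG_OPTS="--log-driver json-file --log-opt max-size=10m --log-opt max-file=3"

# Shown (echoed), not executed:
echo docker run -d ${LOG_OPTS} nginx
```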

wkruse commented 6 years ago

Related to #216.

jmarcos-cano commented 6 years ago

I have created this gist to show how I solved this by checking whether logspout is sending outgoing traffic (TCP or UDP, which covers my current needs). I'm using this as a Swarm healthcheck.
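The gist itself is not reproduced here, but a sketch of the idea, assuming health is judged by an established outbound connection to the syslog destination port (514 below is a placeholder), could look like:

```shell
# Sketch of an outgoing-traffic healthcheck. DEST_PORT is a placeholder for
# your syslog destination port, and the parsing assumes the common Linux
# `netstat -tn` output format.
DEST_PORT=514

has_outgoing() {
  # $1 is `netstat -tn`-style output; healthy if any ESTABLISHED
  # connection targets the destination port.
  printf '%s\n' "$1" | grep "ESTABLISHED" | grep -q ":${DEST_PORT} "
}

# In a real Swarm healthcheck this would read live data, e.g.:
#   has_outgoing "$(netstat -tn)" || exit 1
```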