gliderlabs / logspout

Log routing for Docker container logs

CPU usage on dockerd 100% #246

Status: Open. runningman84 opened this issue 7 years ago

runningman84 commented 7 years ago

If I start logspout-logstash using this docker-compose service, the CPU usage of dockerd spikes to 100% and stays there...

  logspout:
    image: bekt/logspout-logstash
    mem_limit: 256m
    restart: on-failure
    depends_on:
      - logstash
    environment:
      ROUTE_URIS: logstash://logstash:5000
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

before starting logspout

 7015 root      20   0 1423204  80956  29440 S 106.3  1.3   1:25.72 dockerd       

after starting logspout

 7015 root      20   0 1423204  80972  29440 S   0.3  1.3   1:58.57 dockerd  

garceri commented 7 years ago

Experiencing the same issue here:

$logspout --version
v3.2-dev-custom

$docker version
Client:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        Wed Oct 26 23:26:11 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        Wed Oct 26 23:26:11 2016
 OS/Arch:      linux/amd64

[root@ip-XXXXXXX rancher]# docker info
Containers: 56
 Running: 44
 Paused: 0
 Stopped: 12
Images: 89
Server Version: 1.12.3
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local rancher-nfs
 Network: null bridge overlay host
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 4.4.24-rancher
Operating System: RancherOS v0.7.1
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 14.94 GiB
Name: ip-XXXXXXXXXXX.ec2.internal
ID: DVSV:NLW5:VD5U:UOYY:FIOB:4RTH:7NGQ:GIBO:KKTR:VPYO:J6A2:4SZE
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8

Logspout container logs with DEBUG enabled show a constant stream of:

2017/01/26 17:37:44 pump.pumpLogs(): 495069bafc93 stopped
2017/01/26 17:37:45 pump.pumpLogs(): 495069bafc93 started
2017/01/26 17:37:45 pump.pumpLogs(): 25635c95f804 stopped
2017/01/26 17:37:45 pump.pumpLogs(): 25635c95f804 started
2017/01/26 17:37:46 pump.pumpLogs(): 495069bafc93 stopped
2017/01/26 17:37:46 pump.pumpLogs(): 495069bafc93 started
2017/01/26 17:37:46 pump.pumpLogs(): a27d4bfb3a41 stopped
2017/01/26 17:37:46 pump.pumpLogs(): a27d4bfb3a41 started
2017/01/26 17:37:46 pump.pumpLogs(): 32e734dcf238 stopped
2017/01/26 17:37:46 pump.pumpLogs(): 32e734dcf238 started
2017/01/26 17:37:47 pump.pumpLogs(): 495069bafc93 stopped
2017/01/26 17:37:47 pump.pumpLogs(): 495069bafc93 started
2017/01/26 17:37:48 pump.pumpLogs(): 495069bafc93 stopped
2017/01/26 17:37:48 pump.pumpLogs(): a27d4bfb3a41 stopped
2017/01/26 17:37:48 pump.pumpLogs(): 495069bafc93 started
2017/01/26 17:37:48 pump.pumpLogs(): a27d4bfb3a41 started
mattatcha commented 7 years ago

@pythianarceri Are you able to provide a docker-compose.yml I can use to reproduce the issue?

garceri commented 7 years ago

We use a custom-built image compiled with the logstash output module:

logspout:
    image: pythiant9shared/logspout-logstash
    environment:
      DEBUG: '1'
      ROUTE_URIS: logstash://logstash:5000
    stdin_open: true
    external_links:
    - logstash/logstash-collector:logstash
    volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    tty: true
    labels:
      io.rancher.scheduler.global: 'true'
      io.rancher.container.hostname_override: container_name
garceri commented 7 years ago

Any luck getting this fixed, or at least identifying the issue? Is it normal for the pumps to be started/stopped all the time? We are running under RancherOS 1.3.3, just in case that matters.

mattdodge commented 7 years ago

Same thing here. I end up seeing a lot of these messages in /var/log/docker.log

time="2017-03-13T21:36:26.491710944Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value"
time="2017-03-13T21:36:26.514911587Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value"
time="2017-03-13T21:36:26.547530212Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value"

It's happened a couple times. Sometimes that "invalid character" is different but the gist of the log message is the same. We see dockerd up around 100% CPU and a constant stream of those errors.

byronwilliams commented 7 years ago

Hi all,

We've managed to reproduce the same issue that @pythianarceri reported on Jan 26. To add some detail: running docker logs -f {ContainerID} does not actually follow the logs; it exits straight away. That would be consistent with pumpLogs starting and stopping because the HTTP connection is being closed for some reason.

This is reproducible on the latest master commit (7c22df4e9d780959b266d7a088ff8012171edbfb) as well as 513b74c70a31c20294507f3fb21c34e52b2df48e and 8b98313954b3283d7203d1c3984b5b69ae307e0b.

Our container runs a few processes inside, all of which have PYTHONUNBUFFERED=1.

Whilst I consider this to be a bug within our container (it is not outputting to stdout and stderr correctly), it also makes our Docker nodes perform really badly when the containers are scheduled.

Could we add some kind of counter when the pump starts, so that if it restarts {n} times within a second it backs off?
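
As an aside, a quick way to confirm the exits-straight-away behaviour on a suspect container is a loop like the one below. This is only an illustrative shell sketch; CONTAINER_ID is a placeholder.

# If the follow stream keeps closing within a second or two, that container
# is the one driving the pump stop/start loop.
for i in 1 2 3 4 5; do
  start=$(date +%s)
  docker logs --follow CONTAINER_ID > /dev/null 2>&1
  echo "attempt $i: follow stream closed after $(( $(date +%s) - start ))s"
done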

# docker version
Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.4
 Git commit:   60ccb22
 Built:        Mon Mar 13 18:07:51 UTC 2017
 OS/Arch:      linux/arm

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.4
 Git commit:   60ccb22
 Built:        Mon Mar 13 18:07:51 UTC 2017
 OS/Arch:      linux/arm
 Experimental: false
woutor commented 7 years ago

Same issue here. Reverted back to docker image v3.1 and logging and CPU usage is back to normal.

sebgl commented 7 years ago

Same issue, we also reverted to docker image v3.1.

We are running Docker version 17.04.0-ce, build 4845c56, and deploying our logspout container with the new swarm mode ('docker stack deploy'). We're not using logstash, but forward logs to a remote syslog directly.

Here is the content of our docker-compose.yml file:

version: '3'

services:
  logspout:
    image: gliderlabs/logspout
    environment:
      SYSLOG_STRUCTURED_DATA: "- $SYSLOG_CONF"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    ports:
      - 10888:80
    command: syslog+tls://$LOGS_ENDPOINT
    deploy:
      mode: global
gaieges commented 7 years ago

I have the same problem, using the most recent master image (ef55cfcceda7) and version v3.3-dev on both docker 17.04.0-ce and 1.13.1.

It's fairly difficult to reproduce, and seems to occur only after a period of time - I'm not sure whether it's a result of complex log messages (or potentially unexpected charsets), or just the volume of messages shipped off.

During the time that it's at high CPU usage, I have also seen a number of weird messages like the ones mattdodge described.

@woutor @sebgl after you reverted back to version 3.1 - did you see the problem pop up again?

woutor commented 7 years ago

@gaieges: Nope, with 3.1 the CPU usage remains normal.

gdurr commented 7 years ago

Same issue here with Docker 17.05.0-ce and logspout 3.3. We can't go back to 3.1 because we need the label filtering feature.

avoinea commented 7 years ago

Same issue here. Reverted back to docker image v3.1 and logging and CPU usage is back to normal.

michaelshobbs commented 7 years ago

I'm using v3.3 in production with papertrail. Is there anybody experiencing this issue that is NOT using logstash? I'm thinking the problem might be in that module...

gaieges commented 7 years ago

@michaelshobbs I don't use the logstash module, I just use the tcp+syslog output.

michaelshobbs commented 7 years ago

Interesting....anybody have any thoughts on this one?

SqiSch commented 7 years ago

Same issue here, but I cannot confirm the high CPU load. Only a lot of log lines in /var/log/syslog, and the pump.pumpLogs() messages when debugging is enabled.

I am also using the logstash adapter.

.....
165+02:00" level=error msg="Error streaming logs: invalid character 'l' after object key:value pair"
Jun 16 11:55:18 xyz docker[11404]: time="2017-06-16T11:55:18.307515574+02:00" level=error msg="Error streaming logs: invalid character 'l' after object key:value pair"
Jun 16 11:55:18 xyz docker[11404]: time="2017-06-16T11:55:18.432245150+02:00" level=error msg="Error streaming logs: invalid character 'l' after object key:value pair"
....
mrbobbytables commented 7 years ago

We encountered the same issue today, randomly, on one of our hosts. Unfortunately I do not have any more info than what others have reported. Docker 17.03.1-ce, RancherOS v1.0.2, logspout configured to send to syslog+tcp.

sfrique commented 7 years ago

Hello folks,

I was having the same problem:

Jun 28 14:07:20 xxx docker[7686]: time="2017-06-28T14:07:20.892703572Z" level=error msg="Error streaming logs: invalid character 'l' after object key:value pair"
Jun 28 14:07:20 xxx docker[7686]: time="2017-06-28T14:07:20.894972987Z" level=error msg="Error streaming logs: invalid character 'l' after object key:value pair"

It turned out to be a corrupt json-file.log. It happened when a disk filled up, so the json-file.log was truncated mid-write. I had something like this in my json-file.log after the disk space issue was fixed and Docker was running again:

{"log":"    2017/06/28 07:46:00 [INFO] consul: adding WAN ser{"log":"    2017/06/28 07:46:00 [INFO] consul: adding WAN server xxx.dc1 (Addr: 10.240.0.21:8300) (DC: dc1)\n","stream":"stdout","time":"2017-06-28T07:46:00.5738544Z"}

So it isn't valid JSON and it breaks docker logs; somehow it makes logspout enter a loop and makes Docker print those lines.

If you run docker logs $container on the container with that corrupt json-file.log, you will get one line of the error.

So look for corrupted/badly formatted json-file.log files.
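
A minimal sketch of that check, assuming jq is installed and you have a root shell on the host (the paths are the json-file driver defaults); it prints the file and line number of every log line that does not parse as JSON:

for f in /var/lib/docker/containers/*/*-json.log; do
  n=0
  while IFS= read -r line; do
    n=$((n + 1))
    # A line that fails to parse here is a line that will break `docker logs`.
    printf '%s\n' "$line" | jq . > /dev/null 2>&1 || echo "$f: line $n is not valid JSON"
  done < "$f"
done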

I just removed that line and everything was working normally again.

Hope that helps.

gaieges commented 7 years ago

On @sfrique 's note - I wonder if using journald log driver (if possible) instead of json-file would help this?

alhafoudh commented 7 years ago

@gaieges I tried with journald, same problem

pauln commented 7 years ago

We're seeing the same thing - 100% CPU usage when running logspout (pumping to logstash). Enabling logspout's debug logs showed rapid repetition of log entries such as:

2017/07/20 23:57:20 pump.pumpLogs(): fb6033455fd0 stopped
2017/07/20 23:57:20 pump.pumpLogs(): fb6033455fd0 started

Initially, we were seeing this on most of our hosts, with two container IDs showing up on each. The containers in question were Rancher's metadata and DNS containers. We upgraded Rancher from 1.6.2 to 1.6.5 and upgraded all the system services, and logspout now seems to run happily on most of our hosts. However, we're still getting the stopped/started log spam (and corresponding CPU load) on our DB hosts - with only a single container ID flapping: mongodb.

garceri commented 7 years ago

The scenario described by @pauln is exactly what we are experiencing (we haven't had a chance to upgrade to 1.6.5 yet). Out of curiosity, @pauln, what Docker version are you running on your hosts?

pauln commented 7 years ago

@garceri, we're currently running Docker 17.03 CE, though we're looking at upgrading to 17.06 CE in the near future.

robcza commented 7 years ago

My thoughts on this issue:

2017/07/26 07:09:48 pump.pumpLogs(): 21c90bb7ac5c stopped
2017/07/26 07:09:48 pump.pumpLogs(): 21c90bb7ac5c started
2017/07/26 07:09:48 pump.pumpLogs(): 21c90bb7ac5c stopped
2017/07/26 07:09:48 pump.pumpLogs(): 21c90bb7ac5c started
michaelshobbs commented 7 years ago

@robcza thanks for the additional details. Can you provide a full debug log, from start to stop, of a logspout container that has entered this state? I'm looking for any other clues here.

To your question about difference between versions, we tag each release and thus you can browse the history by comparing tags. Additionally, we have a CHANGELOG.md with convenient links to compare each version.

ko-christ commented 7 years ago

Same for me on docker 17.03.1-ce using this image bekt/logspout-logstash:latest.

I have managed to get a full debug log since logspout container creation. There is a constant start / stop stream for c21bde2e91e7.
I had this issue on other machines as well with various services.

# docker logs elk-logspout 
2017/07/28 10:50:57 setting allowTTY to: false
# logspout v3.3-dev-custom by gliderlabs
# adapters: raw udp tcp logstash
# options : debug:True persist:/mnt/routes
# jobs    : http[]:80 pump routes
# routes  :
#   ADAPTER ADDRESS     CONTAINERS  SOURCES OPTIONS
#   logstash+tcpprod00:1111             map[]
2017/07/28 10:50:57 pump.Run(): using inactivity timeout:  0s
2017/07/28 10:50:57 pump.pumpLogs(): 4dd58c94144a ignored: environ ignore
2017/07/28 10:50:57 pump.pumpLogs(): 741c807703d4 started
2017/07/28 10:50:57 pump.pumpLogs(): 03c007defa61 started
2017/07/28 10:50:57 pump.pumpLogs(): 0398b9b5542e started
2017/07/28 10:50:57 pump.pumpLogs(): 0b1b04fd17a9 started
2017/07/28 10:50:57 pump.pumpLogs(): 18b4f33a6b1d started
2017/07/28 10:50:57 pump.pumpLogs(): 18cd5d3b88cb started
2017/07/28 10:50:57 pump.pumpLogs(): dfebc3118ce3 started
2017/07/28 10:50:57 pump.pumpLogs(): e12708a548c4 started
2017/07/28 10:50:57 pump.pumpLogs(): 78423dfc9805 started
2017/07/28 10:50:57 pump.pumpLogs(): 872c498a0c2f started
2017/07/28 10:50:57 pump.pumpLogs(): 2f78048ee356 started
2017/07/28 10:50:57 pump.pumpLogs(): bcedcc0cd5d2 started
2017/07/28 10:50:57 pump.pumpLogs(): 0b8e0146d61c started
2017/07/28 10:50:57 pump.pumpLogs(): a1c7966d4ad3 started
2017/07/28 10:50:57 pump.pumpLogs(): d49788286a32 started
2017/07/28 10:50:57 pump.pumpLogs(): 41c5e05d0019 started
2017/07/28 10:50:57 pump.pumpLogs(): ce7030af1846 started
2017/07/28 10:50:57 pump.pumpLogs(): 78cff52b2cc8 started
2017/07/28 10:50:57 pump.pumpLogs(): c5b69f63a3e8 started
2017/07/28 10:50:57 pump.pumpLogs(): 52ab8eb7e542 started
2017/07/28 10:50:57 pump.pumpLogs(): f26b009798f4 started
2017/07/28 10:50:57 pump.pumpLogs(): 7b1f7d90859a started
2017/07/28 10:50:57 pump.pumpLogs(): a5b864e479b7 started
2017/07/28 10:50:57 pump.pumpLogs(): 08581c7c3829 started
2017/07/28 10:50:57 pump.pumpLogs(): 9e4497a0dcaf started
2017/07/28 10:50:57 pump.pumpLogs(): 227ca66d7f83 started
2017/07/28 10:50:57 pump.pumpLogs(): 6f6b327b7b9e started
2017/07/28 10:50:57 pump.pumpLogs(): 846ba10f0ee7 started
2017/07/28 10:50:57 pump.pumpLogs(): e672bc2d0fe2 started
2017/07/28 10:50:57 pump.pumpLogs(): f1d48ee0211f started
2017/07/28 10:50:57 pump.pumpLogs(): f8fdca2aff57 started
2017/07/28 10:50:57 pump.pumpLogs(): 5856ed01b7ff started
2017/07/28 10:50:57 pump.pumpLogs(): 57e77510b9de started
2017/07/28 10:50:57 pump.pumpLogs(): a05333be7c0f started
2017/07/28 10:50:57 pump.pumpLogs(): dcab95f7ca2f started
2017/07/28 10:50:57 pump.pumpLogs(): f28130fa353a started
2017/07/28 10:50:57 pump.pumpLogs(): 464d4d0b24bd started
2017/07/28 10:50:57 pump.pumpLogs(): 139fccfc578f started
2017/07/28 10:50:57 pump.pumpLogs(): 341f311f43c0 started
2017/07/28 10:50:57 pump.pumpLogs(): f08a73de23ab started
2017/07/28 10:50:57 pump.pumpLogs(): 7e88c606f51b started
2017/07/28 10:50:57 pump.pumpLogs(): c21bde2e91e7 started
2017/07/28 10:50:57 pump.pumpLogs(): 52fe6826666a started
2017/07/28 10:51:02 pump.pumpLogs(): c21bde2e91e7 stopped
2017/07/28 10:51:02 pump.pumpLogs(): c21bde2e91e7 started
2017/07/28 10:51:06 pump.pumpLogs(): c21bde2e91e7 stopped
2017/07/28 10:51:06 pump.pumpLogs(): c21bde2e91e7 started
2017/07/28 10:51:09 pump.pumpLogs(): c21bde2e91e7 stopped
2017/07/28 10:51:09 pump.pumpLogs(): c21bde2e91e7 started
2017/07/28 10:51:11 pump.pumpLogs(): c21bde2e91e7 stopped
2017/07/28 10:51:11 pump.pumpLogs(): c21bde2e91e7 started
2017/07/28 10:51:14 pump.pumpLogs(): c21bde2e91e7 stopped
2017/07/28 10:51:14 pump.pumpLogs(): c21bde2e91e7 started
michaelshobbs commented 7 years ago

Thanks @ko-christ for the extra detail! Next time this happens it would be great to see if there's anything odd about the output of the container with the "thrashing" log pump (i.e. docker logs c21bde2e91e7 in this particular case).

Same request to others that have noticed the stop/start thrash symptom.

shahar-davidson commented 7 years ago

Hi all,

We've just encountered this issue recently (logspout starts and CPU jumps to ~100%). Here are my insights after investigating it for a couple of hours.

In our environment, this was caused by malformed JSON output in the container logs (when using json-file logging). What apparently happened is that the machine ran out of space (we had a clear indication of that) and log lines could not be written properly, so we got JSON log lines that were cut in the middle. When that happens, docker will print out the following whenever someone tries to fetch the logs (even a simple docker logs command will reproduce this):

"Error streaming logs: invalid character 'l' after object key:value pair"

Apparently, docker will print this error and stop returning log lines, even if there are many more lines after the corrupt JSON line. On our machine we had about a dozen services logging constantly, and this JSON line corruption was found in most of them. It seems that docker encounters this and emits an error (https://github.com/moby/moby/blob/8af4db6f002ac907b6ef8610b237879dfcaa5b7a/daemon/logs.go#L109), yet logspout immediately starts requesting logs again, and docker again bombards the docker service logs with the above error message (tens per second).

Possible workarounds for such an issue (I didn't verify these options yet):

  1. Setting the BACKLOG env var and restarting services
  2. Stopping the service and moving/truncating the existing log files so that they would start off clean (see the sketch just below).
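
A rough sketch of option 2 for a single container, assuming its ID is already known and losing its existing log content is acceptable (CONTAINER_ID is a placeholder):

docker stop CONTAINER_ID
# Truncate the json-file log in place so it starts off clean.
truncate -s 0 /var/lib/docker/containers/CONTAINER_ID*/CONTAINER_ID*-json.log
docker start CONTAINER_ID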

So this actually seems to be a docker issue. The best way to deal with it, IMHO, is to have docker detect such cases, roll the log file, and put the corrupt log file aside (while emitting a meaningful warning about it).

One thing that puzzles me is why reverting to v3.1 "fixes" this issue (I didn't try reverting back to v3.1 yet, I'm using v3.3).

pythiancastaneda commented 7 years ago

Hello, I am having the same issue. Docker log is flooded with messages like this:

time="2017-08-30T22:30:31.922624624Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value" 
time="2017-08-30T22:30:31.977376555Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value" 
time="2017-08-30T22:30:32.172320949Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value" 
time="2017-08-30T22:30:32.570219033Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value" 
time="2017-08-30T22:30:32.662642497Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value" 
time="2017-08-30T22:30:32.860254549Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value" 

And the docker logs for bekt/logspout-logstash:master are:

2017/08/30 22:18:33 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:33 pump.pumpLogs(): f6fddb18f835 started
2017/08/30 22:18:34 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:34 pump.pumpLogs(): 9ab5e8cc0c36 stopped
2017/08/30 22:18:34 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:34 pump.pumpLogs(): 9ab5e8cc0c36 started
2017/08/30 22:18:34 pump.Run() event: 42ad09d40141 network:connect
2017/08/30 22:18:34 pump.Run() event: orient-backu volume:mount
2017/08/30 22:18:34 pump.pumpLogs(): f6fddb18f835 stopped
2017/08/30 22:18:35 pump.pumpLogs(): f6fddb18f835 started
2017/08/30 22:18:35 pump.Run() event: orient-datab volume:mount
2017/08/30 22:18:35 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:35 pump.Run() event: orient-log volume:mount
2017/08/30 22:18:35 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:35 pump.Run() event: rancher-cni volume:mount
2017/08/30 22:18:35 pump.Run() event: efe51a4c5ed0 volume:mount
2017/08/30 22:18:36 pump.pumpLogs(): f6fddb18f835 stopped
2017/08/30 22:18:36 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:36 pump.pumpLogs(): 9ab5e8cc0c36 stopped
2017/08/30 22:18:36 pump.Run() event: 58a80e107fc1 start
2017/08/30 22:18:36 pump.pumpLogs(): 9ab5e8cc0c36 started
2017/08/30 22:18:36 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:36 pump.pumpLogs(): f6fddb18f835 started
2017/08/30 22:18:36 pump.pumpLogs(): 58a80e107fc1 started
2017/08/30 22:18:36 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:36 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:36 pump.pumpLogs(): 9ab5e8cc0c36 stopped
2017/08/30 22:18:36 pump.pumpLogs(): 9ab5e8cc0c36 started
2017/08/30 22:18:37 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:37 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:37 pump.pumpLogs(): f6fddb18f835 stopped
2017/08/30 22:18:37 pump.pumpLogs(): f6fddb18f835 started
2017/08/30 22:18:37 pump.pumpLogs(): 9ab5e8cc0c36 stopped
2017/08/30 22:18:37 pump.pumpLogs(): 9ab5e8cc0c36 started
2017/08/30 22:18:37 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:37 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:38 pump.pumpLogs(): be68d546188d stopped
2017/08/30 22:18:38 pump.pumpLogs(): be68d546188d started
2017/08/30 22:18:38 pump.pumpLogs(): f6fddb18f835 stopped
2017/08/30 22:18:38 pump.pumpLogs(): f6fddb18f835 started
2017/08/30 22:18:38 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:38 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:38 pump.pumpLogs(): 9ab5e8cc0c36 stopped
2017/08/30 22:18:38 pump.pumpLogs(): 9ab5e8cc0c36 started
2017/08/30 22:18:39 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:39 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:39 pump.pumpLogs(): 9ab5e8cc0c36 stopped
2017/08/30 22:18:39 pump.pumpLogs(): 9ab5e8cc0c36 started
2017/08/30 22:18:39 pump.pumpLogs(): f6fddb18f835 stopped
2017/08/30 22:18:39 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:39 pump.pumpLogs(): f6fddb18f835 started
2017/08/30 22:18:39 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:40 pump.pumpLogs(): 9ab5e8cc0c36 stopped
2017/08/30 22:18:40 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:40 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:40 pump.pumpLogs(): 9ab5e8cc0c36 started
2017/08/30 22:18:40 pump.pumpLogs(): f6fddb18f835 stopped
2017/08/30 22:18:40 pump.pumpLogs(): f6fddb18f835 started

I have trimmed some of the docker json logs and restarted the containers, but it is not helping; once the logspout container starts, it makes dockerd eat almost all the CPU.

Thanks in advance!

michaelshobbs commented 7 years ago

@mattatcha thoughts on this given new info?

sfrique commented 7 years ago

@pythiancastaneda To figure out which container is giving you trouble, in one terminal run:

# tail -f /var/log/syslog | grep docker

and in a second terminal, for each container you have:

# docker logs <container>

That way you will see in the logs:

time="2017-08-30T22:30:32.860254549Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value" 

And then you can remove the log file or the offending line.

Hope that helps.

I recently had the same problem; I did the steps above and it fixed the issue.
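
The same check as a single loop, in case that is easier. This is illustrative only; it reads each container's full log, so it can be slow, and anything docker logs writes to stderr is a hint that the container's json-file log is broken:

for c in $(docker ps -aq); do
  # Discard the container's own output and keep only the client's stderr.
  errs=$(docker logs "$c" 2>&1 > /dev/null)
  [ -n "$errs" ] && echo "container $c: $errs"
done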

leriel commented 7 years ago

Thanks for the update @stevecalvert! In my case the issue was the same: corrupt container logs. For me, 3.2.2 was also logging an additional error. Posting it here just in case:

logspout_1  | 2017/09/15 17:42:43 pump.pumpLogs(): 0e1c957edf2d started
logspout_1  | 2017/09/15 17:42:43 pump.pumpLogs(): 0e1c957edf2d stopped with error: Unrecognized input header: 3
logspout_1  | 2017/09/15 17:42:43 pump.pumpLogs(): 0e1c957edf2d started
logspout_1  | 2017/09/15 17:42:43 pump.pumpLogs(): 0e1c957edf2d stopped with error: Unrecognized input header: 3

The only way for me to fix that issue, like for others, was to remove and recreate the container.

konvergence commented 7 years ago

@leriel I get the same issue with logspout 3.2.2 and gelf plugins.

But I'm not sure I understand the workaround...

Which container do you recreate? Logspout, or the containers with corrupted json-file logs?

Is there any way to correct the logspout code?

konvergence commented 7 years ago

Don't know if there is any relationship, but I get this message if I try to tail one of the *-json.log files reported by logspout in debug mode:

tail: inotify cannot be used, reverting to polling: Too many open files

This seems to be a default limit on Ubuntu? Could that explain some of the corrupted json-file logs? See https://nefaria.com/2014/08/tail-inotify-resources-exhausted/
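
If it is the inotify limits that tail is running into, they can be inspected and raised with sysctl. This is purely illustrative and the values below are examples, not recommendations:

# Show the current limits, then raise them (add to /etc/sysctl.conf to persist).
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches
sudo sysctl -w fs.inotify.max_user_instances=512
sudo sysctl -w fs.inotify.max_user_watches=524288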

konvergence commented 7 years ago

Another piece of info: when I purge the json-file logs of every container that logspout reports as "stopped" and then restart logspout, CPU usage on dockerd stays flat.

Here is my code:

# Find every container whose pump logspout reports as "stopped", then truncate
# its json-file log so it starts off clean.
for containerid in $(docker logs $(docker ps -a -q --filter "label=io.rancher.stack_service.name=graylog2/logspout") 2>&1 | grep stopped | cut -d " " -f 4 | sort -u); do
     ls -l /var/lib/docker/containers/${containerid}*/${containerid}*-json.log
     # Truncate the log (bash expands the glob in the redirection as long as it matches a single file).
     > /var/lib/docker/containers/${containerid}*/${containerid}*-json.log
done
konvergence commented 7 years ago

One more piece of information: I use Docker engine 17.03.2-ce. Last week we had a power outage on the nodes. In my case, I think that was the cause of the corrupted json-file logs (Error streaming logs: invalid character 'l' after object key:value pair). Do you know whether this issue can be solved by a newer docker-ce release?

konvergence commented 7 years ago

Here is some news, though I don't know if there is a relationship.

I get all container IDs reported as stopped by logspout:

docker logs $(docker ps -a -q --filter "label=io.rancher.stack_service.name=graylog/logspout") 2>&1 | grep stopped | cut -d " " -f 4 | sort -u

c062de60016b
c4d6b098968c
f8f18bba6203

I indeed found '\x00' in each corrupted file, and the files are big:


for containerid in $(docker ps -a -q); do
    # Print any json-file log lines containing NUL bytes, then the container ID
    # and the size of the offending file.
    grep -Pa '\x00' /var/lib/docker/containers/${containerid}*/${containerid}*-json.log
    if [ $? -eq 0 ]; then
        echo ${containerid}
        ls -lh /var/lib/docker/containers/${containerid}*/${containerid}*-json.log
    fi
done

{"log":"Creating volume folders...\n","stream":"stdout","time":"2017-09-16T15:50:34.379859731Z"}
c4d6b098968c
-rw-r----- 1 root root 564M Sep 21 12:11 /var/lib/docker/containers/c4d6b098968c475a82ed7ff34cc4e50344f097155f10d0d9d07528437b4417e7/c4d6b098968c475a82ed7ff34cc4e50344f097155f10d0d9d07528437b4417e7-json.log
{"log":"Creating volume folders...\n","stream":"stdout","time":"2017-09-16T15:50:33.076131673Z"}
c062de60016b
-rw-r----- 1 root root 478M Sep 21 12:11 /var/lib/docker/containers/c062de60016bd13faeebceb0301c121b35a0dfbc620048584a3d06c3d0449761/c062de60016bd13faeebceb0301c121b35a0dfbc620048584a3d06c3d0449761-json.log
{"log":"Creating volume folders...\n","stream":"stdout","time":"2017-09-16T15:50:47.996912867Z"}
f8f18bba6203
-rw-r----- 1 root root 806M Sep 21 12:11 /var/lib/docker/containers/f8f18bba62032e188dd75f76b9cc061936e804995289d7406b89d43e02d7cc36/f8f18bba62032e188dd75f76b9cc061936e804995289d7406b89d43e02d7cc36-json.log
pierreozoux commented 7 years ago

We experienced the same issue yesterday. We don't use logspout, but this is the closest issue we found, with this kind of error log:

Jun 28 14:07:20 xxx docker[7686]: time="2017-06-28T14:07:20.892703572Z" level=error msg="Error streaming logs: invalid character 'l' after object key:value pair"

And the invalid character can be anything in our case.

We are still investigating, but would like to see if we can find the issue together. With our current information, it looks like one of our containers is outputting some kind of JSON under specific conditions, and dockerd doesn't like it at all. We didn't find any related moby issue though :/

We'll let you know of our findings.

Our setup is different though: docker 1.12 + k8s 1.7.5, with the scalyr agent ingesting logs from dockerd.

mjablecnik commented 7 years ago

Hello, I have:

$ docker version
Client:
 Version:      17.09.0-ce
 API version:  1.32
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:42:09 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.09.0-ce
 API version:  1.32 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:40:48 2017
 OS/Arch:      linux/amd64
 Experimental: false

System: Debian 9.1

logspout setup in Rancher: screenshot

and in syslog dockerd errors:

Oct 24 10:03:39 web4-vultr dockerd[31377]: time="2017-10-24T10:03:39.819294736+01:00" level=error msg="Error streaming logs: invalid character '\x00' looking for beginning of value" container=db62afc089ec method="(*Daemon).ContainerLogs" module=daemon
Oct 24 10:03:41 web4-vultr dockerd[31377]: time="2017-10-24T10:03:41.950449384+01:00" level=error msg="Error streaming logs: invalid character '\x00' looking for beginning of value" container=db62afc089ec method="(*Daemon).ContainerLogs" module=daemon
Oct 24 10:03:42 web4-vultr dockerd[31377]: time="2017-10-24T10:03:42.717452402+01:00" level=error msg="Error streaming logs: invalid character '\x00' looking for beginning of value" container=6488521fdc8f method="(*Daemon).ContainerLogs" module=daemon

and CPU usage is very high

I don't know why, but downgrading logspout to version v3.1 fixes it.

rhuddleston commented 7 years ago

We're on AWS and see this issue regularly. We have ~100 EC2 instances, and when AWS suddenly retires or shuts down an instance, the server restarts but some of the Docker json log files end up corrupt.

I can see where the corruption is and what line by doing the following:

# Print each file name, then let jq report the line and column of any parse error in it.
for FILE in /var/lib/docker/containers/*/*-json.log; do echo "$FILE"; jq . "$FILE" > /dev/null; done

I used to just shut down the container that corresponded to the corrupt log(s), manually remove the corrupted (half-written) JSON line, then restart the container, and all was well. Though with newer versions of Docker that no longer fixes the issue; I have to remove all the containers and re-deploy them to fix it now. I think this is really a Docker issue: if Docker detects a half-written file, it needs to handle it correctly. Basically 100% of the time AWS restarts a node unexpectedly, this happens for us. Typically it's faster to completely replace the node than to spend a bunch of time trying to fix this by hand.
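
For reference, the manual fix described above looks roughly like this. It is only a sketch: stop the container first and keep a backup, where LOGFILE is the json log path and N the line number jq complained about (both placeholders):

cp "$LOGFILE" "$LOGFILE.bak"
# Delete the half-written line N in place, then start the container again.
sed -i "${N}d" "$LOGFILE"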

ghost commented 6 years ago

We have the same issue, but found the offending log files using the grep below.

grep -P '\x00' /var/lib/docker/containers/**/*json.log

After manually removing the NUL ("\x00") characters from the log, the issue went away. It seems to be a Docker issue (docker/for-linux#140) related to unexpected host/daemon restarts.
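
A sketch of that cleanup for a single file, with the container stopped and a backup kept (LOGFILE is a placeholder):

cp "$LOGFILE" "$LOGFILE.bak"
# Strip the NUL bytes that the unexpected restart left behind.
tr -d '\000' < "$LOGFILE.bak" > "$LOGFILE"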

ppiccolo commented 6 years ago

Me too, same issue... any suggestions?

shahar-davidson commented 6 years ago

We resolved to use #321 as a workaround for the time being (note that BACKLOG should not be set, or should be set to true).

mahnunchik commented 6 years ago

I've run into 100% CPU usage and this error:

Mar 23 16:48:52 burger dockerd[569]: time="2018-03-23T16:48:52.971499270+02:00" level=error msg="Error streaming logs: EOF" container=b1dc4286ba72 method="(*Daemon).ContainerLogs" module=daemon

on the cadvisor container.

My config:

version: '2.3'

services:

  logspout:
    image: gliderlabs/logspout:v3.2.4
    container_name: logspout
    command: syslog+tcp://logs-01.loggly.com:514
    restart: always
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      TAIL: 10
arunesh90 commented 6 years ago

Any update on this? Just had it happen on a server which became pretty unresponsive because of it

kamaev commented 6 years ago

+upvoting rollback to 3.1

I've just force pushed ca675f3030a87f41536415114800470c0c74cd0d and manually added:

- glide.lock
- glide.yaml
- Dockerfile
- build.sh
- and also /adapters/multiline/multiline.go to use multiline

from the latest master to build an image. The new container works well.

biguphpc commented 3 years ago

I just wanted to confirm this issue with the latest CE version of Docker (19.03.12) and logspout (syslog+tls). We've had production servers crashing because of this. Dockerd was using 1600% CPU and then became unresponsive on a 32 CPU server. I noticed a very high number of goroutines showing in docker info with debug mode enabled. Stopping the logspout container and waiting a few hours resolved the issue, and the dockerd CPU was back to normal (along with the number of goroutines). I tried workarounds like TAIL=0 without success. Reverting to v3.1 solves the issue for now.
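
One way to watch that goroutine growth, assuming the daemon is running in debug mode so that docker info exposes the counter (illustrative only):

# The Goroutines field only appears when the daemon has debug enabled.
watch -n 5 "docker info 2>/dev/null | grep -i goroutines"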

We had about 30 containers running on that machine; logging was heavy but not excessive. It's easily reproducible with the latest logspout version: starting logspout with a lot of containers logging shows dockerd using high CPU almost immediately, and it increases over time.

Maybe newer versions query the docker daemon differently than v3.1 and flood it with requests, leading to a high number of goroutines and high CPU?