moby / moby

The Moby Project - a collaborative project for the container ecosystem to assemble container-based systems
https://mobyproject.org/
Apache License 2.0

Docker logging driver doesn't reconnect running containers' log streams after Fluentd restart #34804

Open zarbis opened 7 years ago

zarbis commented 7 years ago

Description

The Docker daemon doesn't correctly reconnect running containers' log streams to Fluentd after Fluentd is restarted.

Steps to reproduce the issue:

  1. Configure the daemon with the Fluentd driver, or run the container/service with the --log-driver=fluentd option (a per-service equivalent is sketched after this list).

    {
      "log-driver": "fluentd",
      "log-opts": {
        "fluentd-async-connect": "true"
      }
    }
  2. Start a global Fluent Bit service (the issue reproduces with Fluentd too): docker service create --name fluent-bit --mode global -p 24224:24224 fluent/fluent-bit:latest

  3. Start a test service: docker service create --name dater --mode global alpine sh -c 'while true; do date; sleep 1; done'

  4. Confirm that fluent-bit receives logs: docker service logs -f fluent-bit

  5. Remove and immediately re-create fluent-bit service.

  6. Inspect the fluent-bit service logs as in step 4.

  7. Re-create test service.

  8. Inspect fluent-bit service logs once more.
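
For completeness, the per-service equivalent of the daemon-wide configuration in step 1 would look roughly like this (the service name here is only an example):

    docker service create \
      --name dater \
      --mode global \
      --log-driver fluentd \
      --log-opt fluentd-async-connect=true \
      alpine sh -c 'while true; do date; sleep 1; done'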

Describe the results you received:

After the fluent-bit service is re-created, it stops receiving logs from the other containers until those containers are themselves re-created.

Describe the results you expected:

The Docker daemon reconnects to the Fluentd output and resumes sending logs from running containers.

Additional information you deem important (e.g. issue happens only occasionally):

Relevant parts of daemon logs:

time="2017-09-11T17:27:16.653512364+03:00" level=error msg="Failed to log msg \"\" for logger fluentd: write tcp 127.0.0.1:45096->127.0.0.1:24224: write: broken pipe"
time="2017-09-11T17:27:16.653640929+03:00" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
time="2017-09-11T17:27:16.653696884+03:00" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
time="2017-09-11T17:27:19.654082429+03:00" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"

For some reason it gives up faster than the default fluentd-max-retries × fluentd-retry-wait (10 retries × 1 sec) would allow.
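
For reference, both settings are exposed as log-opts of the fluentd driver; a rough daemon.json sketch that raises them would look like this (the values are only illustrative, not the defaults):

    {
      "log-driver": "fluentd",
      "log-opts": {
        "fluentd-async-connect": "true",
        "fluentd-retry-wait": "5s",
        "fluentd-max-retries": "30"
      }
    }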

Relevant connections:

# netstat -pnut | grep 24224
tcp        1      0 127.0.0.1:45102         127.0.0.1:24224         CLOSE_WAIT  28789/dockerd   
tcp        0      0 127.0.0.1:45270         127.0.0.1:24224         ESTABLISHED 28789/dockerd   
tcp        0 644742 127.0.0.1:45166         127.0.0.1:24224         ESTABLISHED 28789/dockerd

This stayed unchanged the entire time I was writing this report, about 30 minutes.

Output of docker version:

Client:
 Version:      17.06.2-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   cec0b72
 Built:        Tue Sep  5 20:00:17 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.2-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   cec0b72
 Built:        Tue Sep  5 19:59:11 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 5
 Running: 3
 Paused: 0
 Stopped: 2
Images: 3
Server Version: 17.06.2-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 33
 Dirperm1 Supported: true
Logging Driver: fluentd
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: 1tmn19uqnfeg6y68h2r3ulj74
 Is Manager: true
 ClusterID: l2ozazncaoza14gvv41ytrcvg
 Managers: 2
 Nodes: 2
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Root Rotation In Progress: false
 Node Address: 172.16.0.6
 Manager Addresses:
  172.16.0.5:2377
  172.16.0.6:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 6e23458c129b551d5c9871e5174f6b1b7f6d1170
runc version: 810190ceaa507aa2727d7ae6f4790c76ec150bd2
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-93-generic
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 62.82GiB
Name: srv6
ID: FFHH:BKKX:X5FI:OS3W:VY32:Q3TL:VQUH:D5U6:YJ4K:4TJL:KAN4:I5RF
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.): Ubuntu 16.04 on physical server.

thaJeztah commented 7 years ago

ping @michaelwilde @outcoldman PTAL

outcoldman commented 7 years ago

@thaJeztah we were responsible for the Splunk driver.

thaJeztah commented 7 years ago

@outcoldman oh, brainfart, I was looking at a different issue and completely miswired my thinking, LOL

thaJeztah commented 7 years ago

Slightly related discussion: https://github.com/moby/moby/issues/32567

ping @tagomoris @cpuguy83 PTAL

max-lobur commented 6 years ago

@zarbis do you think this can be avoided by removing fluentd-async-connect?

danilo-augusto commented 6 years ago

@max-lobur Yes, it works this way :) However, it's not the optimal solution, since those logs won't be buffered if Fluentd crashes, but it's better than crashing the entire Docker daemon anyway...
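
For anyone trying this, a sketch of the daemon.json from the original report with the async option removed (synchronous connect is the default, so dropping the key is enough):

    {
      "log-driver": "fluentd"
    }

Keep in mind that without async connect, container start can fail if the Fluentd endpoint is unreachable at that moment.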

stszap commented 6 years ago

I think we faced the same problem once. We ended up running the fluentd container with --network=host, listening on 127.0.0.1, and using fluentd-address=tcp://127.0.0.1:24224 for the other containers (with async mode too). We've never had any issues since then.
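
A rough sketch of that setup, assuming a stock fluentd image with the forwarder configured to listen on 127.0.0.1:24224 (image and names are illustrative):

    # Fluentd on the host network, so other containers can reach it via 127.0.0.1
    docker run -d --name fluentd --network=host fluent/fluentd:latest

    # Application container pointed at the local forwarder
    docker run -d \
      --log-driver=fluentd \
      --log-opt fluentd-address=tcp://127.0.0.1:24224 \
      --log-opt fluentd-async-connect=true \
      alpine sh -c 'while true; do date; sleep 1; done'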