Containers using gcplogs logging driver would not start while Stackdriver WebApp is down

robborden commented 8 years ago

Output of docker version:

docker version

Client:
 Version: 1.11.1
 API version: 1.23
 Go version: go1.5.4
 Git commit: 5604cbe
 Built: Tue Apr 26 23:30:23 2016
 OS/Arch: linux/amd64

Server:
 Version: 1.11.1
 API version: 1.23
 Go version: go1.5.4
 Git commit: 5604cbe
 Built: Tue Apr 26 23:30:23 2016
 OS/Arch: linux/amd64

Output of docker info:

Containers: 49
 Running: 47
 Paused: 0
 Stopped: 2
Images: 128
Server Version: 1.11.1
Storage Driver: aufs
 Root Dir: /data/docker/aufs
 Backing Filesystem: extfs
 Dirs: 255
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 4.2.0-41-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 14.69 GiB
Name: daemon-02
ID: 75B6:TRJI:APWP:BHTP:D3Q5:KQWH:WRBZ:UFTP:2XUC:2OKF:YQU3:X3DO
Docker Root Dir: /data/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):

Most containers run a mono command

Steps to reproduce the issue:

1. Turn off stackdriver :P
2. Start container with gcplogs driver
3. Observe container start and then stop immediately
4. Fail to find related logs

Additional Information:

Today Stackdriver was temporarily down. We had a network outage on GCE around the same time (probably related), which was resolved more quickly than Stackdriver's issue. When we restarted a couple of our services, they wouldn't start with the gcplogs driver enabled. Once I switched to the json-file logging driver, they started fine. Once Stackdriver was back online, everything was fine. Also, containers using the gcplogs driver that remained running were not impacted by the fact that Stackdriver was down (besides not pushing logs, of course). Does this fall into the "it's a feature, not a bug" category?

ehazlett commented 8 years ago

/cc @mikedanese

mikedanese commented 8 years ago

I think that this is unrelated to Stackdriver and more related to not being able to contact googleapis.com. The check in question is here:

https://github.com/docker/docker/blob/master/daemon/logger/gcplogs/gcplogging.go#L120

Can you confirm that is an error you are seeing in your docker daemon's logs?

Other logging drivers are fairly inconsistent with how they handle an unreachable sink. From a spot check it looks like:

aws doesn't check
fluentd checks and fails container creation if the sink is unreachable, but only in sync mode (the default); in async mode it does not fail
splunk always checks and fails container creation if the sink is unreachable
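To make the three behaviors from the spot check concrete, here is a minimal Go sketch. It is not taken from any of the real drivers: `Sink`, `CheckPolicy`, `NewLogger`, and `downSink` are hypothetical names, and the only assumption carried over from Docker is that an error returned while creating the logger fails container creation.

```go
package main

import (
	"errors"
	"fmt"
)

// Sink is a hypothetical stand-in for a remote logging endpoint.
type Sink interface {
	Ping() error
}

// CheckPolicy models what a driver does when the container is created.
type CheckPolicy int

const (
	NoCheck          CheckPolicy = iota // never probe the sink
	CheckUnlessAsync                    // probe, but tolerate failure in async mode
	CheckAlways                         // probe and always fail on error
)

// NewLogger mimics a driver constructor. A non-nil error here stands for
// "container creation fails".
func NewLogger(s Sink, p CheckPolicy, async bool) error {
	if p == NoCheck {
		return nil
	}
	if err := s.Ping(); err != nil {
		if p == CheckUnlessAsync && async {
			return nil // unreachable sink is tolerated in async mode
		}
		return fmt.Errorf("logging sink unreachable: %w", err)
	}
	return nil
}

// downSink simulates an endpoint that cannot be reached.
type downSink struct{}

func (downSink) Ping() error { return errors.New("connection refused") }

func main() {
	fmt.Println("no check:       ", NewLogger(downSink{}, NoCheck, false))
	fmt.Println("check, sync:    ", NewLogger(downSink{}, CheckUnlessAsync, false))
	fmt.Println("check, async:   ", NewLogger(downSink{}, CheckUnlessAsync, true))
	fmt.Println("check, always:  ", NewLogger(downSink{}, CheckAlways, true))
}
```

With the sink down, only the second and fourth calls return an error, which matches the inconsistency described above.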

Can an owner of logging drivers weigh in on what the best design here is?

thaJeztah commented 8 years ago

Can an owner of logging drivers weigh in on what the best design here is?

ping @cpuguy83 @stevvooe ^^

stevvooe commented 8 years ago

Failure of a remote logging endpoint should never fail an application (I have a caveat to this below).

That said, when the endpoint is unreachable, we need to define the behavior:

1. Spool logs endlessly and forward on recovery.
2. Spool logs into rotating buffer and forward available on recovery (aka save last N).
3. Drop all logs, lose data.
4. Fail application.

There are probably arguments for all these cases. For a production system, where one doesn't care about all log output, option 2 is by far the best. I am not sure if other drivers actually do this. Failing the application, option 4, would only be for systems where logs are some sort of system of record.
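The "save last N" behavior could look roughly like the following fixed-size ring buffer. This is only a sketch under assumptions: `RingBuffer`, `Push`, and `Drain` are hypothetical names, it is not how any existing driver is implemented, and it is not safe for concurrent use without added locking.

```go
package main

import "fmt"

// RingBuffer keeps at most len(buf) messages and overwrites the oldest
// entry once it is full.
type RingBuffer struct {
	buf   []string
	start int // index of the oldest buffered message
	size  int // number of messages currently buffered
}

// NewRingBuffer returns a buffer that retains the last `capacity` messages.
func NewRingBuffer(capacity int) *RingBuffer {
	return &RingBuffer{buf: make([]string, capacity)}
}

// Push stores a message while the endpoint is unreachable.
func (r *RingBuffer) Push(msg string) {
	if len(r.buf) == 0 {
		return // zero capacity: every message is dropped
	}
	if r.size < len(r.buf) {
		r.buf[(r.start+r.size)%len(r.buf)] = msg
		r.size++
		return
	}
	// Full: overwrite the oldest message and advance the start index.
	r.buf[r.start] = msg
	r.start = (r.start + 1) % len(r.buf)
}

// Drain returns the buffered messages, oldest first, and empties the
// buffer. A driver would call this once the endpoint has recovered.
func (r *RingBuffer) Drain() []string {
	out := make([]string, 0, r.size)
	for i := 0; i < r.size; i++ {
		out = append(out, r.buf[(r.start+i)%len(r.buf)])
	}
	r.start, r.size = 0, 0
	return out
}

func main() {
	rb := NewRingBuffer(3)
	for i := 1; i <= 5; i++ {
		rb.Push(fmt.Sprintf("line %d", i))
	}
	fmt.Println(rb.Drain()) // prints [line 3 line 4 line 5]
}
```

The capacity bounds memory use during an outage, at the cost of losing everything older than the last N messages.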

moby / moby

Containers using gcplogs logging driver would not start while Stackdriver WebApp is down #25940