
Improvement to awslogs driver: report dropped log events in the next successful submission to CloudWatch. #41245

Open init-js opened 4 years ago

init-js commented 4 years ago

Description

When the docker host can no longer reach CloudWatch (e.g. due to a network outage), log events are dropped almost silently by the awslogs driver: the only trace of the failure is a log entry added to the docker daemon logs on the host.

An improvement to this fallback would be to also leave a trace of the failure in the CloudWatch logs themselves: the next batch of log events could be prefixed with a message indicating how many events were dropped in total. When connectivity to CloudWatch is restored, this trace would help assess how many log entries are missing.
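
A minimal sketch of the idea, assuming a hypothetical `dropCounter` (this is not the actual awslogs driver code, and the marker text is illustrative): count events that fail to upload, then prefix the next successful batch with a synthetic marker event.

```go
package main

import (
	"fmt"
	"sync"
)

// dropCounter tracks events that failed to upload since the last
// successful submission.
type dropCounter struct {
	mu      sync.Mutex
	dropped int
}

// record notes n events that could not be submitted to CloudWatch.
func (d *dropCounter) record(n int) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.dropped += n
}

// prefixBatch prepends a marker event to the next outgoing batch if
// any events were dropped, then resets the counter.
func (d *dropCounter) prefixBatch(batch []string) []string {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.dropped == 0 {
		return batch
	}
	marker := fmt.Sprintf("[awslogs] %d log event(s) dropped since last successful submission", d.dropped)
	d.dropped = 0
	return append([]string{marker}, batch...)
}

func main() {
	var d dropCounter
	d.record(42) // e.g. a batch of 42 events failed during an outage
	for _, e := range d.prefixBatch([]string{"app: back online"}) {
		fmt.Println(e)
	}
}
```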

A further improvement would be to buffer logs in a circular buffer on disk during a network outage, and then upload the missed logs when connectivity resumes.

Steps to reproduce the issue:

  1. Configure a container with the awslogs log driver.
  2. Make CloudWatch unreachable from the docker host.
  3. Let the awslogs driver fail to submit a batch or two.
  4. Restore network connectivity to CloudWatch.
  5. Notice that the log stream in CloudWatch has no indication that log events are missing.

Describe the results you received:

Log events were silently dropped during the outage, and the CloudWatch log stream contained no indication that any events were missing.

Describe the results you expected:

Just a small entry in the log stream that says how many events have been dropped.

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

Client:
 Version:           19.03.6-ce
 API version:       1.40
 Go version:        go1.13.4
 Git commit:        369ce74
 Built:             Fri May 29 04:01:26 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.6-ce
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.4
  Git commit:       369ce74
  Built:            Fri May 29 04:01:57 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.3.2
  GitCommit:        ff48f57fc83a8c44cf4ad5d672424a98ba37ded6
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Output of docker info:

Client:
 Debug Mode: false

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 19.03.6-ce
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ff48f57fc83a8c44cf4ad5d672424a98ba37ded6
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.14.177-139.254.amzn2.x86_64
 Operating System: Amazon Linux 2
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 9.91GiB
 Name: ip-10-1-59-52.us-west-2.compute.internal
 ID: O7SX:3R4H:ZCJQ:QVH7:2HIL:CVE6:XY6X:HXWA:IDBC:NFZB:F4IL:V5EM
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

Running on an AWS EC2 instance.

thaJeztah commented 4 years ago

/cc @samuelkarp @cpuguy83

samuelkarp commented 4 years ago

> The next batch of log events could be prefixed with a message indicating how many events were dropped in total. When connectivity to CloudWatch is restored, this trace would help assess how many log entries are missing.

This is pretty hard to do in a backwards-compatible way. If someone has built parsing logic (for monitoring, alarming, or any other use case) around the data in their log stream, injecting unexpected new entries can break that parsing logic. I think this behavior could potentially be opt-in, but not on by default.
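
A minimal sketch of how such an opt-in could be gated, assuming a hypothetical `awslogs-report-dropped` log-opt (the option name is illustrative, not an existing awslogs option); Docker log drivers receive user options as a plain string map, so the gate reduces to a config lookup.

```go
package main

import "fmt"

// reportDroppedEnabled returns true only when the user explicitly
// opted in, preserving the current default behavior.
// "awslogs-report-dropped" is a hypothetical option name.
func reportDroppedEnabled(cfg map[string]string) bool {
	return cfg["awslogs-report-dropped"] == "true"
}

func main() {
	// e.g. options passed via --log-opt awslogs-report-dropped=true
	cfg := map[string]string{"awslogs-report-dropped": "true"}
	fmt.Println(reportDroppedEnabled(cfg)) // true
}
```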

> A further improvement would be to buffer logs in a circular buffer on disk during a network outage, and then upload the missed logs when connectivity resumes.

Disk-based buffers are a reasonable approach, but just like memory there needs to be a limit; it wouldn't be great for the disk buffer to consume all available disk space. A limit increases the number of log entries that can be buffered during a failure, but does not eliminate the potential to drop entries.
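
For illustration, a minimal sketch of such a bounded buffer, assuming a fixed entry cap (a disk-backed version would bound bytes on disk the same way): once full, it overwrites the oldest entry and counts the loss, which is why a limit reduces but cannot eliminate dropped entries.

```go
package main

import "fmt"

// ringBuffer is a fixed-capacity buffer that evicts the oldest entry
// when full, counting how many entries were lost.
type ringBuffer struct {
	entries []string
	start   int // index of the oldest entry
	size    int // number of valid entries
	dropped int // entries overwritten because the buffer was full
}

func newRingBuffer(capacity int) *ringBuffer {
	return &ringBuffer{entries: make([]string, capacity)}
}

// push appends an entry, evicting the oldest one when at capacity.
func (r *ringBuffer) push(e string) {
	if r.size == len(r.entries) {
		// Buffer full: overwrite the oldest entry and record the loss.
		r.entries[r.start] = e
		r.start = (r.start + 1) % len(r.entries)
		r.dropped++
		return
	}
	r.entries[(r.start+r.size)%len(r.entries)] = e
	r.size++
}

// drain returns the buffered entries in order and empties the buffer,
// e.g. for re-upload once connectivity to CloudWatch is restored.
func (r *ringBuffer) drain() []string {
	out := make([]string, 0, r.size)
	for i := 0; i < r.size; i++ {
		out = append(out, r.entries[(r.start+i)%len(r.entries)])
	}
	r.start, r.size = 0, 0
	return out
}

func main() {
	buf := newRingBuffer(3)
	for _, e := range []string{"a", "b", "c", "d"} {
		buf.push(e)
	}
	fmt.Println(buf.drain(), "dropped:", buf.dropped) // [b c d] dropped: 1
}
```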