Open init-js opened 4 years ago
/cc @samuelkarp @cpuguy83
The next batch of log events could be prefixed with a message indicating how many messages total have been dropped. When connectivity to CloudWatch is restored, leaving a trace of the failure would help assess how many log entries are missing.
This is pretty hard to do in a backwards-compatible way. If someone has built parsing logic (for monitoring, alarming, or any other use case) around the data in their log stream, injecting unexpected new entries into that stream can break the parsing logic. I think this behavior could potentially be opt-in, but not on by default.
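To make the opt-in idea concrete, here is a minimal sketch in Go. Everything in it is hypothetical: the `batcher` type, the `notifyDrops` flag, and the wording of the notice are illustrative assumptions, not the awslogs driver's actual code or options. It only shows that a drop counter plus an opt-in flag is enough to prepend one notice to the next batch while leaving the default stream untouched.

```go
package main

import "fmt"

// batcher is a hypothetical stand-in for the driver's batching logic.
// It is NOT the awslogs driver's real type.
type batcher struct {
	notifyDrops bool // assumed opt-in flag; false leaves the stream unchanged
	dropped     int  // events dropped since the last batch was built
}

// recordDrop counts events discarded while the endpoint was unreachable.
func (b *batcher) recordDrop(n int) { b.dropped += n }

// nextBatch returns the events to send. If drops were recorded and the
// opt-in flag is set, a single notice entry is prepended. The counter is
// reset either way so an old outage is not reported twice.
func (b *batcher) nextBatch(events []string) []string {
	n := b.dropped
	b.dropped = 0
	if n == 0 || !b.notifyDrops {
		return events
	}
	// The notice format is made up for this example.
	notice := fmt.Sprintf("[awslogs] %d log events were dropped", n)
	return append([]string{notice}, events...)
}

func main() {
	b := &batcher{notifyDrops: true}
	b.recordDrop(3) // simulate three events lost during an outage
	for _, line := range b.nextBatch([]string{"first event after recovery"}) {
		fmt.Println(line)
	}
}
```

With `notifyDrops` left at its zero value, `nextBatch` returns the events unchanged, so existing parsers would see no new entries.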
A further improvement would be to buffer logs to a circular buffer on disk during a network outage, and then upload the old logs that were missed when connectivity resumes.
Disk-based buffers are a reasonable approach to take, but just like memory there needs to be a limit; it wouldn't be great for your disk buffer to consume all your disk space. This will increase the number of log entries that can be buffered during some sort of failure, but not eliminate the potential to drop entries.
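The trade-off can be sketched as a size-capped buffer that evicts the oldest entries once the cap is reached and counts what it evicted. This is an in-memory illustration only; `boundedBuffer`, `maxBytes`, and the eviction policy are assumptions made for the example, and an on-disk version would apply the same accounting to file size rather than string lengths.

```go
package main

import "fmt"

// boundedBuffer is a hypothetical, in-memory stand-in for a size-capped
// circular buffer. It is not part of Docker or the awslogs driver.
type boundedBuffer struct {
	maxBytes int      // hard cap on buffered payload size
	size     int      // bytes currently buffered
	entries  []string // oldest entry first
	dropped  int      // entries evicted or rejected since the last drain
}

// push appends an entry, evicting the oldest entries until it fits.
// An entry larger than the whole cap is rejected and counted as dropped.
func (b *boundedBuffer) push(entry string) {
	if len(entry) > b.maxBytes {
		b.dropped++
		return
	}
	for b.size+len(entry) > b.maxBytes {
		b.size -= len(b.entries[0])
		b.entries = b.entries[1:]
		b.dropped++
	}
	b.entries = append(b.entries, entry)
	b.size += len(entry)
}

// drain returns the buffered entries and the drop count, then resets both.
func (b *boundedBuffer) drain() ([]string, int) {
	out, n := b.entries, b.dropped
	b.entries, b.size, b.dropped = nil, 0, 0
	return out, n
}

func main() {
	buf := &boundedBuffer{maxBytes: 10}
	for _, e := range []string{"aaaa", "bbbb", "cccc"} {
		buf.push(e)
	}
	entries, dropped := buf.drain()
	fmt.Println(entries, dropped) // prints: [bbbb cccc] 1
}
```

A larger cap only delays the first eviction; the `dropped` counter is what lets the driver report the loss once connectivity returns.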
Description
When the Docker host can no longer reach CloudWatch (e.g. due to a network outage), log events are (almost silently) dropped by the awslogs driver. A log entry to mark the failure is added to the Docker daemon logs on the host.
An improvement to this fallback would be to also leave a trace of this in the AWS CloudWatch logs. The next batch of log events could be prefixed with a message indicating how many messages total have been dropped. When connectivity to CloudWatch is restored, leaving a trace of the failure would help assess how many log entries are missing.
A further improvement would be to buffer logs to a circular buffer on disk during a network outage, and then upload the old logs that were missed when connectivity resumes.
Steps to reproduce the issue:
Describe the results you received:
Describe the results you expected:
Just a small entry in the log stream that says how many events have been dropped.
Additional information you deem important (e.g. issue happens only occasionally):
Output of docker version:
Output of docker info:
Additional environment details (AWS, VirtualBox, physical, etc.):
Running on an AWS EC2 instance.