Carles-Figuerola opened this issue 5 years ago
Thank you for reporting this issue, @Carles-Figuerola. We are aware that the unavailability of Splunk can have really negative consequences on container health.
This is definitely something we will address and fix; however, we may make the behavior an explicit selection. Ideally we need a solution that neither borks the container nor drops logs.
Making it an explicit selection is a perfectly acceptable solution. Thanks!
@dtregonning Thank you for responding to the report so quickly. Unfortunately this also affects my application. Are there any updates on this? Alternatively, is there a recommended workaround?
Any updates on this?
Any workarounds, such as tweaking timeout/retry/abandon settings? Is there any way to detect the issue aside from monitoring the docker daemon log stream?
We are also having the same problem. Any updates on this?
As a workaround, try to enable non-blocking log delivery mode:
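A minimal daemon.json sketch of that mode (the splunk-url and splunk-token values below are placeholders, not anyone's real settings):

{
  "log-driver": "splunk",
  "log-opts": {
    "splunk-url": "https://your-hec-endpoint:8088",
    "splunk-token": "00000000-0000-0000-0000-000000000000",
    "mode": "non-blocking",
    "max-buffer-size": "4m"
  }
}

After editing /etc/docker/daemon.json the docker daemon needs a restart. Note that non-blocking mode drops log messages once the in-memory buffer fills instead of back-pressuring the container.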
bump
Same problem here with Docker's built-in splunk log driver. The problem starts with connections to the heavy forwarder stuck in CLOSE_WAIT and then propagates to the docker containers. I was hoping to find a solution here, but apparently the same problem applies when using the plugin.
Can anyone here with this issue confirm that setting mode to non-blocking at least mitigates it? i.e. does this work in docker daemon.json with the splunk logging driver:
"log-opts": {
"mode": "non-blocking",
"max-buffer-size": "4m"
}
On a side note, we also want to ensure that we never compromise the availability of production containers or the docker api, even if the logging system (still very important) goes down. Ideally, there would be some local temporary filesystem buffer (up to some sane limit, say ~100MB, possibly depending on how noisy and how many containers you run) which would allow queued-up delivery of logs when/if the splunk endpoint eventually comes back up. This would make temporary splunk endpoint availability issues survivable without noticeable impact on container functionality and logging, while preserving container functionality (at the cost of lost logs) during extended splunk endpoint outages.
Can anyone here with this issue confirm that setting mode to non-blocking at least mitigates it? i.e. does this work in docker daemon.json with the splunk logging driver:
"log-opts": { "mode": "non-blocking", "max-buffer-size": "4m" }
Since we enabled non-blocking mode we haven't seen any issues.
Hello, I am currently experiencing the same issues as above.
I have already set non-blocking mode for docker and am using the default Splunk globals, as below.
SPLUNK_LOGGING_DRIVER_POST_MESSAGES_FREQUENCY | 5s
SPLUNK_LOGGING_DRIVER_POST_MESSAGES_BATCH_SIZE | 1000
SPLUNK_LOGGING_DRIVER_BUFFER_MAX | 10 * 1000
SPLUNK_LOGGING_DRIVER_CHANNEL_SIZE | 4 * 1000
Are there any recent updates to the Splunk driver that might fix this?
Would reducing the buffer_max and channel_size values help in any way? People mentioned that setting non-blocking mode worked for them; I wonder if something else helped in conjunction with that.
P.S - A timeout would really help here :)
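For anyone wanting to experiment with those globals: they are read from the docker daemon's environment, so on a systemd host one way to override them is a drop-in unit file (the path and values below are illustrative only, not recommended settings):

# /etc/systemd/system/docker.service.d/splunk-logging.conf
[Service]
Environment="SPLUNK_LOGGING_DRIVER_POST_MESSAGES_FREQUENCY=1s"
Environment="SPLUNK_LOGGING_DRIVER_BUFFER_MAX=2000"
Environment="SPLUNK_LOGGING_DRIVER_CHANNEL_SIZE=1000"

followed by systemctl daemon-reload and systemctl restart docker. Smaller buffer/channel sizes limit how much the driver holds in memory while the endpoint is unreachable, but they do not by themselves prevent blocking.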
I know this is an old issue, but I thought I'd post here for people reading. Both with this plugin and the built-in splunk driver, none of the workarounds work for us. I tested by using iptables to block access to our splunk instance/collector and it basically stops all of our containers. We'll be trying to find a different log driver.
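For reference, the kind of blocking test described above might look like this (the address and port are placeholders for your Splunk collector or load balancer):

iptables -I OUTPUT -p tcp -d 203.0.113.10 --dport 8088 -j DROP

Dropping (rather than rejecting) the packets leaves connections hanging, which is close to the failure mode originally reported in this issue.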
What happened: We have a cluster of nodes running docker, managed by marathon/mesos. The containers running there use the docker splunk logging plugin to send logs to the splunk event collector.
The load balancer in front of the splunk event collector was having trouble connecting, so from the point of view of the logging plugin the https connections were being opened but never answered; all connections were "hanging". This made the whole environment unstable, as containers were failing healthchecks and unable to serve the applications running on them.
An example of the logs seen in docker:
A manual connection to the splunk-ec shows that it hangs after sending the headers and gets no response at all:
What you expected to happen: If the splunk logging driver can't send logs for any reason, it should fill the buffer and drop logs when it is full, not make the docker agent unstable and the application inaccessible.
How to reproduce it (as minimally and precisely as possible): Run a small app (maybe just nc -l -p 443) that listens on https but never sends any reply, successful or not, then point the splunk logging plugin at it.
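A rough sketch of that reproduction, with placeholder values for the token and image (not taken from the original report):

nc -l -p 443 &
docker run \
  --log-driver=splunk \
  --log-opt splunk-url=https://127.0.0.1:443 \
  --log-opt splunk-token=00000000-0000-0000-0000-000000000000 \
  --log-opt splunk-insecureskipverify=true \
  alpine sh -c 'while true; do echo hello; sleep 1; done'

The listener accepts the TCP connection but never completes the TLS handshake or replies, so the driver's HTTP requests hang indefinitely.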
Anything else we need to know?: The docker agent runs with these environment variables:
The containers are running with these options:
Environment:
Docker version (docker version):
OS (cat /etc/os-release):
(this shouldn't matter, as the problem was with splunk not getting an https response from the load balancer)