splunk / docker-logging-plugin

Splunk Connect for Docker is a Docker logging plugin that allows docker containers to send their logs directly to Splunk Enterprise or a Splunk Cloud deployment.
Apache License 2.0

Splunk driver not getting response from splunk makes docker unresponsive #55

Open Carles-Figuerola opened 5 years ago

Carles-Figuerola commented 5 years ago

What happened: We have a cluster of nodes running docker, managed by marathon/mesos. The containers running there use the docker splunk logging plugin to send their logs to the splunk event collector.

The load balancer in front of the splunk event collector was having trouble connecting, so from the point of view of the logging plugin the HTTPS connections were being opened but never answered; every connection was left hanging. This made the whole environment unstable: containers stopped passing healthchecks and could no longer serve the applications running on them.

An example of the logs seen in docker is:

Aug 12 12:50:34 dockerhost.local dockerd[10030]: time="2019-08-12T12:50:34.493818095-07:00" level=warning msg="Error while sending logs" error="Post https://splunk-ec:443/services/collector/event/1.0: context deadline exceeded" module=logger/splunk

A manual connection to splunk-ec shows that it hangs after sending the headers and never gets a response:

$ curl -vk https://splunk-ec:443/services/collector/event/1.0
* About to connect() to splunk-ec port 443 (#0)
*   Trying 10.0.0.1...
* Connected to splunk-ec (10.0.0.1) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* SSL connection using TLS_RSA_WITH_AES_256_CBC_SHA
* Server certificate:
*       subject: CN=<REDACTED>
*       start date: Jan 22 16:45:30 2010 GMT
*       expire date: Jan 23 01:36:42 2020 GMT
*       common name: <REDACTED>
*       issuer: CN=Entrust Certification Authority - L1C,OU="(c) 2009 Entrust, Inc.",OU=www.entrust.net/rpa is incorporated by reference,O="Entrust, Inc.",C=US
> GET /services/collector/event/1.0 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: splunk-ec
> Accept: */*
>
^C

What you expected to happen: If the splunk logging driver can't send logs for any reason, it should fill its buffer and drop logs once the buffer is full, not make the docker agent unstable and the application inaccessible.

How to reproduce it (as minimally and precisely as possible): Have a small app (maybe just nc -l -p 443) listen for HTTPS connections but never send a reply, successful or otherwise, then point the splunk logging plugin at it.
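
For example, here is a rough sketch of such a non-responding HTTPS endpoint using openssl s_server (the self-signed certificate and port 8443 are just placeholders):

$ openssl req -x509 -newkey rsa:2048 -nodes -keyout key.pem -out cert.pem -subj "/CN=localhost" -days 1
$ openssl s_server -accept 8443 -cert cert.pem -key key.pem -quiet

s_server completes the TLS handshake and reads the request but never writes a response, so pointing --log-opt=splunk-url=https://localhost:8443 (with splunk-insecureskipverify=true) at it should reproduce the hang.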

Anything else we need to know?: The docker agent runs with these environment variables:

SPLUNK_LOGGING_DRIVER_BUFFER_MAX=400
SPLUNK_LOGGING_DRIVER_CHANNEL_SIZE=200
SPLUNK_LOGGING_DRIVER_POST_MESSAGES_BATCH_SIZE=20
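
For anyone reproducing this with the plugin rather than the built-in driver, one way to apply such settings is docker plugin set while the plugin is disabled (the plugin alias below is just an illustrative assumption):

$ docker plugin disable splunk-logging-plugin
$ docker plugin set splunk-logging-plugin SPLUNK_LOGGING_DRIVER_BUFFER_MAX=400 SPLUNK_LOGGING_DRIVER_CHANNEL_SIZE=200 SPLUNK_LOGGING_DRIVER_POST_MESSAGES_BATCH_SIZE=20
$ docker plugin enable splunk-logging-plugin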

The containers are running with these options:

--log-driver=splunk
--log-opt=splunk-token=<token>
--log-opt=splunk-url=https://splunk-ec:443
--log-opt=splunk-index=app
--log-opt=splunk-sourcetype=<sourcetype>
--log-opt=splunk-insecureskipverify=true
--log-opt=env=APP_NAME,HOST,ACTIVE_VERSION
--log-opt=splunk-format=raw
--log-opt=splunk-verify-connection=false

Environment:

dtregonning commented 5 years ago

Thank you for reporting this issue, @Carles-Figuerola. We are aware that the unavailability of Splunk can have very negative consequences on container health.

This will definitely be something we address and fix; however, we may make the behavior an explicit selection. Ideally we need a solution that neither borks the container nor drops logs.

Carles-Figuerola commented 5 years ago

Making it an explicit selection is a totally acceptable solution. Thanks!

PiotrJustyna commented 5 years ago

@dtregonning Thank you for responding to the report so quickly. Unfortunately this also affects my application. Are there any updates on this? Alternatively, is there a recommended workaround?

gabricar commented 4 years ago

Any updates on this?

zerog2k commented 4 years ago

Any workarounds, such as tweaking timeout/retry/abandon settings? And is there any way to detect the issue other than monitoring the docker daemon log stream?

fabriciofelipe commented 4 years ago

We are also having the same problem. Any updates on this?

ykapustin commented 4 years ago

As a workaround, try to enable non-blocking log delivery mode:

https://docs.docker.com/config/containers/logging/configure/#configure-the-delivery-mode-of-log-messages-from-container-to-log-driver
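
For containers started directly with docker run, the same mode can also be passed per container, in the same style as the options listed earlier in this thread (the buffer size here is just an example):

--log-opt=mode=non-blocking
--log-opt=max-buffer-size=4m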

johnjelinek commented 3 years ago

bump

joachimbuechse commented 3 years ago

Same problem here with Docker's built-in splunk log driver. The problem starts with connections to the heavy forwarder stuck in CLOSE_WAIT and then propagates to the docker containers. I was hoping to find a solution here, but apparently the same problem applies when using the plugin.

zerog2k commented 3 years ago

Can anyone here with this issue confirm that setting the mode to non-blocking at least mitigates it? i.e. does this work in docker daemon.json with the splunk logging driver:

  "log-opts": {
    "mode": "non-blocking",
    "max-buffer-size": "4m"
 }

On a side note, we also want to ensure that we never compromise the availability of production containers or the docker API, even if the logging system (still very important) goes down. Ideally, there would be some local temporary filesystem buffer (up to some sane limit, say ~100MB, depending on how noisy your containers are and how many you run) that would allow queued-up delivery of logs when/if the splunk endpoint eventually comes back up. That would make temporary splunk endpoint availability issues survivable without noticeable impact to container functionality or logging, while still preserving container functionality (at the cost of lost logs) during extended splunk endpoint outages.

ykapustin commented 3 years ago

Can anyone here with this issue confirm that setting the mode to non-blocking at least mitigates it? i.e. does this work in docker daemon.json with the splunk logging driver:

  "log-opts": {
    "mode": "non-blocking",
    "max-buffer-size": "4m"
  }

Since we enabled non-blocking mode we haven't seen any issues.

Idriosiris commented 2 years ago

Hello, I am currently experiencing the same issues as above.

I have already set non-blocking mode for docker and am using the default Splunk globals, as per below.

SPLUNK_LOGGING_DRIVER_POST_MESSAGES_FREQUENCY | 5s
SPLUNK_LOGGING_DRIVER_POST_MESSAGES_BATCH_SIZE | 1000
SPLUNK_LOGGING_DRIVER_BUFFER_MAX | 10 * 1000
SPLUNK_LOGGING_DRIVER_CHANNEL_SIZE | 4 * 1000

Are there any recent updates to the Splunk driver that might fix this?

Would reducing the buffer_max and channel_size values help me in any way? People mentioned that setting non-blocking mode worked for them; I wonder if something else helped in conjunction with that.

P.S. A timeout would really help here :)

kulack commented 11 months ago

I know this is an old issue, but I thought I'd post here for people reading. With both this plugin and the built-in splunk driver, none of the workarounds work for us. I tested by using iptables to block access to our splunk instance/collector, and it basically stops all of our containers. We'll be trying to find a different log driver.
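
For anyone who wants to run the same kind of test, a rule along these lines (the address is a placeholder for your HEC endpoint) is enough to simulate the outage, and deleting it afterwards restores connectivity:

$ iptables -I OUTPUT -p tcp -d <splunk-hec-ip> --dport 443 -j DROP
$ iptables -D OUTPUT -p tcp -d <splunk-hec-ip> --dport 443 -j DROP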