Closed: netudima closed this issue 4 years ago
Our implementation is best-effort. Typically, NIO implementations allow for buffering and don't block the calling thread. We didn't want to rebuild what Netty provides. Feel free to come up with a proposal.
We could provide various backoff strategies (fixed, exponential); not sure whether we need jitter in addition to backoff.
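For illustration, such pluggable strategies could look roughly like this (the interface and class names here are hypothetical, not part of the current logstash-gelf API):

```java
// Hypothetical interface and class names, not the current logstash-gelf API.
interface BackoffStrategy {
    long delayMillis(int attempt);
}

// Fixed backoff: the same delay for every attempt.
class FixedBackoff implements BackoffStrategy {

    private final long delayMillis;

    FixedBackoff(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    @Override
    public long delayMillis(int attempt) {
        return delayMillis;
    }
}

// Exponential backoff: the delay doubles with each attempt, capped at a maximum.
class ExponentialBackoff implements BackoffStrategy {

    private final long initialMillis;
    private final long maxMillis;

    ExponentialBackoff(long initialMillis, long maxMillis) {
        this.initialMillis = initialMillis;
        this.maxMillis = maxMillis;
    }

    @Override
    public long delayMillis(int attempt) {
        long delay = initialMillis << Math.min(attempt, 30);
        return Math.min(maxMillis, delay);
    }
}
```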
Yes, understood, no concerns here. Unfortunately we have faced the issue in some of our environments several times: due to the described reason, several applications start to consume 2-3x more CPU than in normal mode for ~10 minutes. An async-profiler CPU profile captured at that time shows that biz/paluch/logging/gelf/intern/sender/GelfTCPSender.write spends 73% of total CPU usage. The apps did not have a high log rate; based on the file logs printed in parallel by a file appender, we logged only 48 messages during this time period. The "deliveryAttempts" parameter of the GELF appender has its default value of 1.
Feel free to come up with a proposal.
Yes, I will prepare an MR with a possible way to add a backoff strategy.
The flamegraph from async profiler:
That's in place now.
logstash-gelf version: 1.13.0
OS: Linux, 4.17.2-1.el7.elrepo.x86_64
If the server side of a GELF TCP connection is slow for some reason (for example, it is overloaded by some other apps), written data fills the TCP buffers on both the sender and the receiver side of the connection. TCP back-pressure kicks in and the sender cannot push data into its socket. With a non-blocking NIO socket, channel().write() will return 0.
GelfTCPSender#write has the following logic:
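Roughly like this (a simplified sketch of the pattern in question, not the verbatim GelfTCPSender source):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// Simplified sketch of the pattern, not the verbatim GelfTCPSender source.
class SpinningWriter {

    private final SocketChannel channel;

    SpinningWriter(SocketChannel channel) {
        this.channel = channel;
    }

    void write(ByteBuffer buffer) throws IOException {
        while (buffer.hasRemaining()) {
            // On a non-blocking channel, write() returns 0 when the socket send
            // buffer is full (TCP back-pressure). The loop then retries immediately,
            // spinning and burning CPU until the peer reads data.
            int written = channel.write(buffer);
            if (written == 0) {
                continue; // no pause, no yield: busy loop
            }
        }
    }
}
```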
For the described use case, the logic effectively works as a spin loop which burns CPU on the sender host until the Graylog server reads data on its side of the TCP connection. Some backoff is needed here when written == 0.
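A minimal sketch of what such a backoff could look like (illustrative only, with hypothetical delay values; the concrete change belongs to the MR mentioned above):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

// Illustrative sketch only, not the actual fix.
class BackoffWriter {

    // Hypothetical delay values, purely for illustration.
    private static final long INITIAL_BACKOFF_NANOS = TimeUnit.MILLISECONDS.toNanos(1);
    private static final long MAX_BACKOFF_NANOS = TimeUnit.MILLISECONDS.toNanos(100);

    private final SocketChannel channel;

    BackoffWriter(SocketChannel channel) {
        this.channel = channel;
    }

    void write(ByteBuffer buffer) throws IOException {
        long backoffNanos = INITIAL_BACKOFF_NANOS;

        while (buffer.hasRemaining()) {
            int written = channel.write(buffer);
            if (written == 0) {
                // TCP back-pressure: pause instead of spinning, growing the delay
                // exponentially up to a cap.
                LockSupport.parkNanos(backoffNanos);
                backoffNanos = Math.min(MAX_BACKOFF_NANOS, backoffNanos * 2);
            } else {
                // Progress was made; reset the backoff.
                backoffNanos = INITIAL_BACKOFF_NANOS;
            }
        }
    }
}
```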