logstash-plugins / logstash-output-tcp

Apache License 2.0

Backport TCP window, client closing fixes to 6.0.x #50

Closed yaauie closed 1 year ago

yaauie commented 1 year ago

Because 6.1.0 changed this plugin's dependency on Logstash to require Logstash 8 and we still commercially support Logstash 7.13+, I have branched from 6.0.2 and am backporting relevant fixes to the 6.0.x series.

This is not typical semver, and I would like to find a way to separately get a 6.1.x release of this plugin running on Logstash 7, even if that means shipping without one or more of the features introduced in the 6.1 series of this plugin (e.g., it works unless you try to use the new feature on LS 7).


Here, we manually backport two main fixes, each in its own commit for clarity:

We also:

Unfortunately, in addition to breaking compatibility with LS 7.x, #47 also introduced two bugs that were fixed in #49 (repeating thread names, logger access), plus one more that is first addressed here (a crash while closing the plugin) and will need to be forward-ported, so the lines are muddled.

I have elected to group whole fixes by commit as much as possible, instead of doing true backports.

yaauie commented 1 year ago

CI is failing intermittently for the cross-ported fixes for server mode, and I am having a very difficult time figuring out why as everything is very consistently green in local docker tests (on my M1 Mac) with identical parameters. Every single IO we have control over is written in a way that makes it tolerant to partial-writes, and still we are occasionally seeing silent truncation on the 100MB test.

Since 3a6113da7f0dd3a598fdf70ec72b35674cb1cd0c all of the failures have been in the Logstash 7 series, so I am beginning to think that I have found a bug in JRuby.

yaauie commented 1 year ago

🤦🏼 when I reduced the amount of memory available to my local docker, I was able to reproduce.

~In the test code I was reading until EOF instead of until the IO was closed, and the memory pressure slowed down reads enough that I hit EOF before the writer was done writing.~

EDIT: that wasn't it either. Because server-mode's Outputs::TCP#receive does not block, we need to wait to send close to the plugin until the payload has "probably" been fully written. We do this by blocking until the StringIO that is receiving the payload is approximately the right size.
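The wait described above can be sketched as follows. This is an illustrative helper, not the plugin's actual test code; the names `await_payload`, `expected_size`, and the timeout values are assumptions.

```ruby
require 'stringio'

# Block until the IO accumulating the payload has grown to (at least) the
# expected size, or until a deadline passes. Returns true if the expected
# size was reached. (Hypothetical helper, illustrating the approach only.)
def await_payload(io, expected_size, timeout: 10, interval: 0.01)
  deadline = Time.now + timeout
  sleep(interval) while io.size < expected_size && Time.now < deadline
  io.size >= expected_size
end

received = StringIO.new
writer = Thread.new do
  10.times { received << ("x" * 1024) } # simulate an incremental writer
end

ok = await_payload(received, 10 * 1024) # only "close" once this returns true
writer.join
```

Only after the wait succeeds would the test hand `close` to the plugin, avoiding the silent truncation seen under memory pressure.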

andsel commented 1 year ago

This is a follow-up thought I had while reviewing the code; it is more related to the plugin implementation than to this PR. I would really appreciate your thoughts.

I think we have a problem in the shutdown of client threads in server-mode. Each thread executes a busy loop in Client#run, without any guard or exit condition, mainly doing:

loop do
  begin
    payload = @queue.pop
    @socket.write(payload)
  rescue
    break
  end
end

The close method of the Client is:

def close
  @socket.close
end

The client threads could be blocked in 2 places: waiting in @queue.pop for the next payload, or inside @socket.write against a slow or unresponsive peer.
So my point is that we need to improve the shutdown sequence, interrupting the thread and joining it.
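One possible shape for such a shutdown sequence is sketched below. This is not the plugin's code; the `SHUTDOWN` sentinel and the 5-second join timeout are illustrative assumptions. The idea is to wake a thread blocked in `@queue.pop` with a sentinel, join it with a bound, and close the socket to unblock a thread stuck in `@socket.write`.

```ruby
require 'socket'

class Client
  SHUTDOWN = Object.new # sentinel that unblocks a pending pop (hypothetical)

  def initialize(socket)
    @socket = socket
    @queue  = Queue.new
    @thread = Thread.new { run }
  end

  def write(payload)
    @queue.push(payload)
  end

  def run
    loop do
      begin
        payload = @queue.pop
        break if payload.equal?(SHUTDOWN) # observed exit condition
        @socket.write(payload)
      rescue
        break
      end
    end
  end

  def close
    @queue.push(SHUTDOWN) # wakes a thread blocked in @queue.pop
    @thread.join(5)       # bounded wait for the loop to exit
    @socket.close         # unblocks a thread stuck in @socket.write, if any
  end
end
```

Because the queue is FIFO, any payloads enqueued before `close` are still written before the sentinel is observed.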

yaauie commented 1 year ago

So my point is that we need to improve the shutdown sequence, interrupting the thread and joining it.

I agree. There are a lot of places in this plugin that need improvement; in server mode, for example, the queue for each client is unbounded. Lots of room for improvement.
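As a sketch of the unbounded-queue fix (not the plugin's current code; the capacity of 2 is only for illustration), Ruby's stdlib SizedQueue makes the producer block once the queue is full, so a slow client applies backpressure instead of growing memory without limit:

```ruby
capacity = 2
queue = SizedQueue.new(capacity)

queue.push("a")
queue.push("b")
producer = Thread.new { queue.push("c") } # blocks: queue is at capacity
sleep 0.05
blocked = producer.alive?                 # still waiting for a free slot
queue.pop                                 # consumer frees a slot
producer.join                             # the blocked push now completes
```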

In the end, though, this PR makes the plugin significantly better than it was before and I would rather deliver these improvements immediately and chase down the further fixes separately.


It's not clear to me why we need to sleep 10ms if the payload contains data

You are correct; both IO#write and IO#syswrite block until the underlying IO is writable, so the backoff sleep isn't strictly necessary. I'll back that out.
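With the backoff removed, the drain loop reduces to something like the sketch below (illustrative only; the nil end-of-stream sentinel is an assumption, not the plugin's convention). IO#write already blocks while the kernel send buffer is full, so no pacing sleep is needed:

```ruby
# Pop payloads and write them; IO#write blocks until each payload is fully
# written, handling partial writes internally. A nil payload ends the loop.
def drain(queue, socket)
  while (payload = queue.pop)
    socket.write(payload)
  end
end

rd, wr = IO.pipe
q = Queue.new
q.push("abc")
q.push(nil) # end-of-stream sentinel (hypothetical)
t = Thread.new { drain(q, wr) }
t.join
wr.close
```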


I'll then merge this PR, forward-port the relevant bits to the 6.1 series, and then file separate work to chase down these other issues (including, hopefully, making 6.1 compatible with Logstash 7.x again so that we don't have to maintain two significantly-diverged branches).

andsel commented 1 year ago

I'm OK with merging this PR as it is. A lot of work has been done!

My last comment was intended to open a discussion on what I saw while reviewing it, but that's not related to this specific PR.