Ensure IO#gets exits in a reasonable time

MikaelSmith commented 7 years ago

According to the notes on heartbeats, max_hbrlck_fails should be enabled in order to determine when heartbeats stop arriving. This is required because the receive thread of the MCO daemon spends most of its time waiting for something to come in over the wire. That ties up the "read lock" used to provide exclusive access to the underlying network socket.

However, the blocking receive should consume heartbeats and update the "last received" timestamp (@lr). The heartbeat read thread also consumes pings coming in over the wire in case the process is not blocking on receive.

Normal operation means that a missed hbrlck is normal, but two in a row should be abnormal (the heartbeat thread may try to read and get locked, but the receive thread should read the heartbeat before the heartbeat thread resumes - networking timing could impact this in bad ways - and reset the "last received" timestamp which will trigger resetting the read_lock_count and lock_fail_count the next time the heartbeat thread wakes).

When the lock_fail_count exceeds @max_hbrlck_fails, the heartbeat thread closes the socket and kills the heartbeat threads. The expectation is that this should cause the IO#gets call in the receive call to return. That doesn't happen until TCP timeouts occur, which can take minutes or hours. When the TCP timeout occurs, it appears to leave the main thread in an unexpected state that sometimes results in SIGABRT and is unreachable via the broker.

To exit gets early, trigger an exception in the receive thread. That interrupts the gets call and is caught in the transmit or __old_receive method and initiates a reconnect.

MikaelSmith commented 7 years ago

We've been introducing firewall rules to block communication with the ActiveMQ broker to simulate the disconnect. I'm not sure this is the best solution, but there are parts of the netio implementation that we're wary to change without better understanding.

MikaelSmith commented 7 years ago

This might actually be a non-issue with the dev branch. Testing it out.

Update: ok, I only encounter this problem on *nix (CentOS 7 right now), not Windows.

Not sure of a simple reproduction case. Currently using mcollective.

MikaelSmith commented 7 years ago

Some more context: it's not thread-safe to call Socket#close on an SSL socket in one thread while another thread is reading from it. So the core issue in Stomp is that it calls Socket#close while the main thread is waiting on a Socket#gets call. Thus triggering a segfault. @reidmv

Simpler reproduction case: https://gist.github.com/richardc/b5a1d4d28c8bbae24cd5

MikaelSmith commented 7 years ago

Thanks! Do you have any plans when you'll do the next release?

gmallard commented 7 years ago

That has already been done. 1.4.4 is on Rubygems.

MikaelSmith commented 7 years ago

Awesome! I didn't see a tag on the repo, so thought it hadn't been done.

stompgem / stomp

Ensure IO#gets exits in a reasonable time #141