Closed MikaelSmith closed 7 years ago
We've been introducing firewall rules to block communication with the ActiveMQ broker to simulate the disconnect. I'm not sure this is the best solution, but there are parts of the netio implementation that we're wary to change without better understanding.
This might actually be a non-issue with the dev branch. Testing it out.
Update: ok, I only encounter this problem on *nix (CentOS 7 right now), not Windows.
Not sure of a simple reproduction case. Currently using mcollective.
Some more context: it's not thread-safe to call Socket#close on an SSL socket in one thread while another thread is reading from it. So the core issue in Stomp is that it calls Socket#close while the main thread is waiting on a Socket#gets call. Thus triggering a segfault. @reidmv
Simpler reproduction case: https://gist.github.com/richardc/b5a1d4d28c8bbae24cd5
Thanks! Do you have any plans when you'll do the next release?
That has already been done. 1.4.4 is on Rubygems.
Awesome! I didn't see a tag on the repo, so thought it hadn't been done.
According to the notes on heartbeats,
max_hbrlck_fails
should be enabled in order to determine when heartbeats stop arriving. This is required because the receive thread of the MCO daemon spends most of its time waiting for something to come in over the wire. That ties up the "read lock" used to provide exclusive access to the underlying network socket.However, the blocking receive should consume heartbeats and update the "last received" timestamp (
@lr
). The heartbeat read thread also consumes pings coming in over the wire in case the process is not blocking on receive.Normal operation means that a missed hbrlck is normal, but two in a row should be abnormal (the heartbeat thread may try to read and get locked, but the receive thread should read the heartbeat before the heartbeat thread resumes - networking timing could impact this in bad ways - and reset the "last received" timestamp which will trigger resetting the
read_lock_count
andlock_fail_count
the next time the heartbeat thread wakes).When the
lock_fail_count
exceeds@max_hbrlck_fails
, the heartbeat thread closes the socket and kills the heartbeat threads. The expectation is that this should cause theIO#gets
call in the receive call to return. That doesn't happen until TCP timeouts occur, which can take minutes or hours. When the TCP timeout occurs, it appears to leave the main thread in an unexpected state that sometimes results in SIGABRT and is unreachable via the broker.To exit
gets
early, trigger an exception in thereceive
thread. That interrupts thegets
call and is caught in thetransmit
or__old_receive
method and initiates a reconnect.