njh / ruby-mqtt

Pure Ruby gem that implements the MQTT protocol, a lightweight protocol for publish/subscribe messaging.
http://www.rubydoc.info/gems/mqtt
MIT License
538 stars 135 forks source link

Prevent deadlock due to system call error #159

Closed leoarnold closed 3 months ago

leoarnold commented 10 months ago

The MQTT client runs in the main thread which consumes incoming MQTT messages from a Thread::Queue called @read_queue. This queue is fed by the @read_thread, a child thread which reads from a socket in an infinite loop.

We noticed that our application stopped processing incoming MQTT messages, but did not seem to exit or throw an exception either. Upon inspecting the logs, we saw that the @read_thread had crashed due to an unhandled Errno::ECONNRESET while reading from the socket. This had the consuming thread then sleep forever while waiting for new meassages on the @read_queue.

It turned out that the MQTT::Client#receive_packet method had appropriate error handling in place

def receive_packet
  # ...
rescue Exception
  # ...
end

but this did not rescue Errno::ECONNRESET even though Exception is at the top of the hierarchy of Ruby's built-in exception classes.

The root cause was that the library also defines a class MQTT::Exception and in the context of #receive_packet the constant Exception refers only to MQTT::Exception (which does not cover Errno::ECONNRESET) where it actually should rescue any subclass of ::Exception.

leoarnold commented 10 months ago

@phlegx FYI

blmundie commented 8 months ago

I'm running into a threaded processing lockup that seems to be a similar issue

bmorrall commented 4 months ago

Any follow up on getting this merged? This was a real problem for us a few months ago.

We found setting Thread.current.abort_on_exception = true would at least kill the listening process for us (and allow our system to reboot and rebuild the connection), but this fix is much cleaner and less janky.

njh commented 3 months ago

Thanks for this. The exception code has been wrong for a very long time and it is long overdue getting sorted out.

According to Semver, do you think this change is worthy of a Major, Minor or Patch increment? 🤔

leoarnold commented 3 months ago

Hey @njh, good to see this progressing!

Major version zero (0.y.z) is for initial development. Anything MAY change at any time. The public API SHOULD NOT be considered stable. https://semver.org/#spec-item-4

Since you're just merging fixes (I assume) I'd argue that is only a PATCH increment. If you're also taking #112, that one mandates a cautionary note in CHANGELOG.md.