jlouis / turtle

A wrapper on the RabbitMQ Erlang Client (Erlang)

Handle other connection failure reasons #10

Closed timclassic closed 8 years ago

timclassic commented 8 years ago

I often receive {error,{socket_closed_unexpectedly,'connection.start'}} when calling amqp_connection:start/1 against RabbitMQ 3.5.6 while it is starting up or shutting down. This change adds a fallback retry for this and other unexpected failure reasons instead of crashing.

Without this change, starting and stopping my RabbitMQ nodes while turtle was pointed at them would reliably bring down the whole Erlang node. On my systems the issue is quite easy to reproduce if I enable HiPE compilation at startup, but it also seems easy to reproduce during shutdown.
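Roughly, the shape of the fallback I have in mind is the sketch below; the function names are placeholders rather than the actual turtle_conn code, and only amqp_connection:start/1 is the real client call:

```erlang
%% Sketch only, not the actual turtle_conn code.
connect(Params) ->
    case amqp_connection:start(Params) of
        {ok, Conn} ->
            {ok, Conn};
        {error, econnrefused} ->
            retry(econnrefused);
        {error, {socket_closed_unexpectedly, 'connection.start'} = Reason} ->
            retry(Reason);
        {error, Reason} ->
            %% Fallback clause: log the unexpected reason, then retry anyway.
            error_logger:warning_msg("Unexpected AMQP connect failure: ~p~n", [Reason]),
            retry(Reason)
    end.

retry(_Reason) ->
    %% In the real process this would schedule another connection attempt.
    erlang:send_after(5000, self(), reconnect),
    retrying.
```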

timclassic commented 8 years ago

My current theory is that my socket_closed_unexpectedly error may be due to Docker networking weirdness while starting and stopping containers. I wonder if the container accepts the connection because the RabbitMQ port is exposed, but then closes it as soon as data starts flowing, since RabbitMQ is not yet available inside the container.

jlouis commented 8 years ago

This looks like a good change we should support.

I wonder if it is better to explicitly handle the errors we know about rather than using a catch-all in the retry loop. That way, we will terminate the node quickly if something we did not anticipate as programmers goes amiss. On the other hand, as long as we log the troublemaker, it is probably good enough.

jlouis commented 8 years ago

I'll merge some other changes and then get around to bumping turtle to a new version.

timclassic commented 8 years ago

I also thought about this for a while. I agree that this is slightly defensive, but if we trust that amqp_client will not hide crashable offenses in {error, Reason} tuples (e.g. {error, {badmatch, foo}} or some such nonsense), then I also feel this is good enough.

As a user of this library, I expected turtle_conn to handle arbitrary network oddities outside the local VM and just keep retrying. I'm sure the invariant described in the failure modes section of README.md had something to do with this expectation.

Hmm, I just tested some more and found that RabbitMQ authentication failures are caught by this new clause and the system retries. In my environment this is desired, though perhaps this case deserves its own error message. Thoughts?
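Something like the following, perhaps; the {auth_failure, _} shape is my assumption about what amqp_client returns, and the function name is a placeholder:

```erlang
%% Hypothetical refinement: give authentication failures their own,
%% clearer log line while still retrying, and keep the generic fallback
%% for everything else.
handle_connect_error({auth_failure, Msg}) ->
    error_logger:error_msg("AMQP authentication failed: ~p; retrying~n", [Msg]),
    retry;
handle_connect_error(Reason) ->
    error_logger:warning_msg("AMQP connection failed: ~p; retrying~n", [Reason]),
    retry.
```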

Finally, @jlouis, thank you for the merge and the library!

jlouis commented 8 years ago

Ok, my spidey sense is starting to tingle. Between the general lack of OTP'ness in the amqp_client and what you are writing, I'd very much accept a patch that targets the error more specifically. I have a hunch that it will be better, though I can only go by intuition on this.

And indeed, Turtle has served us pretty well at ShopGun :)

timclassic commented 8 years ago

Are you getting tingles due to my authentication failure discovery? I'm happy to do some digging and see if I can identify the realistic set of Reasons we may get when a connection fails, and enumerate them in the case clause.

I fear crashes in this location because they reliably brought down the turtle app and halted the Erlang node once turtle_conn had restarted 5 times in quick succession, the strange network condition lasting longer than the restart window. You could certainly argue that this is desired and that an external supervisor should reliably restart the VM. How is this handled in your environment?
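For reference, the halt I saw is just ordinary OTP restart intensity kicking in, roughly like the sketch below (illustrative figures and child spec, not turtle's actual supervisor):

```erlang
init([]) ->
    %% If the child crashes more than `intensity` times within `period`
    %% seconds, the supervisor gives up and terminates itself; once that
    %% reaches the application's top supervisor the application stops,
    %% and a permanent application takes the node down with it.
    SupFlags = #{strategy  => one_for_one,
                 intensity => 5,
                 period    => 10},
    Child = #{id      => turtle_conn,
              start   => {turtle_conn, start_link, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.
```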

(As an aside, I intend that my environment be as OTPish as possible. After all, that's what drew me to Turtle in the first place. Is it my desire for retries on auth failures that makes you suspect that it is not, or is it something else? I always appreciate advice in this area.)

So that I know what you're thinking: can you think of any detectable conditions under which you would prefer that turtle_conn crash?

jlouis commented 8 years ago

It isn't only your findings; it's a general concern. If you have the time to go through the code path in the client and figure out what else can happen, that would be pretty good.

The guiding principle of the reconnect logic has been: "is it permanent?" If the error is permanent, it errors out and kills the tree. If the error is transient, on the other hand, it keeps retrying. Typical transient errors are connection timeouts and connection attempts where there is no listener at the server end; most of these are due to RabbitMQ rebooting or a transient network error, so we keep the system alive. A wrongly configured password, however, is a permanent error. I think the error you saw is in the transient class, but I may be wrong.
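Sketched out, the split looks something like this; the reason shapes are guesses rather than an exhaustive list from amqp_client:

```erlang
%% Transient errors: keep retrying. Permanent errors: give up and let the
%% supervision tree deal with it.
classify(timeout)                                          -> transient;
classify(econnrefused)                                     -> transient;
classify({socket_closed_unexpectedly, 'connection.start'}) -> transient;
classify({auth_failure, _Msg})                             -> permanent;  %% e.g. a wrong password
classify(_Unknown)                                         -> permanent.  %% fail fast on surprises

handle_failure(Reason, State) ->
    case classify(Reason) of
        transient -> {noreply, schedule_reconnect(State)};
        permanent -> {stop, {connection_failed, Reason}, State}
    end.

schedule_reconnect(State) ->
    erlang:send_after(5000, self(), reconnect),
    State.
```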