Race condition during a channel failure handling

dumbbell commented 8 years ago

While working on moving the testsuite to common_test (#39), I got transient failures with cases testing channel-level errors, especially:

shortstr_overflow_property
shortstr_overflow_field
channel_writer_death

The problem comes from a race condition in the supervision tree for a particular channel when using a network connection (as opposed to direct Erlang communication). The supervision tree looks like this:

amqp_channel_sup
|-- amqp_gen_consumer
|-- amqp_channel
`-- rabbit_writer

amqp_channel sends commands to rabbit_writer who then sends frames on the network. If an error occurs in rabbit_writer, it sends a channel_exit message to amqp_channel and exits with the reason normal. amqp_channel receives the notification and exits with an error. Then normal supervision tree termination continues.

After this kind of error, the connection can't be trusted anymore so it must be taken down. This is the repsonsibility of amqp_channels_manager which is part of a larger supervision tree dedicated to a connection:

amqp_connection_sup
|-- amqp_connection_type_sup
|   |-- amqp_channel_sup_sup
|   |   `-- amqp_channel_sup
|   |       `-- ... # The channel supervision tree described above
|   |-- amqp_channels_manager
|   |-- rabbit_writer
|   |-- amqp_main_reader
|   |-- rabbit_heartbeat
|   `-- rabbit_heartbeat
`-- amqp_gen_connection

So when a channel exits with an error, amqp_channels_manager terminates the connection.

However, this relies on the fact that amqp_channel_sup is notified of amqp_channel exit before rabbit_writer exit. If amqp_channel_sup notices the exit of rabbit_writer first (with reason normal), it terminates the supervision tree: amqp_channel exits with the reason shutdown and amqp_channels_manager considers this an expected termination. The connection is left alone, which is bad.

The issue is that rabbit_writer has its restart type set to intrinsic, which means that no matter its exit reason, if this child exits, the whole supervision tree must be terminated. This is wrong in this situation because rabbit_writer always sends a message to amqp_channel to notify its exit.

By using a restart type of transient, rabbit_writer still exits but the supervision tree is not taken down. Instead amqp_channel has time to receive the message from rabbit_writer and exits with an error.

michaelklishin commented 8 years ago

@dumbbell can this go into 3.6.x?

dumbbell commented 8 years ago

Yes, I believe so. I'm preparing the pull request.

dumbbell commented 8 years ago

I wonder if this issue is related to #17.

rabbitmq / rabbitmq-erlang-client

Race condition during a channel failure handling #42