While working on moving the testsuite to common_test (#39), I got transient failures with cases testing channel-level errors, especially:
- shortstr_overflow_property
- shortstr_overflow_field
- channel_writer_death
The problem comes from a race condition in the supervision tree for a particular channel when using a network connection (as opposed to direct Erlang communication). The supervision tree looks like this:
amqp_channel sends commands to rabbit_writer, which then sends frames on the network. If an error occurs in rabbit_writer, it sends a channel_exit message to amqp_channel and exits with the reason normal. amqp_channel receives the notification and exits with an error. Then normal supervision tree termination continues.
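The notify-then-exit sequence described above can be sketched roughly as follows. This is only an illustration: the module name, the `{error, Reason}` trigger, and the exact shape of the `channel_exit` tuple are assumptions, not the actual rabbit_writer code.

```erlang
-module(writer_exit_sketch).
-export([demo/0]).

%% Hypothetical stand-in for rabbit_writer: on error it first notifies
%% the channel process, then exits with reason normal.
writer_loop(Channel) ->
    receive
        {send_frame, _Frame} ->
            writer_loop(Channel);
        {error, Reason} ->
            %% Assumed message shape, for illustration only.
            Channel ! {channel_exit, self(), Reason},
            exit(normal)
    end.

%% The calling process plays the role of amqp_channel: it spawns the
%% writer, triggers an error, and collects both the notification and
%% the writer's exit reason.
demo() ->
    Channel = self(),
    {Writer, Ref} = spawn_monitor(fun() -> writer_loop(Channel) end),
    Writer ! {error, socket_closed},
    Notified = receive
                   {channel_exit, Writer, Reason} -> Reason
               after 1000 -> timeout
               end,
    ExitReason = receive
                     {'DOWN', Ref, process, Writer, Exit} -> Exit
                 after 1000 -> timeout
                 end,
    {Notified, ExitReason}.
```

Calling `writer_exit_sketch:demo()` returns `{socket_closed, normal}`: the error travels in the message, while the exit reason seen by anything monitoring or supervising the writer is only `normal`.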
After this kind of error, the connection can't be trusted anymore so it must be taken down. This is the responsibility of amqp_channels_manager which is part of a larger supervision tree dedicated to a connection:
So when a channel exits with an error, amqp_channels_manager terminates the connection.
However, this relies on amqp_channel_sup being notified of amqp_channel's exit before rabbit_writer's exit. If amqp_channel_sup notices the exit of rabbit_writer first (with reason normal), it terminates the supervision tree: amqp_channel exits with the reason shutdown and amqp_channels_manager considers this an expected termination. The connection is left alone, which is bad.
The issue is that rabbit_writer has its restart type set to intrinsic, which means that no matter its exit reason, if this child exits, the whole supervision tree must be terminated. This is wrong in this situation because rabbit_writer always sends a message to amqp_channel to notify its exit.
By using a restart type of transient, rabbit_writer still exits but the supervision tree is not taken down. Instead amqp_channel has time to receive the message from rabbit_writer and exits with an error.
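As a rough sketch, the change amounts to swapping the restart type in the writer's child specification. The module name, child id, start arguments and shutdown timeout below are placeholders rather than the client's real values, and I am assuming the supervisor is RabbitMQ's supervisor2, since `intrinsic` is not a restart type of the stock OTP supervisor.

```erlang
-module(channel_sup_sketch).
-export([writer_child_spec/1]).

%% Hypothetical child spec for the writer, in the classic tuple form
%% {Id, StartFunc, Restart, Shutdown, Type, Modules}.
%% WriterArgs is a placeholder for whatever the writer is started with.
writer_child_spec(WriterArgs) ->
    {writer,
     {rabbit_writer, start_link, WriterArgs},
     transient,        %% was: intrinsic
     5000,             %% placeholder shutdown timeout, in milliseconds
     worker,
     [rabbit_writer]}.
```

With `transient`, an exit with reason normal is not treated as a failure by the supervisor, so the writer's exit alone no longer tears the tree down.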