rabbitmq / rabbitmq-erlang-client

Erlang client for RabbitMQ
https://www.rabbitmq.com/
Other
184 stars 127 forks source link

Seqno reset to 1 on every confirm.select #27

Closed dumbbell closed 8 years ago

dumbbell commented 8 years ago

The Confirms extension specification states:

Once a channel is in confirm mode, both the broker and the client count messages (counting starts at 1 on the first confirm.select).

That's what RabbitMQ does in rabbit_channel.erl:

  1. The sequence number is set to 1 during channel init:

    init(...) ->
       % ...
       State = #ch{state                   = starting,
                   % ...
                   confirm_enabled         = false,
                   publish_seqno           = 1,
  2. When a confirm.select is received, a single flag is toggled and the sequence number is left untouched:

    handle_method(#'confirm.select'{nowait = NoWait}, _, State) ->
       return_ok(State#ch{confirm_enabled = true},
                 NoWait, #'confirm.select_ok'{});

Unfortunately, rabbitmq-erlang-client resets the seqno to 1 with every confirm.select:

handle_method_to_server(...) ->
        case ... of
            {ok, _, ok} ->
                State1 = case {Method, State#state.next_pub_seqno} of
                             {#'confirm.select'{}, _} ->
                                                 % ^ The current seqno is ignored
                                 State#state{next_pub_seqno = 1};

The obvious consequence is that a second confirm.select from a client on a channel will lead to an inconsistency between the broker and the client counters. The client will receive ACKs for messages it never published yet.

The code should only set the seqno to 1 if it's set to 0 (0 meaning that confirms are disabled).

dumbbell commented 8 years ago

The problem was discovered with make standalone-tests FILTER=eager_sync:eager_sync: the testcase hangs forever because it waits for confirms for messages in the range 1..2000 (its seqno was reset after the second confirm.select), whereas the broker sends ACKs for messages in the range 4001..6000.

What is weird is that Travis CI reproduces the hang 100% of the time. I can reproduce it 90% of the time. Jenkins never hits it. There must be a race in the client or the testcase which could hide the problem.