rabbitmq / rabbitmq-amqp1.0

AMQP 1.0 support for RabbitMQ
https://www.rabbitmq.com/
Other
93 stars 20 forks source link

heartbeat frames seem to cause connection failure #6

Closed mbroadst closed 9 years ago

mbroadst commented 9 years ago

I'm working on a node amqp client and trying to run integration tests against rabbitmq, but it seems every time the internal heartbeat timer sends an empty (heartbeat) message rabbitmq immediately closes the connection, resulting in a local ECONNRESET.

Section 2.4.5 of the spec indicates that "Implementations MUST be prepared to handle empty frames arriving on any valid channel.." so this seems like a bug to me if that is indeed what is happening. To that end, I have no idea if that's actually what's happening because there isn't any trace information in the local rabbitmq log. Perhaps you could direct me as to how I might get more information?

simonmacmullen commented 9 years ago

I believe we should handle AMQP 1.0 heartbeats. I could also believe that the code to do so might never have been tested.

We should be logging any crash or failure; if there is nothing at all in the main or sasl log then that is itself a bug.

I don't know anything about node.js, so in order to help look at this from the RabbitMQ end it would be really helpful to me if you could post a short set of instructions on how to run your client from scratch to expose this.

mbroadst commented 9 years ago

@simonmacmullen thanks for the reply, so the raw hex being sent over the wire to cause this bug is "0000000802000000" if that helps with some sort of internal rabbit test. As for setting up a simple test, you can follow these steps:

At this point (if things are configured properly and the rabbitmq server is reachable) a whole bunch of debug data will be spewed out at you, and "connected" will show up at the bottom. After a short period of time you'll see a message indicating that it sent the heartbeat, and the corresponding ECONNRESET socket error

mbroadst commented 9 years ago

@simonmacmullen I was looking for some unit tests I could perhaps add the raw data to, but it doesn't seem to exist per se. Do you have any unit tests I could look at to possibly submit a broken test case for?

simonmacmullen commented 9 years ago

Unfortunately unit tests for the AMQP 1.0 plugin are quite weak.

Anyway, this was easy enough to fix. I could replicate it with your test, but I did see errors in the SASL log when it crashed on receiving a heartbeat - didn't you?

Heartbeats now look like they work, woo!

mbroadst commented 9 years ago

@simonmacmullen I must have missed the sasl log, hrm. Well I'm glad you solved it! As I mentioned on the mailing list, I intend to create a suite of integration tests for node-amqp10 (as it shall soon be called) running against rabbitmq on travis, so hopefully that can establish a level of consistency for the plugin

mbroadst commented 9 years ago

@simonmacmullen hm looks like I'm still getting a crash with the nightly build, albeit a new one this time:

=CRASH REPORT==== 24-Mar-2015::09:38:08 ===
  crasher:
    initial call: rabbit_reader:init/2
    pid: <0.2560.0>
    registered_name: []
    exception error: no function clause matching 
                     rabbit_amqp1_0_reader:recvloop([],
                                                    {{v1,<0.2558.0>,
                                                      #Port<0.9056>,
                                                      {connection,
                                                       {user,<<"admin">>,
                                                        [administrator],
                                                        [{rabbit_auth_backend_internal,
                                                          none}]},
                                                       3,4294967295,
                                                       {<<"PLAIN">>,
                                                        rabbit_auth_mechanism_plain},
                                                       [],
                                                       {utf8,
                                                        <<"192.168.1.104">>}},
                                                      {frame_header_1_0,amqp},
                                                      8,false,running,
                                                      <0.2562.0>,
                                                      {<0.2563.0>,<0.2564.0>},
                                                      <0.2559.0>,<0.2561.0>,
                                                      [<<>>],
                                                      0,
                                                      {throttle,[],none,
                                                       never}},
                                                     {frame_header_1_0,amqp},
                                                     8}) 
      in function  rabbit_reader:run/1 
      in call from rabbit_reader:start_connection/5 
    ancestors: [<0.2558.0>,rabbit_tcp_client_sup,rabbit_sup,<0.143.0>]
    messages: [{'EXIT',#Port<0.9056>,normal}]
    links: [<0.2558.0>]
    dictionary: [{{channel,1},{ch_fr_pid,<0.2567.0>}},
                  {{credit_from,<0.2567.0>},197},
                  {process_name,
                      {rabbit_reader,
                          <<"192.168.1.100:57414 -> 192.168.1.104:5672">>}},
                  {{ch_fr_pid,<0.2567.0>},{channel,1}},
                  {{ch_sup_pid,<0.2565.0>},
                   {{channel,1},{ch_fr_pid,<0.2567.0>}}}]
    trap_exit: true
    status: running
    heap_size: 987
    stack_size: 27
    reductions: 4540
  neighbours:

=SUPERVISOR REPORT==== 24-Mar-2015::09:38:08 ===
     Supervisor: {<0.2558.0>,rabbit_connection_sup}
     Context:    child_terminated
     Reason:     function_clause
     Offender:   [{pid,<0.2560.0>},
                  {name,reader},
                  {mfargs,{rabbit_reader,start_link,[<0.2559.0>]}},
                  {restart_type,intrinsic},
                  {shutdown,4294967295},
                  {child_type,worker}]

=SUPERVISOR REPORT==== 24-Mar-2015::09:38:08 ===
     Supervisor: {<0.2558.0>,rabbit_connection_sup}
     Context:    shutdown
     Reason:     reached_max_restart_intensity
     Offender:   [{pid,<0.2560.0>},
                  {name,reader},
                  {mfargs,{rabbit_reader,start_link,[<0.2559.0>]}},
                  {restart_type,intrinsic},
                  {shutdown,4294967295},
                  {child_type,worker}]
simonmacmullen commented 9 years ago

@mbroadst: that's the same issue. It looks like nightlies have not actually succeeded for the last week or so (and they're on machines on a network which is quite flaky and which I can't currently access...)

mbroadst commented 9 years ago

@simonmacmullen hrm thats strange, I don't recall seeing any errors on heartbeat previously (though we identified that I might have missed on in the sasl log). This was from the main rabbit log. Anyway, if it's as you say I'll keep an eye out for a new nightly build, thanks!

mbroadst commented 9 years ago

@simonmacmullen any idea when the nightlies will be back up? Do you guys have a script for building the release debs (iirc the rabbitmq-public-umbrella's make release doesn't seem to make the exact artifact)

dumbbell commented 9 years ago

Hi! I just answered you on IRC, but for the sake of completeness, I do it again here: the Pivotal internal network is under maintenance for several days and nightlies are not built at this time.

mbroadst commented 9 years ago

@simonmacmullen finally got around to building from source, and it looks like this indeed fixed. thanks.