streadway / amqp

Go client for AMQP 0.9.1
http://godoc.org/github.com/streadway/amqp
BSD 2-Clause "Simplified" License

Cluster node serving a queue bouncing causes methods to hang #343

Closed doodle-tnw closed 5 years ago

doodle-tnw commented 6 years ago

We have a 5 node RabbitMQ cluster and a number of go apps using it for pub/sub.

The pub/sub code closely follows the pub/sub example, and if there is an error or consuming stops, it tries to consume on another session. If I take down the RabbitMQ broker in the cluster that serves the queue, the app just hangs at channel.Consume (previously it hung at channel.QueueDeclare and channel.QueueBind, but those calls now run only on the first connection).

        fmt.Println("Before consume")
        deliveries, err := sub.Consume(
            c.QueueName, // queue
            "",          // consumer
            false,       // auto-ack
            false,       // exclusive
            false,       // no-local
            false,       // no-wait
            nil,         // args
        )
        if err != nil {
            fmt.Println("failed at sub.consume")
            return fmt.Errorf("Cannot consume from: %q, %v", c.QueueName, err)
        }

        utils.Log("amqp: subscribed to events from: %s", c.Exchange)

In the logs I see the "Before consume" line but neither the subscribed message nor the error message. The call never seems to return.

This was happening earlier on calls to channel.QueueDeclare and channel.QueueBind, but as I said, I moved them so that they only run on the first connection.

The only explanation I can think of is that, because the queue lived on the broker that was taken down, the broker is not back up yet when channel.Consume is called.

Any help greatly appreciated
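
One way to confirm where the call blocks, and to keep the retry loop moving while investigating, is to bound the blocking call with a timer. The following is a minimal sketch using only the standard library; `withTimeout` and `errTimeout` are hypothetical names, not part of streadway/amqp, and the closure passed in stands in for a call such as `sub.Consume(...)`. Note that it does not abort the underlying call: the goroutine stays blocked until the call returns, so on timeout the caller would still need to close the channel or connection.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout is a hypothetical sentinel error for this sketch.
var errTimeout = errors.New("call timed out")

// withTimeout runs fn in its own goroutine and waits at most d for it.
// It does not cancel fn: if fn never returns, its goroutine stays blocked,
// so the caller should close the underlying channel or connection on timeout.
func withTimeout(d time.Duration, fn func() error) error {
	done := make(chan error, 1) // buffered so a late result never blocks the sender
	go func() { done <- fn() }()

	timer := time.NewTimer(d)
	defer timer.Stop()

	select {
	case err := <-done:
		return err
	case <-timer.C:
		return errTimeout
	}
}

func main() {
	// Stand-ins for a blocking broker call such as sub.Consume(...).
	slow := func() error { time.Sleep(200 * time.Millisecond); return nil }
	fast := func() error { return nil }

	fmt.Println("slow call:", withTimeout(50*time.Millisecond, slow))
	fmt.Println("fast call:", withTimeout(50*time.Millisecond, fast))
}
```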

michaelklishin commented 5 years ago

Details matter a lot here, and we don't have server logs or a traffic capture, so I have to guess. There are two common scenarios: the node that went down either hosted the queue master or was the node this client was connected to. In the latter case your app must be ready to recover, as demonstrated in the examples.
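
As a rough illustration of what "ready to recover" can look like with several cluster nodes, the sketch below tries each node in turn and backs off between rounds. It is not taken from the library's examples: `connectAny`, `dialFunc`, `errDown`, and the node addresses are hypothetical, and `dialFunc` stands in for whatever opens a connection and channel against one node (for instance a wrapper around `amqp.Dial`).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDown is a hypothetical error used by the demo dialer below.
var errDown = errors.New("node down")

// dialFunc is a hypothetical stand-in for a function that opens a connection
// and channel against one node, e.g. a wrapper around amqp.Dial.
type dialFunc func(addr string) error

// connectAny tries every address once per round and returns the first address
// that accepts the connection. Between rounds it sleeps with an exponential
// backoff that starts at base and is capped at max.
func connectAny(addrs []string, dial dialFunc, rounds int, base, max time.Duration) (string, error) {
	lastErr := errors.New("no address tried")
	delay := base
	for r := 0; r < rounds; r++ {
		for _, addr := range addrs {
			if err := dial(addr); err != nil {
				lastErr = err
				continue
			}
			return addr, nil
		}
		if r < rounds-1 {
			time.Sleep(delay)
			delay *= 2
			if delay > max {
				delay = max
			}
		}
	}
	return "", fmt.Errorf("no node reachable after %d rounds: %w", rounds, lastErr)
}

func main() {
	// Hypothetical node addresses; the first one is treated as down.
	addrs := []string{"rabbit-1:5672", "rabbit-2:5672"}
	dial := func(addr string) error {
		if addr == "rabbit-1:5672" {
			return errDown
		}
		return nil
	}

	addr, err := connectAny(addrs, dial, 3, 10*time.Millisecond, 100*time.Millisecond)
	fmt.Println(addr, err)
}
```

After a successful reconnect, the app would redeclare whatever topology it needs and call Consume again on the new channel.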

If a queue master fails and you were consuming from a different node, modern RabbitMQ versions will re-register the consumer after electing a new master, although whether that happens depends on the queue settings documented in the mirroring guide. Consumer cancellation notifications are also relevant here.
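
To show how a consumer might react to a cancellation notification, here is a minimal sketch of a consume loop that watches a cancel channel alongside the delivery channel. It uses plain Go channels so that it runs on its own: with streadway/amqp, `deliveries` would come from `Channel.Consume` and `cancels` from `Channel.NotifyCancel`, and plain strings stand in for `amqp.Delivery`. The names `consumeLoop`, `errCancelled`, and `errClosed` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors for this sketch.
var (
	errCancelled = errors.New("consumer cancelled by broker")
	errClosed    = errors.New("delivery channel closed")
)

// consumeLoop handles messages until the broker cancels the consumer or the
// delivery channel is closed. Either error is a signal to re-subscribe.
func consumeLoop(deliveries <-chan string, cancels <-chan string, handle func(string)) error {
	for {
		select {
		case msg, ok := <-deliveries:
			if !ok {
				return errClosed
			}
			handle(msg)
		case tag, ok := <-cancels:
			if !ok {
				// Notification channel closed: stop selecting on it and
				// wait for the delivery channel to close as well.
				cancels = nil
				continue
			}
			return fmt.Errorf("%w (consumer tag %q)", errCancelled, tag)
		}
	}
}

func main() {
	deliveries := make(chan string, 1)
	cancels := make(chan string, 1)
	deliveries <- "event-1"

	// The handler simulates the broker cancelling the consumer right after
	// the first message, so the demo terminates deterministically.
	err := consumeLoop(deliveries, cancels, func(msg string) {
		fmt.Println("handled:", msg)
		cancels <- "ctag-1" // hypothetical consumer tag
	})
	fmt.Println("loop ended:", err)
}
```

When the loop returns, the surrounding code can decide whether to call Consume again on the same channel or to rebuild the session from scratch.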