nicholasdavidson / pybit

Python Build Integration Toolkit - a distributed cross platform AMQP based build system

"Unable to declare or bind to command channel." #110

Closed: jamesbennet closed this 11 years ago

jamesbennet commented 11 years ago

Someone was doing PAT testing, which meant two boxes got powered off unceremoniously.

When they came back up, they logged "Unable to declare or bind to command channel." and ended up stuck between states. I had to reset RabbitMQ to remove the lock on their private queue.

jamesbennet commented 11 years ago

Here is some sanitised log output detailing what happened:

2013-01-25 16:45:37,938 Moved from CLEAN to IDLE
[So it was working fine late on Friday. The power outage happened sometime late on the 28th.]
1970-01-01 00:00:30,255 Daemonised
1970-01-01 00:00:30,757 I: Running build client.
1970-01-01 00:00:30,822 List of available handlers: ['svn', 'git', 'apt']
1970-01-01 00:00:30,824 List of available distributions: ['Debian']
1970-01-01 00:00:30,825 Using Debian build client
1970-01-01 00:00:30,826 Moved from UNKNOWN to IDLE
1970-01-01 00:00:30,964 Open OK! known_hosts [snip]
1970-01-01 00:00:30,966 using channel_id: 1
1970-01-01 00:00:30,970 Channel open
1970-01-01 00:00:30,971 using channel_id: 2
1970-01-01 00:00:30,975 Channel open
1970-01-01 00:00:30,980 Creating queue with name:Debian_armel_development_deb
1970-01-01 00:00:30,985 Creating queue with name:Debian_armel_illgill_deb
1970-01-01 00:00:30,990 Creating private command queue with name: [snip]
[Then this happens]
1970-01-01 00:00:30,995 Closed channel #1
1970-01-01 00:00:30,996 Unable to declare or bind to command channel.
[Then I guess NTP kicked in]
2013-01-28 16:24:52,965 Starting new HTTP connection (1): [snip]
2013-01-28 16:24:57,983 "GET /job/14/status HTTP/1.1" 200 136
2013-01-28 16:24:58,182 Marking JOB id: 14 as: Building
2013-01-28 16:24:58,193 Running: svn export [snip]
2013-01-28 16:24:58,207 Starting new HTTP connection (1): [snip]
2013-01-28 16:24:58,248 "PUT /job/14 HTTP/1.1" 200 20
2013-01-28 16:24:58,256 Moved from IDLE to CHECKOUT
2013-01-28 16:24:58,261 Closed channel #2
2013-01-28 16:24:59,439 Open OK! known_hosts [snip]
2013-01-28 16:24:59,441 using channel_id: 1
2013-01-28 16:24:59,445 Channel open
2013-01-28 16:24:59,491 Closed channel #108 

No more output after that point. The daemon was running but stalled in between states. It should have bailed to fatal error state and halted as soon as it got "Unable to declare or bind to command channel.", rather than try and check out a job and bork it.
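To make the expected behaviour concrete, here is a minimal sketch of a client that moves to a fatal error state when the command queue cannot be declared, and then refuses to pick up jobs. It is not pybit's actual code: `BuildClient`, `CommandChannelError`, the `declare_command_queue` callable and the `FATAL_ERROR` state name are all assumptions made up for illustration. Only `UNKNOWN`, `IDLE` and `CHECKOUT` appear in the log above.

```python
from enum import Enum


class State(Enum):
    UNKNOWN = "UNKNOWN"
    IDLE = "IDLE"
    CHECKOUT = "CHECKOUT"
    FATAL_ERROR = "FATAL_ERROR"  # assumed name for the "fatal error state"


class CommandChannelError(Exception):
    """Hypothetical stand-in for the AMQP error seen when the private
    command queue cannot be declared or bound."""


class BuildClient:
    def __init__(self, declare_command_queue):
        # declare_command_queue is a hypothetical callable that raises
        # CommandChannelError when the broker rejects the declaration.
        self.state = State.UNKNOWN
        self._declare_command_queue = declare_command_queue

    def start(self):
        """Declare the command queue; halt in FATAL_ERROR if that fails."""
        try:
            self._declare_command_queue()
        except CommandChannelError:
            self.state = State.FATAL_ERROR
            return False
        self.state = State.IDLE
        return True

    def take_job(self, job_id):
        """Only an IDLE client may move on to CHECKOUT."""
        if self.state is not State.IDLE:
            raise RuntimeError("client is %s, not IDLE" % self.state.value)
        self.state = State.CHECKOUT
        return job_id
```

The point of the sketch is the ordering: the state only becomes IDLE after the command queue is known to be usable, so a failed declaration can never be followed by a job checkout.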

This has happened before, but I thought we fixed it?

jamesbennet commented 11 years ago

Yes, in __init__.py, connect() returns False if it gets an AMQPChannelException, but then I can't understand what happens next... what does the magic __enter__() do?
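For reference on the question above: in Python, `__enter__()` is the method a `with` statement calls before running its body, and whatever it returns is bound to the `as` target; `__exit__()` is called when the body finishes or raises. The sketch below shows how a `False` return from `connect()` could be turned into a hard failure at that point. It does not reproduce what pybit's own `__enter__()` does. `Client`, `ChannelError` and the `broker_ok` flag are hypothetical names used only to simulate the failure.

```python
class ChannelError(Exception):
    """Hypothetical stand-in for AMQPChannelException."""


class Client:
    def __init__(self, broker_ok):
        # broker_ok is a made-up flag that simulates whether the broker
        # accepts the command queue declaration.
        self.broker_ok = broker_ok
        self.connected = False

    def connect(self):
        """Return True on success, False if the channel error occurs."""
        try:
            if not self.broker_ok:
                raise ChannelError(
                    "Unable to declare or bind to command channel.")
            self.connected = True
            return True
        except ChannelError:
            return False

    def disconnect(self):
        self.connected = False

    def __enter__(self):
        # Runs before the body of `with Client(...) as c:`.
        # If the False from connect() were ignored here, the body would
        # still run with a client that never connected.
        if not self.connect():
            raise RuntimeError("connect() failed; not entering the with-block")
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs when the body finishes or raises.
        self.disconnect()
        return False  # do not swallow exceptions from the body
```

Used as `with Client(broker_ok=False) as c: ...`, the body is never executed, because the exception is raised from `__enter__()` before the block starts.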

jamesbennet commented 11 years ago

This is fixed; the clients were not up to date. Still buggyish though.