Closed by levinas 10 years ago
@cbun: jobs got restarted after this error message appeared on the compute server on elm:
Upload complete: /disks/arast/fangfang/ar-test-data/fangfang/209/325/325_report.txt
ERROR:pika.adapters.base_connection:Socket Error on fd 7: 104
WARNING:pika.adapters.blocking_connection:Received Channel.Close, closing: None
Process [Worker 1]::
Traceback (most recent call last):
File "/vol/kbase/runtime/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/vol/kbase/runtime/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/disks/arast/fangfang/assembly/lib/assembly/consume.py", line 376, in start
self.fetch_job()
File "/disks/arast/fangfang/assembly/lib/assembly/consume.py", line 350, in fetch_job
channel.start_consuming()
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 722, in start_consuming
self.connection.process_data_events()
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 88, in process_data_events
if self._handle_read():
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 184, in _handle_read
super(BlockingConnection, self)._handle_read()
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/base_connection.py", line 308, in _handle_read
self._on_data_available(data)
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/connection.py", line 1138, in _on_data_available
self._process_frame(frame_value)
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/connection.py", line 1193, in _process_frame
self._deliver_frame_to_channel(frame_value)
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/connection.py", line 843, in _deliver_frame_to_channel
return self._channels[value.channel_number]._handle_content_frame(value)
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/channel.py", line 776, in _handle_content_frame
self._on_deliver(*response)
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/channel.py", line 851, in _on_deliver
body)
File "/disks/arast/fangfang/assembly/lib/assembly/consume.py", line 372, in callback
ch.basic_ack(delivery_tag=method.delivery_tag)
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/channel.py", line 138, in basic_ack
return self._send_method(spec.Basic.Ack(delivery_tag, multiple))
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 920, in _send_method
self.connection.send_method(self.channel_number, method_frame, content)
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 120, in send_method
self._send_method(channel_number, method_frame, content)
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/connection.py", line 1331, in _send_method
self._send_frame(frame.Method(channel_number, method_frame))
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 245, in _send_frame
super(BlockingConnection, self)._send_frame(frame_value)
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/connection.py", line 1318, in _send_frame
self._flush_outbound()
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 205, in _flush_outbound
if self._handle_write():
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/base_connection.py", line 320, in _handle_write
return self._handle_error(error)
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/base_connection.py", line 264, in _handle_error
self._handle_disconnect()
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 181, in _handle_disconnect
self._on_connection_closed(None, True)
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 232, in _on_connection_closed
self._channels[channel]._on_close(method_frame)
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 817, in _on_close
self._send_method(spec.Channel.CloseOk(), None, False)
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 920, in _send_method
self.connection.send_method(self.channel_number, method_frame, content)
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 120, in send_method
self._send_method(channel_number, method_frame, content)
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/connection.py", line 1331, in _send_method
self._send_frame(frame.Method(channel_number, method_frame))
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 245, in _send_frame
super(BlockingConnection, self)._send_frame(frame_value)
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/connection.py", line 1312, in _send_frame
raise exceptions.ConnectionClosed
ConnectionClosed
[Worker 1]: [*] Fetching job...
[+] Incoming: ARASTUSER: fangfang, job_id: 325, message: b93.rast_fast
ERROR:pika.adapters.base_connection:Socket Error on fd 7: 104
What's pika? I assume it's a library for connecting over TCP sockets.
It is a Python client library for AMQP.
The versions of pika installed on elm and exp are different. The elm one is older:
elm:/vol/kbase/runtime/lib/python2.7/site-packages/pika/__init__.py __version__ = '0.9.8'
exp:/usr/local/lib/python2.7/dist-packages/pika/__init__.py __version__ = '0.9.13'
rabbitmq on elm:
yum whatprovides "/usr/sbin/rabbitmq-server"
rabbitmq-server-3.1.5-1.el6.noarch : The RabbitMQ server
rabbitmq on exp:
dpkg -l | grep rabbit
ii rabbitmq-server 2.7.1-0ubuntu4 An AMQP server written in Erlang
Jobs hit socket errors after running for around 65 minutes, which causes them to fail with a ‘broken pipe’. Because RabbitMQ never receives an ack, the job sometimes gets restarted.
Example:
| 331 | 209 | Stage 5/9: idba | 1:03:52 | b93.slow |
| 332 | 209 | Complete | 0:48:52 | b93.auto |
| 333 | 209 | Complete | 0:40:10 | b93.rast_fast |
| 334 | 209 | Stage 3/4: a6 | 1:03:23 | b93.rast |
Upload complete: /disks/arast/fangfang/ar-test-data/fangfang/209/333/333_report.txt (previous job)
ERROR:pika.adapters.base_connection:Socket Error on fd 12: 104
WARNING:pika.adapters.base_connection:Socket closed when connection was open
…
File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 371, in _check_state_on_disconnect
raise exceptions.ConnectionClosed()
…
| 331 | 209 | [FAIL] [Errno 32] Broken pipe | 0:37:16 | b93.slow |
| 332 | 209 | Complete | 0:48:52 | b93.auto |
| 333 | 209 | Complete | 0:40:10 | b93.rast_fast |
| 334 | 209 | [FAIL] [Errno 32] Broken pipe | 1:05:19 | b93.rast |
The same jobs succeed on magellan (~70 minutes).
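The restart behaviour described above (the job finishes, the ack never reaches the broker, and the job runs again) can be illustrated with a toy model. This is only a sketch: `ToyBroker` and its methods are hypothetical names invented for illustration, not pika or RabbitMQ code. It relies on one known broker behaviour, namely that messages delivered with manual acks are requeued when the consumer's connection closes before they are acked.

```python
from collections import deque


class ToyBroker(object):
    """Hypothetical in-memory stand-in for a broker with manual acks."""

    def __init__(self):
        self.ready = deque()   # messages waiting for a consumer
        self.unacked = {}      # delivery_tag -> message delivered but not acked
        self._next_tag = 0

    def publish(self, message):
        self.ready.append(message)

    def deliver(self):
        """Hand the next message to a consumer and remember it until acked."""
        message = self.ready.popleft()
        self._next_tag += 1
        self.unacked[self._next_tag] = message
        return self._next_tag, message

    def ack(self, delivery_tag):
        """Consumer confirms the work is done; the message is dropped."""
        del self.unacked[delivery_tag]

    def connection_lost(self):
        """Requeue everything that was delivered but never acked."""
        for tag in sorted(self.unacked):
            self.ready.append(self.unacked[tag])
        self.unacked.clear()


broker = ToyBroker()
broker.publish("job 325")

# First attempt: the job is delivered and runs, but the connection drops
# before the ack is sent, so the broker puts the job back on the queue.
tag, job = broker.deliver()
runs = [job]
broker.connection_lost()

# Second attempt: the same job is delivered again and this time it is acked.
tag, job = broker.deliver()
runs.append(job)
broker.ack(tag)

print(runs)               # ['job 325', 'job 325']
print(len(broker.ready))  # 0
```

In this model the job body runs twice even though the first run completed, which matches the restarts seen on elm. It says nothing about why the connection dropped in the first place.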
Hypothesis
pika and RabbitMQ don't agree on how to send and receive messages because of the version difference.
Action
Try with the same versions that are used on Magellan.
This has been resolved by disabling the RabbitMQ heartbeat check.
After a while, it becomes:
I have not noticed this happening elsewhere.
This may have been caused by the broken pipe. There are still some spades processes running.