kbaseattic / assembly

An extensible framework for genome assembly.
MIT License

socket error on elm #201

Closed by levinas 10 years ago

levinas commented 10 years ago
|  291   |   183   |    Stage 2/4: a6     | 0:12:37  |     b93.rast     |
|  292   |   183   |    Stage 2/3: a6     | 0:41:08  |  b93.rast_fast   |

After a while, it becomes:

|  291   |   183   | [FAIL] [Errno 32] Broken pipe | 0:15:02  |       b93.rast       |
|  292   |   183   | [FAIL] [Errno 32] Broken pipe | 0:15:00  |    b93.rast_fast     |

I have not noticed this happening elsewhere.

This may have been caused by a broken pipe. Some spades processes are still running.

levinas commented 10 years ago

@cbun: jobs got restarted after this error message on the compute server on elm:

Upload complete: /disks/arast/fangfang/ar-test-data/fangfang/209/325/325_report.txt
ERROR:pika.adapters.base_connection:Socket Error on fd 7: 104
WARNING:pika.adapters.blocking_connection:Received Channel.Close, closing: None
Process [Worker 1]::
Traceback (most recent call last):
  File "/vol/kbase/runtime/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/vol/kbase/runtime/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/disks/arast/fangfang/assembly/lib/assembly/consume.py", line 376, in start
    self.fetch_job()
  File "/disks/arast/fangfang/assembly/lib/assembly/consume.py", line 350, in fetch_job
    channel.start_consuming()
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 722, in start_consuming
    self.connection.process_data_events()
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 88, in process_data_events
    if self._handle_read():
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 184, in _handle_read
    super(BlockingConnection, self)._handle_read()
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/base_connection.py", line 308, in _handle_read
    self._on_data_available(data)
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/connection.py", line 1138, in _on_data_available
    self._process_frame(frame_value)
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/connection.py", line 1193, in _process_frame
    self._deliver_frame_to_channel(frame_value)
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/connection.py", line 843, in _deliver_frame_to_channel
    return self._channels[value.channel_number]._handle_content_frame(value)
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/channel.py", line 776, in _handle_content_frame
    self._on_deliver(*response)
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/channel.py", line 851, in _on_deliver
    body)
  File "/disks/arast/fangfang/assembly/lib/assembly/consume.py", line 372, in callback
    ch.basic_ack(delivery_tag=method.delivery_tag)
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/channel.py", line 138, in basic_ack
    return self._send_method(spec.Basic.Ack(delivery_tag, multiple))
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 920, in _send_method
    self.connection.send_method(self.channel_number, method_frame, content)
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 120, in send_method
    self._send_method(channel_number, method_frame, content)
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/connection.py", line 1331, in _send_method
    self._send_frame(frame.Method(channel_number, method_frame))
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 245, in _send_frame
    super(BlockingConnection, self)._send_frame(frame_value)
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/connection.py", line 1318, in _send_frame
    self._flush_outbound()
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 205, in _flush_outbound
    if self._handle_write():
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/base_connection.py", line 320, in _handle_write
    return self._handle_error(error)
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/base_connection.py", line 264, in _handle_error
    self._handle_disconnect()
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 181, in _handle_disconnect
    self._on_connection_closed(None, True)
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 232, in _on_connection_closed
    self._channels[channel]._on_close(method_frame)
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 817, in _on_close
    self._send_method(spec.Channel.CloseOk(), None, False)
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 920, in _send_method
    self.connection.send_method(self.channel_number, method_frame, content)
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 120, in send_method
    self._send_method(channel_number, method_frame, content)
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/connection.py", line 1331, in _send_method
    self._send_frame(frame.Method(channel_number, method_frame))
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 245, in _send_frame
    super(BlockingConnection, self)._send_frame(frame_value)
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/connection.py", line 1312, in _send_frame
    raise exceptions.ConnectionClosed
ConnectionClosed
[Worker 1]:  [*] Fetching job...
 [+] Incoming: ARASTUSER: fangfang, job_id: 325, message: b93.rast_fast

sebhtml commented 10 years ago

ERROR:pika.adapters.base_connection:Socket Error on fd 7: 104

What's pika? I assume it's a library for connecting with TCP sockets.

cbun commented 10 years ago

It is a Python library for AMQP.

levinas commented 10 years ago

The versions of pika installed on elm and exp are different. The elm one is older:

elm:/vol/kbase/runtime/lib/python2.7/site-packages/pika/__init__.py __version__ = '0.9.8'

exp:/usr/local/lib/python2.7/dist-packages/pika/__init__.py __version__ = '0.9.13'
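
A quick way to compare the two installs from Python, as a minimal sketch: the helper names `parse_version` and `installed_pika_version` are made up for illustration, and the only pika feature relied on is the `__version__` attribute quoted above. Dotted versions need to be compared as tuples of integers, because a plain string comparison ranks '0.9.8' above '0.9.13'.

```python
def parse_version(text):
    """Turn a dotted version string such as '0.9.13' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def installed_pika_version():
    """Return the local pika version string, or None if pika is not importable."""
    try:
        import pika
    except ImportError:
        return None
    return pika.__version__


elm = parse_version("0.9.8")    # version reported on elm
exp = parse_version("0.9.13")   # version reported on exp

print(elm < exp)                # tuple comparison: elm is the older install
print("0.9.8" < "0.9.13")       # string comparison gets the order wrong
print(installed_pika_version()) # whatever is installed where this runs
```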

levinas commented 10 years ago

rabbitmq on elm:
yum whatprovides "/usr/sbin/rabbitmq-server"
rabbitmq-server-3.1.5-1.el6.noarch : The RabbitMQ server

rabbitmq on exp:
dpkg -l | grep rabbit
ii rabbitmq-server 2.7.1-0ubuntu4 An AMQP server written in Erlang

levinas commented 10 years ago

Socket errors appear after jobs have been running for around 65 minutes, causing them to fail with a ‘broken pipe’. Because RabbitMQ never receives an ack, the job sometimes gets restarted.

Example:

|  331   |   209   |        Stage 5/9: idba        | 1:03:52  |    b93.slow   |
|  332   |   209   |            Complete           | 0:48:52  |    b93.auto   |
|  333   |   209   |            Complete           | 0:40:10  | b93.rast_fast |
|  334   |   209   |         Stage 3/4: a6         | 1:03:23  |    b93.rast   |
Upload complete: /disks/arast/fangfang/ar-test-data/fangfang/209/333/333_report.txt (previous job)

ERROR:pika.adapters.base_connection:Socket Error on fd 12: 104
WARNING:pika.adapters.base_connection:Socket closed when connection was open
…
  File "/vol/kbase/runtime/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 371, in _check_state_on_disconnect
    raise exceptions.ConnectionClosed()
…
|  331   |   209   | [FAIL] [Errno 32] Broken pipe | 0:37:16  |    b93.slow   |
|  332   |   209   |            Complete           | 0:48:52  |    b93.auto   |
|  333   |   209   |            Complete           | 0:40:10  | b93.rast_fast |
|  334   |   209   | [FAIL] [Errno 32] Broken pipe | 1:05:19  |    b93.rast   |

The same jobs succeed on magellan (~70 minutes).
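
The restart behaviour can be illustrated with a toy model. This is only a sketch of at-least-once delivery with manual acks, not the actual code in consume.py or RabbitMQ: `ToyBroker` and its methods are hypothetical names. A message that was delivered but never acknowledged goes back on the queue when the connection drops, so the same job is handed out again.

```python
from collections import deque


class ToyBroker:
    """In-memory stand-in for a queue with manual acknowledgements."""

    def __init__(self):
        self.ready = deque()   # messages waiting for a consumer
        self.unacked = {}      # delivery tag -> message awaiting an ack
        self._next_tag = 1

    def publish(self, body):
        self.ready.append(body)

    def deliver(self):
        """Hand the next message to a consumer and remember it until acked."""
        body = self.ready.popleft()
        tag = self._next_tag
        self._next_tag += 1
        self.unacked[tag] = body
        return tag, body

    def ack(self, tag):
        """Forget a message once the consumer confirms it is done."""
        del self.unacked[tag]

    def connection_lost(self):
        """Requeue everything that was delivered but never acknowledged."""
        for tag in sorted(self.unacked):
            self.ready.append(self.unacked[tag])
        self.unacked.clear()


broker = ToyBroker()
broker.publish("job 325: b93.rast_fast")

first_tag, first_body = broker.deliver()    # worker starts the job
# The job finishes, but the socket dies before the ack can be sent.
broker.connection_lost()

second_tag, second_body = broker.deliver()  # the same job comes back
broker.ack(second_tag)
print(first_body == second_body, len(broker.ready), len(broker.unacked))
```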

sebhtml commented 10 years ago

Hypothesis

pika and rabbitmq don't agree on how to send and receive messages because of the version difference.

Action

Try with the same versions that are used on Magellan.

levinas commented 10 years ago

This has been resolved by disabling the RabbitMQ heartbeat check.
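
For readers unfamiliar with heartbeats, here is a toy model of why turning the check off can keep a long job's connection open. It rests on an assumption this thread does not confirm: that the worker sends no traffic while a job runs. `HeartbeatMonitor` is a hypothetical class, the 600-second timeout is an invented value, and nothing here is pika or RabbitMQ API.

```python
class HeartbeatMonitor:
    """Toy model of a heartbeat check; timeout is in seconds, 0 disables it."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = 0

    def traffic(self, now):
        """Record that the peer sent something at time `now`."""
        self.last_seen = now

    def is_alive(self, now):
        """Return False once the peer has been silent longer than the timeout."""
        if self.timeout == 0:
            return True
        return now - self.last_seen <= self.timeout


JOB_SECONDS = 65 * 60   # roughly the job duration reported in this thread
TIMEOUT = 600           # hypothetical value, not taken from this thread

checked = HeartbeatMonitor(TIMEOUT)
disabled = HeartbeatMonitor(0)

# The worker is assumed to stay silent for the whole job.
print(checked.is_alive(JOB_SECONDS), disabled.is_alive(JOB_SECONDS))
```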