radical-collaboration / extasy-grlsd

Repository to hold the input data and scripts for the ExTASY gromacs-lsdmap work

partial connection issues #93

Closed: euhruska closed this issue 5 years ago

euhruska commented 5 years ago

From the same RabbitMQ instance I launch several EnTK runs; some work, but some fail with connection issues as shown below. My question: how can another EnTK run on the same computer and the same RabbitMQ instance work while this one fails?

Traceback (most recent call last):
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/execman/base/task_manager.py", line 151, in _heartbeat
    method_frame, props, body = mq_channel.basic_get(queue=self._hb_response_q)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 2077, in basic_get
    self._basic_getempty_result.is_ready)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 1292, in _flush_output
    *waiters)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 458, in _flush_output
    self._impl.ioloop.poll()
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pika/adapters/select_connection.py", line 495, in poll
    self._poller.poll()
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pika/adapters/select_connection.py", line 1114, in poll
    self._dispatch_fd_events(fd_event_map)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pika/adapters/select_connection.py", line 831, in _dispatch_fd_events
    handler(fileno, events)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pika/adapters/base_connection.py", line 410, in _handle_events
    self._handle_read()
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pika/adapters/base_connection.py", line 464, in _handle_read
    self._on_data_available(data)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pika/connection.py", line 2021, in _on_data_available
    self._process_frame(frame_value)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pika/connection.py", line 2142, in _process_frame
    if self._process_callbacks(frame_value):
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pika/connection.py", line 2123, in _process_callbacks
    frame_value)  # Args
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pika/callback.py", line 60, in wrapper
    return function(*tuple(args), **kwargs)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pika/callback.py", line 92, in wrapper
    return function(*args, **kwargs)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pika/callback.py", line 236, in process
    callback(*args, **keywords)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 1358, in _on_channel_closed
    method.reply_text)
ChannelClosed: (404, "NOT_FOUND - no queue 're.session.leonardo.rice.edu.eh22.017795.0008-hb-response' in vhost '/'")
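
For context, the final frame can be reproduced outside EnTK: calling basic_get on a queue that does not exist (here the heartbeat-response queue appears to have been deleted or never declared) makes the broker close the channel with a 404. A minimal pika sketch, assuming a local RabbitMQ broker and a hypothetical queue name:

# Minimal sketch (not EnTK code): basic_get on a missing queue raises
# the same ChannelClosed (404, NOT_FOUND) seen in the traceback above.
# The queue name is hypothetical; assumes RabbitMQ runs on localhost.
import pika
from pika import exceptions

conn = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = conn.channel()
try:
    channel.basic_get(queue='example-hb-response')
except exceptions.ChannelClosed as exc:
    print(exc)  # e.g. (404, "NOT_FOUND - no queue 'example-hb-response' in vhost '/'")
finally:
    conn.close()
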
vivek-bala commented 5 years ago

Can you attach the entire log please? The traceback alone isn't always helpful.

euhruska commented 5 years ago

log_extasy_tica3_villin_small.log

euhruska commented 5 years ago

It happened again and was fatal. Sometimes restarting helps, but this issue can persist for days.

vivek-bala commented 5 years ago

Thanks Eugen, looking into this now. Can you tell me how many concurrent tasks you have and how long each task runs for?

euhruska commented 5 years ago

Each task is 24h long, and there are about 5-8 concurrent tasks.

vivek-bala commented 5 years ago

Hmmm, that's odd. I see about 100 tasks submitted for concurrent execution in the log. Can you point me to your script and the specific parameters that you use, please?

euhruska commented 5 years ago

Wait, I meant 100 tasks; the 5-8 refers to concurrent independent ExTASY runs.

euhruska commented 5 years ago

The MD step is the most concurrent step; a sketch of the pattern follows: https://github.com/ClementiGroup/extasy-koopman/blob/master/extasy_tica3.py#L126
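
For readers without the linked script at hand, the pattern at that line is, roughly, one Stage holding all MD Tasks of an iteration. A hypothetical, simplified sketch using the EnTK API (names, executable, and arguments are placeholders, not the actual extasy_tica3.py code; whether executable is a string or a list varies across EnTK versions):

# Hypothetical sketch of a concurrent MD stage: ~100 tasks in one Stage
# are submitted for concurrent execution. Placeholders throughout.
from radical.entk import Pipeline, Stage, Task

p = Pipeline()
md_stage = Stage()
for i in range(100):
    t = Task()
    t.name = 'md-%04d' % i
    t.executable = 'gmx'                        # placeholder MD executable
    t.arguments = ['mdrun', '-deffnm', t.name]  # placeholder arguments
    md_stage.add_tasks(t)
p.add_stages(md_stage)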

vivek-bala commented 5 years ago

I think I understand why this is happening. Can you set the environment variable ENTK_HB_INTERVAL=90? That's a short-term fix; the long-term fix is https://github.com/radical-cybertools/radical.entk/issues/270. I'll try to get to that ticket over the weekend and will keep you posted.

Basically, the verbose printing to stdout/stderr and the task creation and submission are interfering with a timeout. The more you print (or the larger the number of tasks, >=4K, which is not the case in your trials), the longer it takes. Additionally, you might want to try one individual run at a time to keep the load on your machine low.
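
If it helps, one way to apply the short-term fix is to set the variable before EnTK is loaded, either with export ENTK_HB_INTERVAL=90 in the shell or at the top of the workflow script. A minimal sketch, assuming the variable is read from the environment when EnTK starts up:

# Minimal sketch of the short-term fix: raise the heartbeat interval to
# 90 seconds before importing radical.entk (assumes the variable is read
# at startup; equivalent to 'export ENTK_HB_INTERVAL=90' in the shell).
import os
os.environ['ENTK_HB_INTERVAL'] = '90'

from radical.entk import AppManager  # import only after setting the variable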

euhruska commented 5 years ago

Thank you, sounds good.

vivek-bala commented 5 years ago

Hey Eugen, did the HB_INTERVAL setup help? Unfortunately, the long-term fix will take more time than I had initially estimated. How are your experiments going?

euhruska commented 5 years ago

I haven't seen this error since. Sometimes I have to restart everything to fix the improper-termination issue, but currently my main issues are https://github.com/radical-collaboration/extasy-grlsd/issues/98 (fatal) and https://github.com/radical-collaboration/extasy-grlsd/issues/95 (nice to have). Besides that, I'm running more iterations to reach convergence.

vivek-bala commented 5 years ago

Great, glad to hear that. I have responded to #98, and Andre will probably ping back with suggestions on #95 at the end of the week.