vivek-bala / radical.entk

Persistent heart beats, no termination #40

Closed. jdakka closed this issue 6 years ago.

jdakka commented 6 years ago

My application runs to completion, i.e. no FAILED units; however, either EnTK or RP is not terminating properly. As a result, RabbitMQ continues to send heartbeat pulses until the walltime is reached.

stack:

radical.pilot        : 0.47-v0.46.2-193-g2da575bf@HEAD-detached-at-2da575bf
radical.utils        : 0.47-v0.47-3-g18dab2b@HEAD-detached-at-18dab2b
saga                 : 0.47-v0.46-34-g0f186f66@HEAD-detached-at-0f186f66

debug.log has the run's log. I have given you permission to access the agent files on BW:

/u/sciteam/dakka/scratch/radical.pilot.sandbox/rp.session.two.jdakka.017524.0007

data_for_vivek.zip

vivek-bala commented 6 years ago

Thanks Jumana. Can you point me to the EnTK script you are running as well?

jdakka commented 6 years ago

https://github.com/jdakka/radical.htbac/blob/PipelineOfEnsembles/src/radical/htbac/ties.py

vivek-bala commented 6 years ago

So I checked the logs on both the agent and the client. It seems that not all units reach the DONE state. The application appears to have a total of 780 tasks.

Agent:

$ cat agent_0.staging.output.0.child.log | grep UMGR_STAGING_OUTPUT | uniq | wc -l
780

Client:

$ cat umgr.0000.staging.output.0.child.log | grep DONE | uniq | wc -l
586

@andre-merzky: I might need your help with this ticket. @jdakka: Is this issue recurring for you (do you get this error every time you try)? For completeness, can you also add the radical stack you are using, please?

I'm going to keep looking for any errors or warnings in the logs. Just wanted to update the ticket and bring in Andre to help debug.

vivek-bala commented 6 years ago

session_dump.zip

The DB also does not have all units in the completed state.

$ cat rp.session.two.jdakka.017524.0007.json | grep '"state": "DONE"' | wc -l
586

Not sure what's happening between the agent and the MongoDB instance at this point.
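
As a diagnostic aid, a small script along the following lines could list exactly which units never reached DONE in the session dump, rather than just the totals. It is only a sketch: it assumes the dump is a JSON document whose unit records carry 'uid' and 'state' fields, as the grep above suggests, so the filter may need adjusting to the actual layout.

# Sketch: list units in a session dump that never reached DONE.
# Usage: python find_stuck_units.py rp.session.two.jdakka.017524.0007.json
# Assumes unit records are dicts carrying 'uid' and 'state' fields somewhere
# in the JSON document; adjust the filter if the dump uses other field names.
import json
import sys

def iter_dicts(node):
    # Recursively yield every dict contained in the parsed JSON document.
    if isinstance(node, dict):
        yield node
        for value in node.values():
            for d in iter_dicts(value):
                yield d
    elif isinstance(node, list):
        for item in node:
            for d in iter_dicts(item):
                yield d

def main(path):
    with open(path) as f:
        doc = json.load(f)
    states = {}  # uid -> set of observed states
    for entry in iter_dicts(doc):
        uid = entry.get('uid', '')
        if uid.startswith('unit.') and 'state' in entry:
            states.setdefault(uid, set()).add(entry['state'])
    stuck = sorted(uid for uid, seen in states.items() if 'DONE' not in seen)
    print('%d units total, %d never reached DONE' % (len(states), len(stuck)))
    for uid in stuck:
        print('  %s : %s' % (uid, sorted(states[uid])))

if __name__ == '__main__':
    main(sys.argv[1])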

jdakka commented 6 years ago

There appears to be a duplicate UMGR_STAGING_OUTPUT_PENDING entry for all the units, and 194 tasks do not reach UMGR_STAGING_OUTPUT:

nwk-2-80:isc-htbac-experiments JumanaDakka$ cat /Users/JumanaDakka/Downloads/rp.session.two.jdakka.017524.0007.json | grep '"AGENT_STAGING_OUTPUT"' | wc -l
     780
nwk-2-80:isc-htbac-experiments JumanaDakka$ cat /Users/JumanaDakka/Downloads/rp.session.two.jdakka.017524.0007.json | grep '"UMGR_STAGING_OUTPUT_PENDING"' | wc -l
    1560
nwk-2-80:isc-htbac-experiments JumanaDakka$ cat /Users/JumanaDakka/Downloads/rp.session.two.jdakka.017524.0007.json | grep '"UMGR_STAGING_OUTPUT"' | wc -l
     586
jdakka commented 6 years ago

This is now urgent: @kristofarkas is facing the same issue, and I noticed similar behavior.

nwk-2-80:rp.session.julian.chem.ucl.ac.uk.kristof.farkas-pall.017527.0000 JumanaDakka$ cat *.json | grep '"AGENT_STAGING_OUTPUT"' | wc -l
      61
nwk-2-80:rp.session.julian.chem.ucl.ac.uk.kristof.farkas-pall.017527.0000 JumanaDakka$ cat *.json | grep '"UMGR_STAGING_OUTPUT_PENDING"' | wc -l
     122
nwk-2-80:rp.session.julian.chem.ucl.ac.uk.kristof.farkas-pall.017527.0000 JumanaDakka$ cat *.json | grep '"UMGR_STAGING_OUTPUT"' | wc -l
      62

UMGR_STAGING_OUTPUT_PENDING is again duplicated, and one of his units has an additional UMGR_STAGING_OUTPUT entry. Attached: rp.sess.tar.gz

jdakka commented 6 years ago

I looked a bit deeper into this: unit.000060 is the one that gets stuck at UMGR_STAGING_OUTPUT and doesn't update to DONE. Even though he specified additional units past unit.000060, no more units are either submitted or created, depending on whether it's EnTK or RP.

The agent sandbox on Blue Waters is /u/sciteam/farkaspa/scratch/radical.pilot.sandbox/rp.session.julian.chem.ucl.ac.uk.kristof.farkas-pall.017527.0000

2017-12-27 17:27:46,575: umgr.0000           : task-manager                    : umgr.0000.subscriber._state_sub_cb: INFO    : [Callback]: unit unit.000060 state: UMGR_STAGING_OUTPUT.
2017-12-27 17:27:46,575: umgr.0000           : task-manager                    : umgr.0000.subscriber._state_sub_cb: DEBUG   : umgr.0000 calls state cb unit_state_cb for unit.000060
2017-12-27 17:27:47,961: umgr.0000           : task-manager                    : umgr.0000.idler._unit_pull_cb: INFO    : units pulled:    0
2017-12-27 17:27:49,500: umgr.0000           : task-manager                    : umgr.0000.idler._unit_pull_cb: INFO    : units pulled:    0
2017-12-27 17:27:51,041: umgr.0000           : task-manager                    : umgr.0000.idler._unit_pull_cb: INFO    : units pulled:    0
2017-12-27 17:27:52,571: umgr.0000           : task-manager                    : umgr.0000.idler._unit_pull_cb: INFO    : units pulled:    0
2017-12-27 17:27:54,104: umgr.0000           : task-manager                    : umgr.0000.idler._unit_pull_cb: INFO    : units pulled:    0
2017-12-27 17:27:55,640: umgr.0000           : task-manager                    : umgr.0000.idler._unit_pull_cb: INFO    : units pulled:    0
andre-merzky commented 6 years ago

Hey all,

I finally started to look into this. Jumana, could you please attach the pilot sandbox for rp.session.two.jdakka.017524.0007? Thanks!

andre-merzky commented 6 years ago

My guess is that the UMGR either got stuck or was killed before all units were pulled. The DB entries look OK, but the UMGR last pulls unit.000663, calls the unit's state callbacks, and then stalls. Either that state callback never returns, or something else stalls; I do not yet see what that could be.
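
One way to narrow down whether the application's state callback is the piece that never returns (this is only a debugging sketch, not part of the RP or EnTK API) is to wrap the callback so that every invocation logs entry and exit. If a state change is logged on entry but never on exit, the callback is where execution stalls; if the wrapper is never entered again, the stall is elsewhere in the UMGR.

# Debugging sketch: trace entry/exit of a state callback to see whether it
# ever fails to return. Pure Python; nothing here is RP- or EnTK-specific.
import functools
import logging
import time

logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(message)s')
log = logging.getLogger('cb-trace')

def traced(cb):
    # Wrap a callback so that entry, exit, and duration are logged.
    @functools.wraps(cb)
    def wrapper(*args, **kwargs):
        log.debug('enter %s args=%s', cb.__name__, args)
        start = time.time()
        try:
            return cb(*args, **kwargs)
        finally:
            log.debug('exit  %s after %.3fs', cb.__name__, time.time() - start)
    return wrapper

# Hypothetical usage: wrap whatever callback the application registers, e.g.
#   umgr.register_callback(traced(unit_state_cb))
# 'umgr', 'register_callback' and 'unit_state_cb' stand in for the
# application's own objects and names here.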

jdakka commented 6 years ago

My issue is not as important as @kristofarkas's. His sandbox on BW is /u/sciteam/farkaspa/scratch/radical.pilot.sandbox/rp.session.julian.chem.ucl.ac.uk.kristof.farkas-pall.017527.0000 and the client logs are in rp.sess.tar.gz.

jdakka commented 6 years ago

But I gave access to my sandbox /u/sciteam/dakka/scratch/radical.pilot.sandbox/rp.session.two.jdakka.017524.0007/ in case you need it.

andre-merzky commented 6 years ago

For clarity: @kristofarkas' issue is the same as yours, right?

jdakka commented 6 years ago

Yes, we both experience hanging units.

vivek-bala commented 6 years ago

I see the following error in the logs from Jumana and Kristof:

Traceback (most recent call last):
  File "/home/jdakka/poE/venv/local/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 812, in _synchronizer
    method_frame, props, body = mq_channel.basic_get(queue='enq-to-sync')
  File "/home/jdakka/poE/venv/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 2034, in basic_get
    self._basic_getempty_result.is_ready)
  File "/home/jdakka/poE/venv/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 1250, in _flush_output
    *waiters)
  File "/home/jdakka/poE/venv/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 455, in _flush_output
    self._impl.ioloop.poll()
  File "/home/jdakka/poE/venv/local/lib/python2.7/site-packages/pika/adapters/select_connection.py", line 245, in poll
    self._poller.poll()
  File "/home/jdakka/poE/venv/local/lib/python2.7/site-packages/pika/adapters/select_connection.py", line 935, in poll
    self._dispatch_fd_events(fd_event_map)
  File "/home/jdakka/poE/venv/local/lib/python2.7/site-packages/pika/adapters/select_connection.py", line 625, in _dispatch_fd_events
    handler(fileno, events)
  File "/home/jdakka/poE/venv/local/lib/python2.7/site-packages/pika/adapters/base_connection.py", line 395Exception in thread synchronizer-thread:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/jdakka/poE/venv/local/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 869, in _synchronizer
    raise Error(text=ex)
Error: Error: (404, "NOT_FOUND - no queue 'enq-to-sync' in vhost '/'")

It seems like RabbitMQ is either unresponsive or terminated. Can you try a new RabbitMQ instance? Possibly close others if you are not using them.

Also check if there are any stray processes from previous Ctrl+C or Ctrl+Z actions, and kill them if so. Let me know how it goes.
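
To check whether the broker is reachable and whether the 'enq-to-sync' queue still exists, a passive queue declare is a cheap test: it only inspects the queue and never creates it, so a 404 NOT_FOUND there means the queue really is gone, as in the traceback above. A minimal sketch with pika; host and port are placeholders for the RabbitMQ instance actually used.

# Sketch: probe a RabbitMQ queue with a passive declare (inspects only, never
# creates). Host/port below are placeholders for the actual broker.
import pika

def check_queue(host, port, queue):
    conn = pika.BlockingConnection(pika.ConnectionParameters(host=host, port=port))
    try:
        channel = conn.channel()
        # passive=True raises if the queue does not exist (404 NOT_FOUND)
        result = channel.queue_declare(queue=queue, passive=True)
        return result.method.message_count
    finally:
        conn.close()

if __name__ == '__main__':
    print(check_queue('localhost', 5672, 'enq-to-sync'))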

kristofarkas commented 6 years ago

2017-12-29 09:06:39,137: radical.entk.wfprocessor: wfprocessor                     : enqueue-thread : ERROR   : Transition of radical.entk.stage.0010 to SCHEDULING state failed, error: (-1, "error(32, 'Broken pipe')")
2017-12-29 09:06:39,138: radical.entk.wfprocessor: wfprocessor                     : enqueue-thread : ERROR   : Error while updating stage state, rolling back. Error:
2017-12-29 09:06:39,139: radical.entk.wfprocessor: wfprocessor                     : enqueue-thread : ERROR   : Transition of radical.entk.stage.0010 to DESCRIBED state failed, error:
2017-12-29 09:06:39,140: radical.entk.wfprocessor: wfprocessor                     : enqueue-thread : ERROR   : Error in enqueue-thread:
Traceback (most recent call last):
  File "/Users/kristof.farkas-pall/radical-jdakka/lib/python2.7/site-packages/radical/entk/appman/wfprocessor.py", line 315, in _enqueue
    logger=self._logger)
  File "/Users/kristof.farkas-pall/radical-jdakka/lib/python2.7/site-packages/radical/entk/utils/init_transition.py", line 40, in transition
    local_prof = profiler)
  File "/Users/kristof.farkas-pall/radical-jdakka/lib/python2.7/site-packages/radical/entk/utils/sync_initiator.py", line 25, in sync_with_master
    correlation_id = corr_id
  File "/Users/kristof.farkas-pall/radical-jdakka/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 2077, in basic_publish
    mandatory, immediate)
  File "/Users/kristof.farkas-pall/radical-jdakka/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 2163, in publish
    immediate=immediate)
  File "/Users/kristof.farkas-pall/radical-jdakka/lib/python2.7/site-packages/pika/channel.py", line 415, in basic_publish
    raise exceptions.ChannelClosed()
ChannelClosed

Exception in thread enqueue-thread:
Traceback (most recent call last):
  File "/Users/kristof.farkas-pall/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/Users/kristof.farkas-pall/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/Users/kristof.farkas-pall/radical-jdakka/lib/python2.7/site-packages/radical/entk/appman/wfprocessor.py", line 338, in _enqueue
    raise Error(text=ex)
Error: Error:

This is the error that shows up in the logs.
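
A ChannelClosed on basic_publish like the one above means the AMQP channel was torn down underneath the publisher, consistent with the broken-pipe errors earlier in the log. Not a fix for whatever drops the connection, but one way to make the symptom survivable and visible would be to retry the publish on a fresh connection, along the lines of this sketch (the connection parameters are placeholders; nothing here is EnTK code):

# Sketch: publish with a reconnect-and-retry guard so a dropped AMQP channel
# surfaces as a logged retry instead of killing the enqueue thread outright.
import time
import pika
import pika.exceptions

def publish_with_retry(params, queue, body, attempts=3, delay=2.0):
    last_exc = None
    for attempt in range(1, attempts + 1):
        conn = None
        try:
            # a fresh connection per attempt, so a previously closed channel
            # or connection cannot poison later attempts
            conn = pika.BlockingConnection(params)
            channel = conn.channel()
            channel.basic_publish(exchange='', routing_key=queue, body=body)
            conn.close()
            return
        except pika.exceptions.AMQPError as exc:
            last_exc = exc
            if conn is not None and conn.is_open:
                conn.close()
            time.sleep(delay * attempt)
    raise last_exc

# placeholder connection parameters -- substitute the broker used by the run
publish_with_retry(pika.ConnectionParameters('localhost', 5672),
                   'enq-to-sync', b'ping')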

jdakka commented 6 years ago

This is the Docker log of one of @kristofarkas's runs using RabbitMQ on the RADICAL-VM2: log_rabbitMQ.log

jdakka commented 6 years ago

I tested again with the latest RC stack (RP/RU/RS) and experiments/jdakka-resubmission (EnTK).

I ran from the radical-VM but connected to a new Docker RabbitMQ instance on a Jetstream VM. Again, this RabbitMQ instance was only used for this run.

I've added the logs from the client side as well as the Docker logs: docker.log, bad_data.zip

The sandbox is /u/sciteam/dakka/scratch/radical.pilot.sandbox/rp.session.two.jdakka.017530.0027/pilot.0000; permission is granted.

andre-merzky commented 6 years ago

The logfiles look similar: the UMGR gets 15 units for output staging and handles those, but then seems to hang and does not attempt to get any more units.

The EnTK log again says that connections were dropped, but this time without an exception.

HtH, Andre

jdakka commented 6 years ago

@andre-merzky @vivek-bala I've included an out-of-the-box test script for running the adaptive workload: https://github.com/radical-cybertools/htbac/tree/adaptive/experiments/ta_3_1. There's a README that describes dependencies, etc. Also, the RabbitMQ port referenced in the script was generated specifically for this testing and has not been used for anything else. It points to Docker running on the RADICAL-VM2.

FYI, the dependency-links will install the RP/RU/RS master branches, while EnTK uses the experiments/jdakka-resubmission branch.

vivek-bala commented 6 years ago

Bump. Is this issue still encountered? If this is not an issue any more, please close the ticket.

vivek-bala commented 6 years ago

Hey @jdakka , if you don't see this issue any more, please close this ticket.

jdakka commented 6 years ago

@vivek-bala this issue was raised for the longer-running units on Blue Waters. I will test the adaptive experiments from the ISC paper again, which have longer-running units. The last time I tested those experiments was around January 24th, and the persistent heartbeat problem was still there, which hindered RP termination. I'll rerun the same set. Sorry to keep this ticket open for so long.

jdakka commented 6 years ago

Blocked by 1546, so I am installing the VE in my radical.pilot.sandbox.

FYI I have to revert to using devel in RP as I cannot use the master RP stack due to the uid problem...

jdakka commented 6 years ago

Based on today's discussion, I'll test with the RP release candidate as opposed to the master or development branches.

vivek-bala commented 6 years ago

bump

jdakka commented 6 years ago

Sorry, the queues were long; the experiment didn't complete until yesterday. This is to confirm that both long-running CUs complete and adaptivity still works!

stack:

radical.entk         : 0.6.0-entk-0.5-289-g74e2b97@master
radical.pilot        : 0.47.1
radical.utils        : 0.47.1
saga                 : 0.47