Closed jdakka closed 6 years ago
Thanks Jumana. Can you point me to the EnTK script you are running as well?
So I checked both the logs on the agent and the client. It seems like not all the units reach the DONE state. The application seems to have a total of 780 tasks.
Agent:
$ cat agent_0.staging.output.0.child.log | grep UMGR_STAGING_OUTPUT | uniq | wc -l
780
Client:
$ cat umgr.0000.staging.output.0.child.log | grep DONE | uniq | wc -l
586
@andre-merzky: I might need your help on this ticket. @jdakka: Is this issue recurring for you (do you get this error every time you try)? For completeness, can you also add the radical stack you are using, please?
I'm going to keep looking for any error or warnings in the logs. Just wanted to update the ticket and bring in Andre to help debug.
The DB also does not have all units in the completed state.
$ cat rp.session.two.jdakka.017524.0007.json | grep '"state": "DONE"' | wc -l
586
Not sure what's happening between the agent and the mongodb instance at this point.
There appears to be a duplicate UMGR_STAGING_OUTPUT_PENDING entry for all of the units, and 194 tasks do not reach UMGR_STAGING_OUTPUT:
nwk-2-80:isc-htbac-experiments JumanaDakka$ cat /Users/JumanaDakka/Downloads/rp.session.two.jdakka.017524.0007.json | grep '"AGENT_STAGING_OUTPUT"' | wc -l
780
nwk-2-80:isc-htbac-experiments JumanaDakka$ cat /Users/JumanaDakka/Downloads/rp.session.two.jdakka.017524.0007.json | grep '"UMGR_STAGING_OUTPUT_PENDING"' | wc -l
1560
nwk-2-80:isc-htbac-experiments JumanaDakka$ cat /Users/JumanaDakka/Downloads/rp.session.two.jdakka.017524.0007.json | grep '"UMGR_STAGING_OUTPUT"' | wc -l
586
This is now urgent: @kristofarkas is facing the same issue, and I noticed similar behavior in his session.
nwk-2-80:rp.session.julian.chem.ucl.ac.uk.kristof.farkas-pall.017527.0000 JumanaDakka$ cat *.json | grep '"AGENT_STAGING_OUTPUT"' | wc -l
61
nwk-2-80:rp.session.julian.chem.ucl.ac.uk.kristof.farkas-pall.017527.0000 JumanaDakka$ cat *.json | grep '"UMGR_STAGING_OUTPUT_PENDING"' | wc -l
122
nwk-2-80:rp.session.julian.chem.ucl.ac.uk.kristof.farkas-pall.017527.0000 JumanaDakka$ cat *.json | grep '"UMGR_STAGING_OUTPUT"' | wc -l
62
There is the same duplication of UMGR_STAGING_OUTPUT_PENDING, and one of his units has an additional UMGR_STAGING_OUTPUT. Client logs: rp.sess.tar.gz
I looked a bit deeper into this: unit.000060 is the one that gets stuck at UMGR_STAGING_OUTPUT and never updates to DONE. Even though he specified additional units past unit.000060, no more units are submitted or created (depending on whether EnTK or RP is responsible).
The agent sandbox on Blue Waters is
/u/sciteam/farkaspa/scratch/radical.pilot.sandbox/rp.session.julian.chem.ucl.ac.uk.kristof.farkas-pall.017527.0000
2017-12-27 17:27:46,575: umgr.0000 : task-manager : umgr.0000.subscriber._state_sub_cb: INFO : [Callback]: unit unit.000060 state: UMGR_STAGING_OUTPUT.
2017-12-27 17:27:46,575: umgr.0000 : task-manager : umgr.0000.subscriber._state_sub_cb: DEBUG : umgr.0000 calls state cb unit_state_cb for unit.000060
2017-12-27 17:27:47,961: umgr.0000 : task-manager : umgr.0000.idler._unit_pull_cb: INFO : units pulled: 0
2017-12-27 17:27:49,500: umgr.0000 : task-manager : umgr.0000.idler._unit_pull_cb: INFO : units pulled: 0
2017-12-27 17:27:51,041: umgr.0000 : task-manager : umgr.0000.idler._unit_pull_cb: INFO : units pulled: 0
2017-12-27 17:27:52,571: umgr.0000 : task-manager : umgr.0000.idler._unit_pull_cb: INFO : units pulled: 0
2017-12-27 17:27:54,104: umgr.0000 : task-manager : umgr.0000.idler._unit_pull_cb: INFO : units pulled: 0
2017-12-27 17:27:55,640: umgr.0000 : task-manager : umgr.0000.idler._unit_pull_cb: INFO : units pulled: 0
Hey all,
I finally started to look into this. Jumana, could you please attach the pilot sandbox for rp.session.two.jdakka.017524.0007? Thanks!
My guess is that the UMGR either got stuck or was killed before all units were pulled. The DB entries look ok, but the UMGR last pulls unit.000663, then calls the unit's state callbacks, and then stalls. Either that state callback never returns, or something else stalls; I do not yet see what that could be.
My issue is not as urgent as @kristofarkas'. His sandbox on BW is /u/sciteam/farkaspa/scratch/radical.pilot.sandbox/rp.session.julian.chem.ucl.ac.uk.kristof.farkas-pall.017527.0000
and the client logs are rp.sess.tar.gz
But I gave access to my sandbox /u/sciteam/dakka/scratch/radical.pilot.sandbox/rp.session.two.jdakka.017524.0007/ in case you need it.
For clarity: @kristofarkas' issue is the same as yours, right?
Yes we both experience hanging units.
I see the following error in the logs from Jumana and Kristof:
Traceback (most recent call last):
File "/home/jdakka/poE/venv/local/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 812, in _synchronizer
method_frame, props, body = mq_channel.basic_get(queue='enq-to-sync')
File "/home/jdakka/poE/venv/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 2034, in basic_get
self._basic_getempty_result.is_ready)
File "/home/jdakka/poE/venv/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 1250, in _flush_output
*waiters)
File "/home/jdakka/poE/venv/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 455, in _flush_output
self._impl.ioloop.poll()
File "/home/jdakka/poE/venv/local/lib/python2.7/site-packages/pika/adapters/select_connection.py", line 245, in poll
self._poller.poll()
File "/home/jdakka/poE/venv/local/lib/python2.7/site-packages/pika/adapters/select_connection.py", line 935, in poll
self._dispatch_fd_events(fd_event_map)
File "/home/jdakka/poE/venv/local/lib/python2.7/site-packages/pika/adapters/select_connection.py", line 625, in _dispatch_fd_events
handler(fileno, events)
File "/home/jdakka/poE/venv/local/lib/python2.7/site-packages/pika/adapters/base_connection.py", line 395
Exception in thread synchronizer-thread:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/jdakka/poE/venv/local/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 869, in _synchronizer
raise Error(text=ex)
Error: Error: (404, "NOT_FOUND - no queue 'enq-to-sync' in vhost '/'")
It seems like the rabbitmq instance is either unresponsive or has terminated. Can you try a new rabbitmq instance? Possibly close others if you are not using them.
Also check whether there are any stray processes from previous ctrl+c or ctrl+z actions, and kill them if so. Let me know how it goes.
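The stray-process check could be scripted along these lines (a sketch; the `radical.entk` process-name pattern is an assumption, adjust it to whatever your client processes are actually called):

```shell
check_stray() {
    # hypothetical: look for leftover EnTK/RP client processes from an
    # interrupted run; the pattern is an assumption, adjust as needed
    if pgrep -f 'radical.entk' > /dev/null 2>&1; then
        echo "stray processes found:"
        pgrep -af 'radical.entk'
        # kill them once you have confirmed they belong to a dead run:
        # pkill -f 'radical.entk'
        return 1
    fi
    echo "no stray radical.entk processes"
}

check_stray
```

The kill step is left commented out on purpose: confirm the PIDs belong to a dead run before running `pkill`.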
2017-12-29 09:06:39,137: radical.entk.wfprocessor: wfprocessor : enqueue-thread : ERROR : Transition of radical.entk.stage.0010 to SCHEDULING state failed, error: (-1, "error(32, 'Broken pipe')")
2017-12-29 09:06:39,138: radical.entk.wfprocessor: wfprocessor : enqueue-thread : ERROR : Error while updating stage state, rolling back. Error:
2017-12-29 09:06:39,139: radical.entk.wfprocessor: wfprocessor : enqueue-thread : ERROR : Transition of radical.entk.stage.0010 to DESCRIBED state failed, error:
2017-12-29 09:06:39,140: radical.entk.wfprocessor: wfprocessor : enqueue-thread : ERROR : Error in enqueue-thread:
Traceback (most recent call last):
File "/Users/kristof.farkas-pall/radical-jdakka/lib/python2.7/site-packages/radical/entk/appman/wfprocessor.py", line 315, in _enqueue
logger=self._logger)
File "/Users/kristof.farkas-pall/radical-jdakka/lib/python2.7/site-packages/radical/entk/utils/init_transition.py", line 40, in transition
local_prof = profiler)
File "/Users/kristof.farkas-pall/radical-jdakka/lib/python2.7/site-packages/radical/entk/utils/sync_initiator.py", line 25, in sync_with_master
correlation_id = corr_id
File "/Users/kristof.farkas-pall/radical-jdakka/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 2077, in basic_publish
mandatory, immediate)
File "/Users/kristof.farkas-pall/radical-jdakka/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 2163, in publish
immediate=immediate)
File "/Users/kristof.farkas-pall/radical-jdakka/lib/python2.7/site-packages/pika/channel.py", line 415, in basic_publish
raise exceptions.ChannelClosed()
ChannelClosed
Exception in thread enqueue-thread:
Traceback (most recent call last):
File "/Users/kristof.farkas-pall/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/Users/kristof.farkas-pall/anaconda2/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/Users/kristof.farkas-pall/radical-jdakka/lib/python2.7/site-packages/radical/entk/appman/wfprocessor.py", line 338, in _enqueue
raise Error(text=ex)
Error: Error:
This is the error that shows up in the logs.
This is the docker log of one of @kristofarkas' runs using RabbitMQ on the RADICAL-VM2: log_rabbitMQ.log
I tested again with the latest RC stack (RP/RU/RS) and experiments/jdakka-resubmission (EnTK). I ran from the radical-VM but connected to a new docker rabbitmq instance on a jetstream VM. Again, this rabbit was only used for this run.
I've added the logs from the client side as well as the docker logs: docker.log, bad_data.zip
The sandbox is /u/sciteam/dakka/scratch/radical.pilot.sandbox/rp.session.two.jdakka.017530.0027/pilot.0000; permission is granted.
The logfiles look similar: the UMGR gets 15 units for output staging and handles those, but then seems to hang and does not attempt to pull any more units.
The EnTK log again says that connections were dropped, but this time without an exception.
HtH, Andre
@andre-merzky @vivek-bala I've included an out-of-the-box test script for running the adaptive workload: https://github.com/radical-cybertools/htbac/tree/adaptive/experiments/ta_3_1. There's a README that describes dependencies, etc. Also, the rabbitmq port referenced in the script was generated specifically for this testing and has not been used elsewhere. It points to docker running on the RADICAL-VM2.
FYI the dependency-links will install RP/RU/RS master branches, while EnTK uses experiments/jdakka-resubmission branch.
Bump. Is this issue still encountered? If this is not an issue any more, please close the ticket.
Hey @jdakka , if you don't see this issue any more, please close this ticket.
@vivek-bala this issue was raised for the longer-running units on Blue Waters. I will test the adaptive experiments from the ISC paper again, which have longer-running units. The last time I tested those experiments was around January 24th, and the persistent heartbeat problem was still there, which hindered RP termination. I'll rerun the same set. Sorry to keep this ticket open for so long.
Blocked by #1546
So installing the VE in my radical.pilot.sandbox
FYI, I have to revert to using devel in RP, as I cannot use the master RP stack due to the uid problem...
Based on today's discussion I'll test with the RP release candidate as opposed to master or development.
bump
Sorry, queues were long; the experiment didn't complete until yesterday. This is to confirm that both long-running CUs complete and adaptivity still works!
stack:
radical.entk : 0.6.0-entk-0.5-289-g74e2b97@master
radical.pilot : 0.47.1
radical.utils : 0.47.1
saga : 0.47
My application runs to completion, i.e. no FAILED units; however, either EnTK or RP is not terminating properly. As a result, rabbitmq continues to send pulses until the walltime is reached.
debug.log has the run's log. I gave you permission to access the agent files on BW: /u/sciteam/dakka/scratch/radical.pilot.sandbox/rp.session.two.jdakka.017524.0007
data_for_vivek.zip