Open csampat opened 6 years ago
One more observation: the PBM controller does not get initiated, since we use up the quota of 68 cores we specify, i.e. 64 for DEM and 4 for PBM. Output of that unit:
ls test1stampede2/pilot.0000/unit.000003/
controllerPBMDataInterpretor.py controllerPBMDataReader.py controllerPBMresourceMain.py liggghts_restart.py
output of grep ERROR *.log
on the client side:
(rp_couple) chai@xcalibur:~/Documents/git/coupled_rp_Cyber/src/executor/test1stampede2$ grep ERROR *.log
control.pubsub.bridge.0000.log:2017-12-03 14:54:59,934: control.pubsub.bridge.0000: control.pubsub.bridge.0000.child: MainThread : ERROR : abort: KeyboardInterrupt()
log.pubsub.bridge.0000.log:2017-12-03 14:54:59,934: log.pubsub.bridge.0000: log.pubsub.bridge.0000.child : MainThread : ERROR : abort: KeyboardInterrupt()
pmgr.0000.launching.0.child.log:2017-12-03 14:54:59,933: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : ERROR : abort: KeyboardInterrupt()
pmgr.0000.launching.0.child.log:2017-12-03 14:55:01,309: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : ERROR : finalization error: KeyboardInterrupt()
pmgr.launching.queue.bridge.0000.log:2017-12-03 14:54:59,933: pmgr.launching.queue.bridge.0000: pmgr.launching.queue.bridge.0000.child: MainThread : ERROR : abort: KeyboardInterrupt()
state.pubsub.bridge.0000.log:2017-12-03 14:54:59,933: state.pubsub.bridge.0000: state.pubsub.bridge.0000.child : MainThread : ERROR : abort: KeyboardInterrupt()
umgr.0000.scheduling.0.child.log:2017-12-03 14:54:59,934: umgr.0000.scheduling.0.child: umgr.0000.scheduling.0 : MainThread : ERROR : abort: KeyboardInterrupt()
umgr.0000.staging.input.0.child.log:2017-12-03 14:54:59,934: umgr.0000.staging.input.0.child: umgr.0000.staging.input.0 : MainThread : ERROR : abort: KeyboardInterrupt()
umgr.0000.staging.output.0.child.log:2017-12-03 14:54:59,933: umgr.0000.staging.output.0.child: umgr.0000.staging.output.0 : MainThread : ERROR : abort: KeyboardInterrupt()
umgr.reschedule.pubsub.bridge.0000.log:2017-12-03 14:54:59,933: umgr.reschedule.pubsub.bridge.0000: umgr.reschedule.pubsub.bridge.0000.child: MainThread : ERROR : abort: KeyboardInterrupt()
umgr.scheduling.queue.bridge.0000.log:2017-12-03 14:54:59,933: umgr.scheduling.queue.bridge.0000: umgr.scheduling.queue.bridge.0000.child: MainThread : ERROR : abort: KeyboardInterrupt()
umgr.staging.input.queue.bridge.0000.log:2017-12-03 14:54:59,934: umgr.staging.input.queue.bridge.0000: umgr.staging.input.queue.bridge.0000.child: MainThread : ERROR : abort: KeyboardInterrupt()
umgr.staging.output.queue.bridge.0000.log:2017-12-03 14:54:59,934: umgr.staging.output.queue.bridge.0000: umgr.staging.output.queue.bridge.0000.child: MainThread : ERROR : abort: KeyboardInterrupt()
umgr.unschedule.pubsub.bridge.0000.log:2017-12-03 14:54:59,934: umgr.unschedule.pubsub.bridge.0000: umgr.unschedule.pubsub.bridge.0000.child: MainThread : ERROR : abort: KeyboardInterrupt()
update.0.child.log:2017-12-03 14:54:59,934: update.0.child : update.0 : MainThread : ERROR : abort: KeyboardInterrupt()
These errors appear because I cancelled the run. The same grep gives no output on the resource side.
Can you upload a tarball with the logs from both the resource and the agent side? We want to see exactly what state the unit was in when the cancel signal was sent.
Also can you add your stack? It will help to make the correct fix.
Radical Stack
(rp_couple) chai@xcalibur:~/Documents/git/coupled_rp_Cyber/src/executor/test1stampede2$ radical-stack
python : 2.7.14
pythonpath :
virtualenv : rp_couple
radical.pilot : 0.47-v0.46.2-189-gf49364c4@experiment-cybermanufacturing
radical.utils : 0.47-v0.46-77-ga7b4e00@rc-v0.46.3
saga : 0.47-v0.46-32-ga2f9dedc@rc-v0.46.3
Please do: radicalpilot-close-session -m export -s test1stampede2
This will produce a json file. Upload it here please
I do not get any json files. This is the error message I get.
Traceback (most recent call last):
File "/home/chai/Documents/git/coupled_rp_Cyber/src/executor/rp_couple/bin/radicalpilot-close-session", line 246, in <module>
mongo, db, dbname, cname, pname = ru.mongodb_connect(str(url), _DEFAULT_DBURL)
File "/home/chai/Documents/git/coupled_rp_Cyber/src/executor/rp_couple/lib/python2.7/site-packages/radical/utils/misc.py", line 99, in mongodb_connect
mongo = pymongo.MongoClient(host=host, port=port, ssl=ssl)
File "/home/chai/.local/lib/python2.7/site-packages/pymongo/mongo_client.py", line 377, in __init__
raise ConnectionFailure(str(e))
pymongo.errors.ConnectionFailure: timed out
Your RADICAL_PILOT_DBURL does not point to the correct db
thanks!
changed title since similar behaviour is also observed for the PBM
I will give some context on how Compute Units (CUs) are executed before explaining what is happening and why.
RP launches each CU by starting a process, and it keeps track of the ids of the processes used to launch a unit. From that point on, depending on the unit's description, more processes may or may not be launched. Together these form a process group. When RP cancels a unit, it sends a SIGTERM signal to the group. That means every process launched while the unit is executing receives the signal.
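The group-wide cancel described above can be sketched in a few lines of Python. This is only an illustration, not RP's actual implementation: `launch_unit` and `cancel_unit` are hypothetical helper names, the SIGKILL fallback and its timeout are assumptions, and the code targets Python 3 on a POSIX system rather than the Python 2.7 stack reported in this thread.

```python
import os
import signal
import subprocess
import sys


def launch_unit(cmd):
    """Start cmd as the leader of a new process group (hypothetical helper).

    start_new_session=True makes the child call setsid(), so its process
    group id equals its pid and later children join that group by default.
    """
    return subprocess.Popen(cmd, start_new_session=True)


def cancel_unit(proc, timeout=5.0):
    """Send SIGTERM to the whole process group, then wait for the leader.

    The SIGKILL fallback is an assumption for this sketch, not something
    the thread states RP does.
    """
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for a unit: a Python process that just sleeps.
    unit = launch_unit([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit status:", cancel_unit(unit))
```

A negative exit status from `Popen.wait()` means the leader was terminated by the signal with that number.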
Now the tricky part: depending on how the rest of the processes are launched, that SIGTERM may or may not be passed along. For example, in initial tests we saw that when ssh receives such a signal, it does not forward it to any process it launched.
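Why a group signal can miss part of a unit can be shown with a local analogue. This is not the ssh case itself, where the launched process lives on another host: here a launcher starts its worker in a separate session, so the worker sits outside the launcher's process group and the group SIGTERM never reaches it. The `LAUNCHER` script, `pid_exists` and `run_demo` are hypothetical names for this sketch, and the code assumes Python 3.7+ on a POSIX system.

```python
import os
import signal
import subprocess
import sys

# Hypothetical launcher: starts a worker in its own session, reports the
# worker's pid on stdout, then idles like a long-running launch command.
LAUNCHER = r"""
import subprocess, sys, time
worker = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    stdout=subprocess.DEVNULL,
    start_new_session=True)
print(worker.pid, flush=True)
time.sleep(60)
"""


def pid_exists(pid):
    """Return True if a process with this pid currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence check
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but belongs to another user
    return True


def run_demo():
    """Cancel the launcher's group and report whether the worker survived."""
    launcher = subprocess.Popen(
        [sys.executable, "-c", LAUNCHER],
        stdout=subprocess.PIPE, text=True, start_new_session=True)
    worker_pid = int(launcher.stdout.readline())

    # Group-wide SIGTERM: reaches the launcher, not the detached worker.
    os.killpg(os.getpgid(launcher.pid), signal.SIGTERM)
    launcher_rc = launcher.wait()
    launcher.stdout.close()

    worker_survived = pid_exists(worker_pid)
    if worker_survived:
        os.kill(worker_pid, signal.SIGTERM)  # clean up the stray worker
    return launcher_rc, worker_survived


if __name__ == "__main__":
    rc, survived = run_demo()
    print("launcher exit status:", rc, "worker survived:", survived)
```

In a situation like this, the stray worker has to be tracked and terminated separately, for example by recording its pid as the demo does.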
For this specific case, I am not entirely sure how ibrun behaves, so I do not want to say definitively whether this will work or not.
Either way, we are looking at this and trying to figure out a way to terminate all the processes of a unit correctly and not leave things running without a good reason.
All that to say, please give it another try and report your experience here!
PS: @andre-merzky please correct me if I made any mistake in the explanation
Anything new?
Hi, I had taken the weekend off, so I tested RP today. It no longer moves from the PBM to the DEM. It receives the status files from PBM with status 1, which means that DEM needs to be restarted. Unfortunately, RP gets stuck there and none of the executables run, even though the resources are available. On the brighter side, I think it was able to cancel the PBM units while they were executing mid-way. I was not able to test the DEM.
Let's do the following: please review PR #42, and after it is merged the same test should be run with the debug logs on.
Merged! Please rerun with CYBER_EXECUTOR_VERBOSE=DEBUG and RADICAL_PILOT_VERBOSE=DEBUG. Afterwards, please pack the logs from the client and the remote resource and paste them here. I cannot think what may be going wrong. Also, does the first DEM get canceled?
It got stuck on the restart of DEM again. This time I also checked, and the first DEM was not cancelled by RP either. resource_log.tar.gz client_log.tar.gz
That last test tripped over an error in RP, in the new unit cancellation code. This is now fixed in the RP branch fix/issue_1510 - please update the RP installation and try once more. Thanks!
The change is pushed in the RP branch we are using
I merged the fix by @andre-merzky locally into our branch, but the processes still do not cancel. The restart works, though. I shall upload a tarball in some time, since the sims are still running.
You have now also merged the whole devel stack into your local copy. I would like to avoid that and use only the pushes that I or @andre-merzky make to the experiment/cybermanufacturing branch.
Please stop the run, pull what exists in git and try again. Let's not make large changes while we are still developing and debugging. When this is stable and we say this is it, then I, with Andre's help, will update the stack in use.
Thank you
Okay sorry! I did not realize that. So, I did a fresh clone and re-ran the simulations, but still the same issue. It does not kill the previous executions.
Okay, please put the logs here to see what they have to say. Also can you provide a small setup that will allow me to test as well? Let's say a DEM that runs for 5 to 10 mins and has to be canceled, PBM and another DEM after that?
Thank you!
resource_log.tar.gz client_log.tar.gz
Okay, so I am uploading another branch as chaitanya/devel ... it has the files I use, since a few changes were required in your file. The DEM sims run for 20 minutes and PBM runs for about 6 minutes. On the resource side, also do a pull, since I have updated the input files.
Can you name it test/issue_43? The branch name should make it immediately clear what the branch is for. This is a testing branch for issue #43.
okay renamed to test/issue_43
any update on this issue @iparask ?
We started the execution manually, outside RP, and tried to kill it using shell commands. This gave us an error saying that the process did not exist. As a result, we opened a ticket with TACC to pinpoint whether we are missing something (I doubt it, though; @andre-merzky is a shell guru) or whether there is some other issue on their side.
RP shifts from DEM to PBM but does not kill DEM in the process. DEM keeps running in the background.
When I log into the node and use top, I can see the DEM process still using 64 cores, and the simulation output has gone beyond the stop point.