radical-collaboration / CyberManufacturing

CDSE Multi-scale CI Project

RP does not cancel simulation even after showing cancelling unit #43

Open csampat opened 6 years ago

csampat commented 6 years ago

RP shifts from DEM to PBM but does not kill the DEM in the process; the DEM keeps running in the background.

create pilot manager                                                          ok
create unit manager                                                           ok
create pilot description [xsede.stampede2_ssh2:68]                            ok
submit 1 pilot(s)
        .                                                                     ok
add 1 pilot(s)                                                                ok
submit 1 unit(s)
        .                                                                     ok
wait for 1 unit(s)
        +                                                                     ok
submit 1 unit(s)
        .                                                                     ok
wait for 1 unit(s)
        +                                                                     ok
Canceling DEM simulation
wait for 1 unit(s)
        *                                                                     ok
submit 1 unit(s)
        .                                                                     ok
wait for 1 unit(s)
        +                                                                     ok
submit 1 unit(s)
        .                                                                     ok
wait for 1 unit(s)

When I log into the node and use top, I can see the DEM process still using 64 cores, and the simulation output has progressed beyond the stop point.

csampat commented 6 years ago

One more observation: the PBM controller does not get initiated, since we use up the quota of 68 cores we specify, i.e. 64 for DEM and 4 for PBM. Output of that unit:

ls test1stampede2/pilot.0000/unit.000003/
controllerPBMDataInterpretor.py  controllerPBMDataReader.py  controllerPBMresourceMain.py  liggghts_restart.py
csampat commented 6 years ago

Output of grep ERROR *.log on the client side:

(rp_couple) chai@xcalibur:~/Documents/git/coupled_rp_Cyber/src/executor/test1stampede2$ grep ERROR *.log
\control.pubsub.bridge.0000.log:2017-12-03 14:54:59,934: control.pubsub.bridge.0000: control.pubsub.bridge.0000.child: MainThread     : ERROR   : abort: KeyboardInterrupt()
log.pubsub.bridge.0000.log:2017-12-03 14:54:59,934: log.pubsub.bridge.0000: log.pubsub.bridge.0000.child    : MainThread     : ERROR   : abort: KeyboardInterrupt()
pmgr.0000.launching.0.child.log:2017-12-03 14:54:59,933: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : ERROR   : abort: KeyboardInterrupt()
pmgr.0000.launching.0.child.log:2017-12-03 14:55:01,309: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : ERROR   : finalization error: KeyboardInterrupt()
pmgr.launching.queue.bridge.0000.log:2017-12-03 14:54:59,933: pmgr.launching.queue.bridge.0000: pmgr.launching.queue.bridge.0000.child: MainThread     : ERROR   : abort: KeyboardInterrupt()
state.pubsub.bridge.0000.log:2017-12-03 14:54:59,933: state.pubsub.bridge.0000: state.pubsub.bridge.0000.child  : MainThread     : ERROR   : abort: KeyboardInterrupt()
umgr.0000.scheduling.0.child.log:2017-12-03 14:54:59,934: umgr.0000.scheduling.0.child: umgr.0000.scheduling.0          : MainThread     : ERROR   : abort: KeyboardInterrupt()
umgr.0000.staging.input.0.child.log:2017-12-03 14:54:59,934: umgr.0000.staging.input.0.child: umgr.0000.staging.input.0       : MainThread     : ERROR   : abort: KeyboardInterrupt()
umgr.0000.staging.output.0.child.log:2017-12-03 14:54:59,933: umgr.0000.staging.output.0.child: umgr.0000.staging.output.0      : MainThread     : ERROR   : abort: KeyboardInterrupt()
umgr.reschedule.pubsub.bridge.0000.log:2017-12-03 14:54:59,933: umgr.reschedule.pubsub.bridge.0000: umgr.reschedule.pubsub.bridge.0000.child: MainThread     : ERROR   : abort: KeyboardInterrupt()
umgr.scheduling.queue.bridge.0000.log:2017-12-03 14:54:59,933: umgr.scheduling.queue.bridge.0000: umgr.scheduling.queue.bridge.0000.child: MainThread     : ERROR   : abort: KeyboardInterrupt()
umgr.staging.input.queue.bridge.0000.log:2017-12-03 14:54:59,934: umgr.staging.input.queue.bridge.0000: umgr.staging.input.queue.bridge.0000.child: MainThread     : ERROR   : abort: KeyboardInterrupt()
umgr.staging.output.queue.bridge.0000.log:2017-12-03 14:54:59,934: umgr.staging.output.queue.bridge.0000: umgr.staging.output.queue.bridge.0000.child: MainThread     : ERROR   : abort: KeyboardInterrupt()
umgr.unschedule.pubsub.bridge.0000.log:2017-12-03 14:54:59,934: umgr.unschedule.pubsub.bridge.0000: umgr.unschedule.pubsub.bridge.0000.child: MainThread     : ERROR   : abort: KeyboardInterrupt()
update.0.child.log:2017-12-03 14:54:59,934: update.0.child      : update.0                        : MainThread     : ERROR   : abort: KeyboardInterrupt()

This is because I cancelled it. There is no corresponding output on the resource side.

iparask commented 6 years ago

Okay. Here is where this is happening.

iparask commented 6 years ago

Can you upload a tarball with the logs from both the resource and the agent side? We want to see exactly what state the unit was in when the cancel signal was sent.

Also, can you add your stack? It will help us make the correct fix.

csampat commented 6 years ago

client.tar.gz resource.tar.gz

csampat commented 6 years ago

Radical Stack

(rp_couple) chai@xcalibur:~/Documents/git/coupled_rp_Cyber/src/executor/test1stampede2$ radical-stack 

  python               : 2.7.14
  pythonpath           : 
  virtualenv           : rp_couple

  radical.pilot        : 0.47-v0.46.2-189-gf49364c4@experiment-cybermanufacturing
  radical.utils        : 0.47-v0.46-77-ga7b4e00@rc-v0.46.3
  saga                 : 0.47-v0.46-32-ga2f9dedc@rc-v0.46.3
iparask commented 6 years ago

Please run: radicalpilot-close-session -m export -s test1stampede2

This will produce a JSON file. Please upload it here.

csampat commented 6 years ago

I do not get any JSON file. This is the error message I get:

Traceback (most recent call last):
  File "/home/chai/Documents/git/coupled_rp_Cyber/src/executor/rp_couple/bin/radicalpilot-close-session", line 246, in <module>
    mongo, db, dbname, cname, pname = ru.mongodb_connect(str(url), _DEFAULT_DBURL)
  File "/home/chai/Documents/git/coupled_rp_Cyber/src/executor/rp_couple/lib/python2.7/site-packages/radical/utils/misc.py", line 99, in mongodb_connect
    mongo = pymongo.MongoClient(host=host, port=port, ssl=ssl)
  File "/home/chai/.local/lib/python2.7/site-packages/pymongo/mongo_client.py", line 377, in __init__
    raise ConnectionFailure(str(e))
pymongo.errors.ConnectionFailure: timed out
iparask commented 6 years ago

Your RADICAL_PILOT_DBURL does not point to the correct DB.
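For what it's worth, the "timed out" ConnectionFailure above can be sanity-checked with a quick stdlib-only reachability probe of the host named in RADICAL_PILOT_DBURL. This is a hypothetical helper, not part of RP (and it is written for Python 3; on the Python 2.7 stack shown in this thread, import urlparse from the urlparse module instead):

```python
import os
import socket
from urllib.parse import urlparse

# Hypothetical helper (not part of RP): check whether the MongoDB host
# named in RADICAL_PILOT_DBURL accepts TCP connections at all.
def db_reachable(dburl, timeout=5.0):
    parsed = urlparse(dburl)
    host = parsed.hostname or 'localhost'
    port = parsed.port or 27017          # MongoDB default port
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

url = os.environ.get('RADICAL_PILOT_DBURL', 'mongodb://localhost:27017/')
print(db_reachable(url))
```

If this prints False, the timeout comes from basic connectivity (wrong host/port or a firewall), not from RP itself.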

csampat commented 6 years ago

Sorry, got the JSON.

test1stampede2json.txt

For some reason GitHub does not let me upload JSON files.

andre-merzky commented 6 years ago

thanks!

csampat commented 6 years ago

Changed the title, since similar behaviour is also observed for the PBM.

iparask commented 6 years ago

I will give some context on how Compute Units (CUs) are executed before explaining what is happening and why.

RP launches each CU as a process and keeps track of the IDs of the processes used to launch a unit. From that point on, depending on the unit's description, more processes may or may not be launched, which creates a group of processes. When RP cancels a unit, it sends a SIGTERM signal to the group, so every process launched while the unit is executing receives the signal.

Now the tricky part in all this: depending on how the rest of the processes are launched, that SIGTERM may or may not be passed along. For example, in initial tests we saw that when ssh receives such a signal, it does not forward it to any process it launched.

For this specific case, I am not entirely sure how ibrun behaves, so I do not want to say definitively whether this will work or not.

Either way, we are looking at this and trying to figure out a way to terminate all the processes of a unit correctly, and not leave things running without a good reason.

All that to say: please give it another try and report your experience here!

PS: @andre-merzky, please correct me if I made any mistake in the explanation.
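The group-based cancellation described here can be sketched with a minimal, self-contained example (an illustration of the mechanism, not RP's actual code; sleep 60 stands in for a unit's launcher process):

```python
import os
import signal
import subprocess

# Launch the child in its own process group (setsid makes it the group
# leader), so a signal sent to the group reaches its whole process tree.
proc = subprocess.Popen(['sleep', '60'], preexec_fn=os.setsid)

# Cancelling the "unit": signal the entire group, not just the launcher.
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)

proc.wait()
# A negative return code means the child died from that signal.
print(proc.returncode)  # → -15 (SIGTERM)
```

Whether children further down the tree actually die depends on whether each intermediate process forwards or reacts to the signal, which is exactly the ssh/ibrun question below.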

iparask commented 6 years ago

Anything new?

csampat commented 6 years ago

Hi, I had taken the weekend off, so I tested RP today. It no longer moves from the PBM to the DEM: it receives the status files from PBM with status 1, which means that DEM needs to be restarted. Unfortunately, RP gets stuck there, and none of the executables run even when the resources are available. On the brighter side, I think it was able to cancel the PBM units while they were executing mid-way. I was not able to test the DEM.

iparask commented 6 years ago

Let's do the following: please review PR #42, and after that is merged, the same test should be run with the debug logs on.

iparask commented 6 years ago

Merged! Please rerun with CYBER_EXECUTOR_VERBOSE=DEBUG and RADICAL_PILOT_VERBOSE=DEBUG. Afterwards, please pack the logs from the client and the remote resource and paste them here. I cannot think of what may be going wrong. Also, does the first DEM get cancelled?
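For reference, the two settings can also be set from inside the driver script rather than the shell, as long as it happens before radical.pilot is imported (a sketch; the assumption that RP reads these variables at import/logger-creation time is mine, and the variable names are the ones given above):

```python
import os

# Debug settings requested above, exported to the process environment
# before the executor imports radical.pilot (assumption: RP picks these
# up when its loggers are created).
os.environ['CYBER_EXECUTOR_VERBOSE'] = 'DEBUG'
os.environ['RADICAL_PILOT_VERBOSE'] = 'DEBUG'

print(os.environ['RADICAL_PILOT_VERBOSE'])  # → DEBUG
```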

csampat commented 6 years ago

It got stuck on the restart of DEM again. I also checked this time: the first DEM was also not cancelled by RP. resource_log.tar.gz client_log.tar.gz

andre-merzky commented 6 years ago

That last test tripped over an error in RP, in the new unit-cancellation code. This is now fixed in the RP branch fix/issue_1510; please update the RP installation and try once more. Thanks!

iparask commented 6 years ago

The change is pushed to the RP branch we are using.

csampat commented 6 years ago

I merged the fix by @andre-merzky locally into our branch, but the processes still do not cancel. The restart works, though. I shall upload a tarball in some time, since the sims are still running.

iparask commented 6 years ago

Right now you merged the whole devel stack into your local copy as well. I would like to avoid that and use only the pushes that I or @andre-merzky make to the experiment/cybermanufacturing branch.

Please stop the run, pull what exists in git, and try again. Let's not make large changes while we are still developing and debugging. When this is stable and we agree that this is it, then I, with Andre's help, will update the used stack.

Thank you

csampat commented 6 years ago

Okay, sorry! I did not realize that. So I did a fresh clone and re-ran the simulations, but it is still the same issue: it does not kill the previous executions.

iparask commented 6 years ago

Okay, please put the logs here so we can see what they have to say. Also, can you provide a small setup that will allow me to test as well? Say, a DEM that runs for 5 to 10 minutes and has to be cancelled, then a PBM, and another DEM after that?

Thank you!

csampat commented 6 years ago

resource_log.tar.gz client_log.tar.gz

Okay, so I am uploading another branch as chaitanya/devel ... it has the files I use, since a few changes were required in your file. The DEM sim runs for 20 minutes and the PBM runs for about 6 minutes. On the resource side, also do a pull, since I have updated the input files.

iparask commented 6 years ago

Can you name it test/issue_43? The branch name should make it immediately clear what it is for: a testing branch for issue #43.

csampat commented 6 years ago

Okay, renamed to test/issue_43.

csampat commented 6 years ago

Any update on this issue, @iparask?

iparask commented 6 years ago

We started the execution manually, outside RP, and tried to kill it using shell commands. This gave us an error saying that the process did not exist. As a result we opened a ticket with TACC to pinpoint whether we are missing something (I doubt it, though; @andre-merzky is a shell guru) or whether there is some other issue on their side.
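The shell-level "does this process still exist?" check that produced that error amounts to probing the PID with signal 0. A hedged sketch of that probe (an illustration, not RP internals):

```python
import errno
import os

# Hypothetical helper: probe whether a PID still exists by sending
# signal 0, which performs the permission/existence check only and
# delivers no signal.
def pid_alive(pid):
    try:
        os.kill(pid, 0)
    except OSError as e:
        if e.errno == errno.ESRCH:
            return False  # "No such process" -- the error seen in the manual test
        return True       # e.g. EPERM: process exists but belongs to another user
    return True

print(pid_alive(os.getpid()))  # → True (the current process is alive)
```

If the manual kill reported "no such process" while top still showed the DEM on 64 cores, the PID used for the kill likely no longer matched the processes actually doing the work (e.g. a launcher that exited after spawning the MPI ranks).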