Closed. euhruska closed this issue 6 years ago.
It looks like there might be some stray processes from earlier runs which did not terminate correctly. Did you have any such runs? Killing the old tmux session probably cleaned up those processes.
I try to kill earlier processes with Ctrl+C; if that is not fast enough, with Ctrl+Z.
Unfortunately, when I then check with ps
I see dozens of zombie processes. Killing all my processes always fixes it.
Issuing a Ctrl-C
will tell RP to terminate all processes, and also to connect to all pilots to shut them down. The latter in particular can be time consuming, and it may take a while until the whole stack is torn down. If you press Ctrl-C
again, or suspend the process and kill it (Ctrl-Z; kill %1
), then the teardown procedure itself is interrupted, and you are likely to end up with stale processes. Worse, you might end up with stale and unusable pilots wasting your allocation...
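If you do end up with stale processes, a cleanup along these lines can help (a sketch; the `radical` name pattern is an assumption about how the helper processes are named — inspect the list before killing anything):

```shell
# List any leftover processes of this user matching 'radical'
# (the pattern is an assumption; verify the list before killing).
pgrep -u "$USER" -af radical || echo "no stale radical processes"
# If the listed processes are indeed stale, remove them:
# pkill -u "$USER" -f radical
```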
is there a possibility to have the current entk log show how all processes are shut down? I would be less inclined to CTRL-Z if I see something is happening (the shut down).
I am not sure I understand the 'how' part. Do share that; we are open to feedback on usability. Maybe we can print a message such as "Clean termination may take up to X mins, please wait". Is that what you are suggesting?
Let me mention that EnTK, by design, does not know the architecture or implementation of RP, nor do we inspect its internal components. We interact through the exposed API, which tells us whether an RP session is closed or active. We do not know what percentage of the termination has completed, so it is hard (maybe even overkill) to say "X% of termination done". But I believe that if you enable the RP logs (RADICAL_PILOT_VERBOSE=DEBUG), you will see that information (written to a file, not stdout).
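For completeness, enabling those logs looks roughly like this (the script name is a placeholder, and the exact log file layout under the session directory may differ between RP versions):

```shell
# Enable verbose RP logging before starting the workflow
export RADICAL_PILOT_VERBOSE=DEBUG
# python my_workflow.py                  # placeholder for your EnTK script
# Afterwards, search the session logs for termination messages, e.g.:
# grep -i terminat rp.session.*/*.log    # hypothetical log location
```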
Does that help?
I tried waiting for Ctrl+C to shut everything down properly, but it has been 30 minutes and it still has not finished.
Uh oh, that's way too long... :/ Sorry, then I need to ask for some more details:
What OS are you on? What is the RCT stack you are running right now? Can you create and pack a set of logfiles for this run? (Please grab them before you issue a kill
command to remove the processes manually.) Thanks!
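Packing the logs could look something like this (a sketch; the demo directory and file names below are placeholders — substitute your real session directories and log files):

```shell
# Demo setup: stand-in for a real session directory (placeholder names)
mkdir -p /tmp/rct-log-demo/rp.session.demo
echo demo > /tmp/rct-log-demo/rp.session.demo/bootstrap_0.out
# Pack the session directory into one archive before any manual kill
tar czf /tmp/rct-log-demo/rct-logs.tgz -C /tmp/rct-log-demo rp.session.demo
# Verify the archive contents
tar tzf /tmp/rct-log-demo/rct-logs.tgz
```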
@vivek-bala yes, messages like "starting termination step 1", "step 1 finished", "waiting for step 2", ..., "all termination steps finished" would be useful.
@andre-merzky OS: Red Hat Enterprise Linux 7. Output of radical-stack:
python : 2.7.14
pythonpath :
virtualenv : extasy11
radical.analytics : v0.45.2-102-gaec2e1d@devel
radical.entk : 0.7.4-0.7.4@HEAD-detached-at-af273638
radical.pilot : 0.50.8-v0.50.8-6-g6e36785f@devel
radical.utils : 0.50.1-v0.50.1-1-gce60836@devel
saga : 0.50.0-v0.50.0@devel
Logs: re.session.leonardo.rice.edu.eh22.017795.0016.zip, rp.session.leonardo.rice.edu.eh22.017795.0018.zip, log_extasy_tica3_villin_small.log
Thanks!
@vivek-bala, can you please have a look at the errors in log_extasy_tica3_villin_small.log
? I am not sure whether they interfere with RP's termination sequence.
@euhruska: It seems like the pilot termination gets interrupted, but I am not sure why, yet. Could you please also attach the bootstrap*
files from the pilot sandbox? Thanks!
My main issue is that this persistent duplicate error
always happens for the villin system, and because of that I cannot run this system at all. I tried cleaning up the MongoDB and restarting everything, but that didn't help. Other systems, such as BBA, work at the same time. Maybe it's because of the interrupted pilot terminations.
Here the bootstrap files:
bootstrap-files.zip
Update: since I moved from 'two.radical-project.org' to my own RabbitMQ server, this issue has not happened again.
Thanks, I think the issue might be similar to #93; please see my comment there. The RabbitMQ change makes sense too, since it minimizes communication delays.
This error happened for a while until I restarted my whole tmux environment; then it went away. What causes it? @vivek-bala