radical-collaboration / extasy-bpti


Inevitable job failure when using many tmux sessions and process hanging on client side #3

Open FranklinBetten opened 6 years ago

FranklinBetten commented 6 years ago

All jobs consistently fail when many tmux sessions, each running a single ExTASY simulation, are used from the same JetStream VM. After the failure, the client and the agent both show all jobs as cancelled, but on the client one of the ExTASY execution processes still appears in the process list when "ps faux | grep python" is run.

Output of "ps faux | grep python" on the VM:

    hal9000 548 0.0 0.5 811848 85488 pts/10 Tl Oct03 0:36 | \_ python nwexgmx_v002.py --RPconfig supermic.rcfg --Kconfig gmxcoco.wcfg

While this process is alive, all new ExTASY simulations that are submitted fail immediately. The process stays alive until it is manually killed with "kill <proc ID>". I have not checked whether it would stay alive indefinitely, but it does so for at least several hours.
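The check-and-kill step described above can be sketched in Python. This is a minimal sketch, not part of ExTASY or RADICAL-Pilot: the helper names `parse_ps`, `is_stopped`, `find_leftovers`, and `terminate` are hypothetical, and it assumes a Linux `ps` that accepts `-eo pid,stat,args`. One detail worth noting from the listing above: the STAT column reads `Tl`, where `T` means the process is stopped by a job-control signal and `l` means it is multi-threaded. A stopped process does not act on SIGTERM until it is resumed, so the sketch sends SIGCONT after SIGTERM.

```python
import os
import signal
import subprocess


def parse_ps(ps_output, needle):
    """Parse `ps -eo pid,stat,args` text and return (pid, stat, args)
    tuples for every process whose command line contains `needle`."""
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)          # pid, stat, rest of the command line
        if len(parts) < 3 or not parts[0].isdigit():
            continue
        pid, stat, args = parts
        if needle in args:
            matches.append((int(pid), stat, args))
    return matches


def is_stopped(stat):
    """True if the ps STAT field marks the process as stopped ("T")."""
    return stat.startswith("T")


def find_leftovers(needle):
    """Query the live process table (hypothetical helper; needs Linux ps)."""
    out = subprocess.check_output(["ps", "-eo", "pid,stat,args"]).decode()
    return [m for m in parse_ps(out, needle) if m[0] != os.getpid()]


def terminate(pid):
    """Ask a possibly stopped process to exit."""
    os.kill(pid, signal.SIGTERM)  # stays pending while the process is stopped
    os.kill(pid, signal.SIGCONT)  # resume it so the pending SIGTERM is delivered
```

For example, `find_leftovers("nwexgmx_v002.py")` would list any leftover driver scripts before a new simulation is submitted, and `terminate(pid)` could then be called on each one. Whether clearing the leftover process is enough to let new simulations run is what the report above suggests, not something this sketch verifies.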

This error occurs on SuperMIC; I have not tested whether it occurs on other machines.

JetStream VM Stack

    (extasy-tools) hal9000@js-17-187:~/Documents/jha$ radical-stack
    python : 2.7.12
    virtualenv : /home/hal9000/Documents/jha/extasy-tools
    radical.utils : 0.45
    saga-python : 0.45.1
    radical.pilot : 0.45.3
    (extasy-tools) hal9000@js-17-187:~/Documents/jha$ ensemblemd-version
    0.4.6
    (extasy-tools) hal9000@js-17-187:~/Documents/jha$ pyCoCo -V
    0.3.2
    (extasy-tools) hal9000@js-17-187:~/Documents/jha$ tmux -V
    tmux 2.1

andre-merzky commented 6 years ago

@FranklinBetten, can you please try to reproduce this with an RP example? That would tell us whether we need to look at the RP or the ExTASY level. Thanks!

FranklinBetten commented 6 years ago

Looks like I never replied to this. There were two issues causing this problem.

  1. A SuperMIC policy that kills jobs when node utilization drops below a certain percentage. This was addressed, and we have since stopped using SuperMIC and now use BW.

  2. You can queue up as many jobs as you want, and as long as they run one at a time they will not cause each other problems. However, when two or more jobs execute at the same time on BW, they cause each other to fail. This seems similar to Eugene's latest issue.
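Since jobs that run one at a time do not interfere, one possible client-side workaround is to serialize the launches from the VM with a file lock. The sketch below is an assumption-laden illustration, not a fix for the underlying problem: `exclusive_lock`, `run_serialized`, and the lock path `/tmp/extasy-run.lock` are hypothetical names, and it only serializes runs started from the same machine. It does not control how the batch system on BW schedules jobs.

```python
import fcntl
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def exclusive_lock(path):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    with open(path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until no other holder remains
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def run_serialized(cmd, lock_path="/tmp/extasy-run.lock"):
    """Run `cmd` (a list of arguments) while holding the lock and
    return its exit code. The lock path is a hypothetical default."""
    with exclusive_lock(lock_path):
        return subprocess.call(cmd)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python serialize.py python nwexgmx_v002.py --RPconfig ... --Kconfig ...
    sys.exit(run_serialized(sys.argv[1:]))
```

Each tmux session would start its simulation through this wrapper. The first one proceeds and the others block in `flock` until it finishes, so only one workflow is active at a time. The cost is that the lock is held for the entire run, including time the job spends waiting in the queue.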