Cancelling commands run by Pegasus is very difficult. You essentially have to ssh into each node, manually find the PIDs of the commands, and kill them.
Nested commands make things more complicated. For instance, docker exec sh -c "python train.py" will run the following commands:
Run by user: sh -c docker exec sh -c "python train.py"
Run by user: docker exec sh -c "python train.py"
Run by root: sh -c "python train.py"
Run by root: python train.py
Only killing the fourth command, python train.py, will truly achieve cancellation. The bottom line is that it is difficult for Pegasus to infer how to properly terminate a command.
Potential solutions
We might ask the user for a cancellation command in queue.yaml. For example, sudo kill $(pgrep -f 'train.py'). Then the ctrl_c handler will create a new connection to the hosts and run the designated cancellation command.
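A minimal sketch of this first idea, written in Python purely for illustration and not taken from Pegasus itself. The function names (build_cancel_argv, cancel_on_hosts, install_ctrl_c_handler) and the choice to shell out to the ssh binary are assumptions; the cancellation string stands in for whatever the user would supply in queue.yaml.

```python
import signal
import subprocess
from typing import Callable, Dict, List


def build_cancel_argv(host: str, cancel_cmd: str) -> List[str]:
    """Build the argv for running the user's cancellation command on one host.

    The command is passed as a single argument so the remote shell
    receives it exactly as the user wrote it.
    """
    return ["ssh", host, cancel_cmd]


def cancel_on_hosts(
    hosts: List[str],
    cancel_cmd: str,
    runner: Callable = subprocess.run,
) -> Dict[str, int]:
    """Open a fresh connection to every host and run the cancellation command.

    Returns the exit status per host so the caller can report failures.
    `runner` is injectable (hypothetical design choice) to allow dry runs.
    """
    results = {}
    for host in hosts:
        completed = runner(build_cancel_argv(host, cancel_cmd), check=False)
        results[host] = completed.returncode
    return results


def install_ctrl_c_handler(hosts: List[str], cancel_cmd: str) -> None:
    """Register a SIGINT handler that cancels remotely, then interrupts locally."""

    def handler(signum, frame):
        cancel_on_hosts(hosts, cancel_cmd)
        raise KeyboardInterrupt

    signal.signal(signal.SIGINT, handler)
```

One thing to check with the example command: when it is run over ssh, the remote shell's own command line contains the pattern, so pgrep -f 'train.py' may match that shell as well as the training process.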
Somehow figure out the PGID of the sh process and run sudo kill -- -PGID. Can we pgrep -f with the entire command? Shell escaping might become a problem. (Alternatively, run pgrep -f with every single word in the command and kill the intersection of all PIDs returned?)
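To make the "intersection of all PIDs" idea concrete, here is a small Python sketch that simulates it on an in-memory process table instead of calling pgrep. Everything here is hypothetical: the PIDs are made up, the table just mirrors the four commands listed earlier, and pids_matching uses plain substring matching as a stand-in for pgrep -f (which actually treats its pattern as a regular expression).

```python
import shlex
from typing import Dict, Set


def pids_matching(word: str, table: Dict[int, str]) -> Set[int]:
    """Stand-in for `pgrep -f WORD`: PIDs whose command line contains the word."""
    return {pid for pid, cmdline in table.items() if word in cmdline}


def intersect_matches(command: str, table: Dict[int, str]) -> Set[int]:
    """Split the command into shell words and intersect the per-word matches."""
    words = shlex.split(command)
    if not words:
        return set()
    result = pids_matching(words[0], table)
    for word in words[1:]:
        result &= pids_matching(word, table)
    return result


def group_kill_command(pgid: int) -> str:
    """Format the process-group kill; the negative ID targets the whole group."""
    return f"sudo kill -- -{pgid}"


# Hypothetical process table mirroring the four nested commands.
TABLE = {
    101: 'sh -c docker exec sh -c "python train.py"',  # run by user
    102: 'docker exec sh -c "python train.py"',        # run by user
    201: 'sh -c "python train.py"',                    # run by root
    202: "python train.py",                            # run by root
}

matched = intersect_matches('docker exec sh -c "python train.py"', TABLE)
print(sorted(matched))
print(group_kill_command(101))
```

On this toy table the intersection selects only the two outer processes (101 and 102), because the inner command lines never contain the words docker or exec. That suggests the per-word intersection alone may not reach the innermost python train.py, at least under these simplified matching assumptions.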