Open shaoyucheng opened 3 years ago
Hi @shaoyucheng. No, multiple users cannot share a queue. Each user will create their own server based on their UID.
Got it. I think it would be a good feature, one that would make your project like an enhanced version of the atd service.
Sounds like an interesting feature. I will keep this issue open for updates of this feature.
I need this too for our shared Volta GPU server.
It seems i was able to setup a shared queue with $TS_SOCKET, as mentioned in TRICKS. Thanks for making task-spooler.
Hi @wolfram77. Yes, sharing the server file can be a quick and dirty way to share the queue, but be aware that it has a lot of limitations since jobs are user-independent (for example, -C will erase all your colleagues' queues, and -K can be invoked by anyone).
@justanhduc While trying it out yesterday I saw that -K deletes the socket file, so I had to chmod it again. It shouldn't be a problem, but I have now put a message about it in the help text on the server.
I too would be interested in the multi-user mode, even if it means all users can kill tasks from anyone.
Hey @fearedspark. Thanks for your interest. Indeed, there is a working prototype in the branch global. However, there's an ambiguity in setting the number of slots. Should we use the same or a different number of slots for each user? What is the proper number? Or does it have to be something that users agree on among themselves? I have not been able to come up with a good solution, so any suggestions are welcome.
Well, I will describe how I manage it on our machine, and maybe it will provide some insight. I have it configured with as many slots as there are threads on the machine. A user starting a task defines the number of slots it takes based on the number of threads it can use. It would be nice to have a configurable default slot size, so that when a user doesn't give a number of slots, it defaults to the max. Then each user is free to use as many slots as they desire. However, this only works well if all users behave properly, which is the case for us. It could be a good idea to have a maximum number of slots allowed per user, defaulting to the max number of slots.
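The slot policy described above can be sketched as a small wrapper. This is only an illustration: slots_for_job and the per-user cap are hypothetical and not task-spooler features, and the ts calls are left as comments, assuming -S sets the server's slot count and -N sets the slots a job needs.

```shell
#!/bin/bash
# Hypothetical helper (not part of task-spooler): decide how many slots a
# job may request. Usage: slots_for_job CAP [REQUESTED]
# If REQUESTED is omitted it defaults to CAP; larger requests are clamped.
slots_for_job() {
    local cap="$1"
    local requested="${2:-$1}"
    if [ "$requested" -gt "$cap" ]; then
        echo "$cap"
    else
        echo "$requested"
    fi
}

total=$(nproc)     # one slot per hardware thread, as on the machine described
user_cap=$total    # assumed per-user cap, defaulting to the max number of slots

# Intended use (commented out so the sketch runs without ts installed):
# ts -S "$total"
# ts -N "$(slots_for_job "$user_cap" 8)" ./a.out
```

As noted in the thread, nothing here enforces the cap; it still relies on every user going through the same wrapper.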
Hey @fearedspark. Yeah basically we still have to depend on the kindness of other users 😅. Then I will try to look at the prototype again and see whether I can make it stable or not. Thanks a lot for the initiatives!
Dear all, I have already developed a multi-user version, for CPU only, at task-spooler. If you find it interesting or useful, maybe we could try to merge it back. However, I am not an expert on Linux, so there is still much room for improvement and there are bugs to fix. Cheers
Hey @kylincaster. Awesome! Would you mind sending a PR? I will try to review it and we can discuss more how to improve from there.
I just submitted the PR. You could give it a try, @justanhduc.
Hey @kylincaster. You made a PR in your fork. Could you please make the PR again in here?
OK, I have done so, with full details about the features and bugs in my work.
@justanhduc I found that to control a task precisely, the PIDs of all its subprocesses need to be known in advance. So I use a bash script to control the running state of the task. Porting the bash script to C code would be hard work.
Hi @kylincaster. Sorry for the late reply. What do you mean by "precise control"? What is your use case for which -p is not enough?
Hi @justanhduc, I mean pausing or killing a process with ts. Not only the process itself but also all of its subprocesses should be handled, so recursive code is necessary to find the PIDs of all subprocesses.
Hi @kylincaster. To kill or pause a process and its children, can we just simply send the signal to the whole process group like the memo here? Or is there anything I missed?
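A minimal sketch of signalling a whole process group, assuming the job and its children share one group. signal_group is a hypothetical helper name, and the ts -p call in the final comment follows the usage that appears later in this thread.

```shell
#!/bin/bash
# Hypothetical helper: send a signal to the whole process group that PID
# belongs to. Usage: signal_group PID [SIGNAL]   (SIGNAL defaults to TERM)
signal_group() {
    local pid="$1" sig="${2:-TERM}" pgid
    # Look up the process group ID of the given PID.
    pgid=$(ps -o pgid= -p "$pid" | tr -d ' ')
    if [ -z "$pgid" ]; then
        echo "no such process: $pid" >&2
        return 1
    fi
    # A negative PID argument makes kill address the whole process group.
    kill -s "$sig" -- "-$pgid"
}

# Intended use (commented out so the sketch runs without ts installed):
# signal_group "$(ts -p <jobid>)" STOP
```

This only reaches processes that stay in the group and that act on the signal themselves.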
Hi @justanhduc. I did try to kill the process directly. Unfortunately, the stop signal does not work for tasks with subprocesses. The following example script cannot be paused by the kill -stop -- -XXX command:
#!/bin/bash
#
for i in {2..1000}
do
    dt=$(date)
    echo "output: ${dt} $i" >> log.txt
    sleep 1
done
when it is run with the ts command ts mpirun -np 1 loop.sh
Only the parent process mpirun is paused, not the bash subprocess.
Hey @kylincaster. According to the documentation of mpirun 2.1.1 on Ubuntu 18.04, mpirun only propagates a selected number of signals. When dealing with this kind of program, imo, ts has no authority to manipulate the created subprocesses because, well, it would violate the purpose of such a program.
And specifically for your problem, be sure to check the Ubuntu version and mpirun version. On 18.04 with mpirun 2.1.1, I can successfully stop/continue the job with the following commands:
ts mpirun --mca orte_forward_job_control 1 -np 1 toy.sh
kill -20 $(ts -p <jobid>) # stop the mpi process. Note that SIGSTOP does not work per documentation
kill -18 $(ts -p <jobid>) # continue
Ps: Our discussion about sending signal seems not to be in the scope of this issue, so if you still have any problem it's better to open another ticket and we can continue there.
Thanks for @justanhduc's comments on the behavior of mpirun. Unfortunately, it depends on the MPI implementation: Intel MPI processes don't forward such signals. So my solution to this problem is the following bash script, which is called from inside task-spooler.
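#!/bin/bash
# Signal a process and all of its descendants.
# Usage: <script> PID [SIGNAL]   (without SIGNAL, only list the PIDs)
# getting children generally resolves nicely at some point

# Print the direct children of the given PID(s), one per line.
# pgrep -P accepts a comma-separated list of parent PIDs.
get_child() {
    pgrep -P "$1"
}

# Print all descendants of PID $1, followed by $1 itself.
get_children() {
    local ret children=""
    ret=$(get_child "$1")
    while [ -n "$ret" ]; do
        children+="$(echo $ret) "
        # Look up the next generation for all PIDs of this level at once.
        ret=$(get_child "$(echo $ret | tr ' ' ',')")
    done
    echo "${children}$1"
}

if [ "$#" -lt 1 ]; then
    echo "no input PID"
    exit 1
fi

owner=$(ps -o user= -p "$1")
if [ -z "$owner" ]; then
    # not a valid PID
    exit 1
fi

pids=$(get_children "$1")
user=$(whoami)
extra=""
# Signalling another user's processes requires sudo.
if [[ "$owner" != "$user" ]]; then
    extra="sudo"
fi

for pid in ${pids}; do
    if [ -z "$2" ]; then
        echo "${extra} ${pid}"
    else
        ${extra} kill -s "$2" "${pid}"
    fi
done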
It seems i was able to setup a shared queue with $TS_SOCKET, as mentioned in TRICKS. Thanks for making task-spooler.
Can you share details in how you got this setup? I've defined a socket but still can't see anything from other users... @justanhduc would you be able to help with this?
I set TS_SOCKET=/tmp/ts.socket in /etc/environment and ran chmod 777 "$TS_SOCKET".
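The setup above can be sketched as follows. share_socket is a hypothetical helper, and the sketch assumes that every user exports the same TS_SOCKET path and that the server has already created the socket file. Since -K deletes the socket file, the chmod has to be repeated after the server restarts.

```shell
#!/bin/bash
# Hypothetical helper: make the spooler socket accessible to all users.
# Usage: share_socket [PATH]   (defaults to $TS_SOCKET, then /tmp/ts.socket)
share_socket() {
    local sock="${1:-${TS_SOCKET:-/tmp/ts.socket}}"
    if [ ! -e "$sock" ]; then
        echo "socket $sock not found; start the server first" >&2
        return 1
    fi
    # World-accessible, as in the thread; any user can then control the queue.
    chmod 777 "$sock"
}

# Intended use (commented out so the sketch runs without ts installed):
# export TS_SOCKET=/tmp/ts.socket
# ts                 # first call starts the server and creates the socket
# share_socket
```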
Thanks. I was calling tsp via a bash script; it turns out environment variables aren't exposed to bash scripts by default.
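One way around this, as a sketch, is to set the variable inside the script itself instead of relying on the caller's environment. The path /tmp/ts.socket is the one used earlier in this thread; the ts call is left as a comment.

```shell
#!/bin/bash
# Fall back to the shared socket path when the calling environment
# (for example a web server or cron) does not provide TS_SOCKET.
export TS_SOCKET="${TS_SOCKET:-/tmp/ts.socket}"
echo "using socket: $TS_SOCKET"

# Intended use (commented out so the sketch runs without ts installed):
# ts -l
```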
What about your logs though? I've got the shared queue working but still can't access logs from tasks queued from other users.
Is it tsp? I am able to see the tasks queued by other users with ts or ts -l. I store the program output with a pipe like stdbuf --output=L ts -nf -N 32 ./a.out | tee -a "a.log" from a script. Are you interested in the program output of other users?
I run a Node process that can take {x} time and prints progress and results. For example, the job below has an error and was run by the web server, but /tmp/ts-out.1LkaYj doesn't exist for me. I run Apache and SSH into the server as the same user (ubuntu).
52 finished /tmp/ts-out.1LkaYj 1 84.95/1.43/0.16 {my_command}
Could you try redirecting both stdout and stderr to a file? If that does not work for you, @justanhduc may be able to help you.
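A minimal sketch of that redirection, with a hypothetical helper name run_logged and example file names. The ts line in the final comment is an assumption about how the job would be queued, not a tested invocation.

```shell
#!/bin/bash
# Hypothetical helper: run a command with stdout and stderr both appended
# to one log file. Usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
    local log="$1"
    shift
    "$@" >>"$log" 2>&1
}

# Intended use (commented out; a.log and ./a.out are example names):
# run_logged a.log ./a.out
# ts bash -c './a.out >> a.log 2>&1'
```

Writing to an explicit path that all users can read avoids depending on where the spooler puts its own output file.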
Hi @sadikyalcin @wolfram77. First of all, tsp is the original version, not the one in this fork. Please uninstall it using apt and install the one here using make cpu. If the same problem happens, could you verify that you have the right to write in /tmp? Also, why is the ts.socket file not in /tmp?
Also, if you want a proper multi-user task spooler, the fork of @kylincaster is probably a better choice.
Dear all,
If anyone is looking for a multi-queue task manager, you are welcome to try my fork at kylincaster/task-spooler-PLUS. It has been enhanced with numerous useful features, including multiple user support, fatal crash recovery, and processor allocation and binding.
Best regards, Kylin
If true, how can I achieve it? Thanks.