justanhduc / task-spooler

A scheduler for GPU/CPU tasks
https://justanhduc.github.io/2021/02/03/Task-Spooler.html
GNU General Public License v2.0

Does it support multiple users share one queue? #5

Open shaoyucheng opened 3 years ago

shaoyucheng commented 3 years ago

If so, how can I set it up? Thanks.

justanhduc commented 3 years ago

Hi @shaoyucheng. No, multiple users cannot share a queue. Each user will create their own server based on their UID.

shaoyucheng commented 3 years ago

Got it. I think it would be a good feature that would make your project something like an enhanced version of the atd service.

justanhduc commented 3 years ago

Sounds like an interesting feature. I will keep this issue open for updates of this feature.

wolfram77 commented 2 years ago

I need this too for our shared Volta GPU server.

wolfram77 commented 2 years ago

It seems I was able to set up a shared queue with $TS_SOCKET, as mentioned in TRICKS. Thanks for making task-spooler.

justanhduc commented 2 years ago

Hi @wolfram77. Yes, sharing the server file can be a quick and dirty way to share the queue, but be aware that it has a lot of limitations since jobs are user-independent (e.g., -C will erase all your colleagues' queues, and -K can be invoked by anyone).

wolfram77 commented 2 years ago

@justanhduc While trying it out yesterday, I saw that -K deletes the socket file, so I had to chmod it again. It shouldn't be a problem, but I have now put a message about it in the help text on the server.

fearedspark commented 2 years ago

I too would be interested in the multi-user mode, even if it means all users can kill tasks from anyone.

justanhduc commented 2 years ago

Hey @fearedspark. Thanks for your interest. Indeed, there is a working prototype in the branch global. However, there's an ambiguity in setting the number of slots. Should we use the same or a different number of slots for each user? What is the proper number? Or does it have to be something that users agree on among themselves? I have not been able to come up with a good solution, so any suggestions are welcome.

fearedspark commented 2 years ago

Well, I will describe how I'm managing it on our machine, and maybe it will provide some insight. I have it configured with as many slots as there are threads on the machine. A user starting a task sets the number of slots it takes based on the number of threads it can use. It would be nice to have a configurable default slot size, so that when a user doesn't give a number of slots, it defaults to the max. Then each user is free to use as many slots as they want. This only works well if all users behave properly, which is the case for us. It could be a good idea to have a maximum number of slots allowed per user, defaulting to the max number of slots.
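
The slot policy described here could be sketched roughly as follows. This is only an illustration, not a feature of ts: slots_for_job is a hypothetical helper, and the commented ts calls assume the -S (set the number of slots) and -N (slots required by a job) options and a placeholder job script.

```shell
#!/bin/bash
# Sketch of the policy: one slot per hardware thread, and a job that does not
# state its slot count takes the maximum.

# nproc (GNU coreutils) reports the number of available processing units.
total_slots=$(nproc)

# Hypothetical helper: print the slot count for a job, defaulting to the
# machine total and never exceeding it.
slots_for_job() {
    local requested=${1:-$total_slots}
    if [ "$requested" -gt "$total_slots" ]; then
        requested=$total_slots
    fi
    echo "$requested"
}

echo "total slots: $total_slots"
echo "job asking for 1 slot gets: $(slots_for_job 1)"
echo "job with no request gets: $(slots_for_job)"

# Usage on a machine where this fork's ts is installed (not run here):
#   ts -S "$total_slots"
#   ts -N "$(slots_for_job 4)" ./my_job.sh   # my_job.sh is a placeholder
```

A per-user cap, as suggested at the end of the comment, would need support in the server itself; a wrapper like this can only enforce the per-job default.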

justanhduc commented 2 years ago

Hey @fearedspark. Yeah basically we still have to depend on the kindness of other users 😅. Then I will try to look at the prototype again and see whether I can make it stable or not. Thanks a lot for the initiatives!

kylincaster commented 1 year ago

Dear all, I have already developed a multi-user version, CPU-only for now, at task-spooler. If you find it interesting or useful, maybe we could try to merge it back. However, I am not an expert on Linux, so there is still a lot of room for improvement and there are bugs to fix. Cheers

justanhduc commented 1 year ago

Hey @kylincaster. Awesome! Would you mind sending a PR? I will try to review it and we can discuss more how to improve from there.

kylincaster commented 1 year ago

I just submitted the PR. You could give it a try, @justanhduc.

justanhduc commented 1 year ago

Hey @kylincaster. You made a PR in your fork. Could you please make the PR again in here?

kylincaster commented 1 year ago

OK, done. I included full details about the features and bugs in my work.

kylincaster commented 1 year ago

@justanhduc I found that to control a task precisely, the PIDs of all its subprocesses need to be known in advance. So I use a bash script to control the running state of the task. Porting the bash script to C would be hard work.

justanhduc commented 1 year ago

Hi @kylincaster. Sorry for the late reply. What do you mean by "precise control"? What is your use case for which -p is not enough?

kylincaster commented 1 year ago

Hi @justanhduc, I mean pausing or killing a process through ts. Not only the process itself but also all of its subprocesses should be handled, so recursive code is necessary to find the PIDs of all subprocesses.
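
A minimal sketch of such a recursive lookup, assuming pgrep from procps is available. The function name descendants and the toy process tree are made up for illustration and are not part of ts.

```shell
#!/bin/bash
# Print every descendant PID of $1, depth first, one per line.
descendants() {
    local child
    for child in $(pgrep -P "$1"); do
        echo "$child"
        descendants "$child"
    done
}

# Toy tree for the demo: outer bash -> inner bash -> sleep.
bash -c 'bash -c "sleep 30 & wait" & wait' &
root=$!
sleep 0.5   # give the tree a moment to start

pids=$(descendants "$root")
echo "descendants of $root:" $pids

# Clean up the demo processes.
kill $pids "$root" 2>/dev/null
```

For a ts job, the root PID would be the one printed by ts -p for that job.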

justanhduc commented 1 year ago

Hi @kylincaster. To kill or pause a process and its children, can we just simply send the signal to the whole process group like the memo here? Or is there anything I missed?
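
As a rough illustration of the process-group approach, the demo below stops a small process tree with a single kill. It assumes setsid (util-linux) and ps/pgrep (procps) are available, and that every child stays in the leader's process group. The toy tree is made up for the demo; for a real job the group ID would have to be looked up first.

```shell
#!/bin/bash
# Demo: stop a whole process tree by signaling its process group.
pidfile=$(mktemp)

# setsid starts the tree in a new session, so the inner bash leads a new
# process group whose PGID equals its PID. It records that PID in $pidfile.
setsid bash -c 'echo $$ > "$0"; sleep 30 & sleep 30 & wait' "$pidfile" &

# Wait (up to about 5 s) for the leader to write its PID.
i=0
while [ ! -s "$pidfile" ] && [ "$i" -lt 50 ]; do
    sleep 0.1
    i=$((i + 1))
done
pgid=$(cat "$pidfile")
sleep 0.5   # let the leader fork its two children

# A negative PID addresses every process in that group.
kill -s STOP -- "-$pgid"
sleep 0.5

# Stopped processes report a state beginning with T.
states=$(ps -o stat= -p "$(pgrep -d, -g "$pgid")")
echo "$states"

# Clean up the demo processes.
kill -s KILL -- "-$pgid"
rm -f "$pidfile"
```

This only reaches processes that remain in the group; a launcher that moves its children into a different group or session would not be covered.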

kylincaster commented 1 year ago

Hi @justanhduc, I did try signaling the process directly. Unfortunately, the stop signal does not work for tasks with subprocesses. The following example script cannot be paused with the kill -stop -- -XXX command:

#!/bin/bash
# Append a timestamped line to log.txt once per second.

for i in {2..1000}
do
    dt=$(date)
    echo "output: ${dt} $i" >> log.txt
    sleep 1
done

When it is run with ts as ts mpirun -np 1 loop.sh, only the parent process mpirun is paused, not the bash subprocess.

justanhduc commented 1 year ago

Hey @kylincaster. According to the documentation of mpirun 2.1.1 on Ubuntu 18.04, mpirun only propagates a selected set of signals. When dealing with a program like mpirun, imo, ts has no authority to manipulate the subprocesses it creates because, well, that would defeat the purpose of such a program.

As for your specific problem, be sure to check your Ubuntu version and mpirun version. On 18.04 with mpirun 2.1.1, I can successfully stop/continue with the following commands:

ts mpirun --mca orte_forward_job_control 1 -np 1 toy.sh
kill -20 $(ts -p <jobid>)  # stop the MPI job (20 is SIGTSTP on x86 Linux). Note that SIGSTOP does not work per the documentation
kill -18 $(ts -p <jobid>)  # continue (18 is SIGCONT on x86 Linux)

PS: Our discussion about sending signals seems to be outside the scope of this issue, so if you still have any problems, it's better to open another ticket and we can continue there.

kylincaster commented 1 year ago

Thanks to @justanhduc for the comments on the behavior of mpirun. Unfortunately, it depends on the MPI implementation. Intel MPI processes do not forward such signals. So my solution to this problem is the following bash code, which is called from within task-spooler.

#!/bin/bash
# Print, or send a signal to, a process and all of its descendants.
# Usage: <script> PID [SIGNAL]

# Print the direct children of a comma-separated list of parent PIDs.
get_child() {
    pgrep -P "$1" | xargs
}

# Walk the tree level by level; print every descendant, then the root PID.
get_children() {
    local ret children=
    ret=$(get_child "$1")
    while [ -n "$ret" ]; do
        children+="$ret "
        # pgrep -P expects parent PIDs separated by commas
        ret=$(get_child "${ret// /,}")
    done

    echo "${children}$1"
}

if [ "$#" -lt 1 ]; then
    echo "no input PID"
    exit 1
fi

owner=$(ps -o user= -p "$1")
if [ -z "$owner" ]; then
    # not a valid PID
    exit 1
fi
pids=$(get_children "$1")

user=$(whoami)

# Use sudo when the target process belongs to another user.
extra=""
if [[ "$owner" != "$user" ]]; then
    extra="sudo"
fi

for pid in ${pids}; do
    if [ -z "$2" ]; then
        # No signal given: only list what would be signaled.
        echo "${extra} ${pid}"
    else
        ${extra} kill -s "$2" "${pid}"
    fi
done

sadikyalcin commented 1 year ago

@wolfram77 Can you share details on how you set up the shared queue with $TS_SOCKET? I've defined a socket but still can't see anything from other users. @justanhduc, would you be able to help with this?

wolfram77 commented 1 year ago

I set TS_SOCKET=/tmp/ts.socket in /etc/environment and chmod 777 "$TS_SOCKET".
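
Put together as a script, the setup might look roughly like the sketch below. share_socket is a hypothetical helper, and the commented usage assumes that the first ts call starts the server and creates the socket. Mode 777 lets every local user control the whole queue, as noted earlier in the thread.

```shell
#!/bin/bash
# Hypothetical helper: make an existing socket file accessible to all users.
share_socket() {
    local sock=$1
    if [ ! -e "$sock" ]; then
        echo "no socket at $sock yet; start the server first" >&2
        return 1
    fi
    chmod 777 "$sock"
}

# Usage (not run here):
#   export TS_SOCKET=/tmp/ts.socket   # every user needs the same value
#   ts                                # assumed to start the server on first use
#   share_socket "$TS_SOCKET"         # repeat after ts -K, which removes the socket
```

Scripts started outside a login session (for example by a web server or cron) may not see variables from /etc/environment, so exporting TS_SOCKET inside the script itself is the safer choice.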

sadikyalcin commented 1 year ago

Thanks. I was calling tsp via a bash script, and it turns out the environment variable wasn't exposed to my bash script by default.

What about your logs though? I've got the shared queue working but still can't access logs from tasks queued from other users.

wolfram77 commented 1 year ago

Is it tsp? I am able to see the tasks queued by other users with ts or ts -l. I store the program output with a pipe like stdbuf --output=L ts -nf -N 32 ./a.out | tee -a "a.log" from a script. Are you interested in the program output of other users?

sadikyalcin commented 1 year ago

I run a Node process, which can take {x} long and prints progress and results. For example, the job below finished with an error and was run by the web server, but /tmp/ts-out.1LkaYj doesn't exist for me. I run Apache and ssh into the server as the same user (ubuntu).

52 finished /tmp/ts-out.1LkaYj 1 84.95/1.43/0.16 {my_command}

wolfram77 commented 1 year ago

Could you try redirecting both stdout and stderr to a file? If that does not work for you, @justanhduc may be able to help you.

justanhduc commented 1 year ago

Hi @sadikyalcin @wolfram77. First of all, tsp is the original version, not the one in this fork. Please uninstall it using apt and install the one here using make cpu. If the same problem happens, could you verify that you have write permission in /tmp? Also, why is the ts.socket file not in /tmp?
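
A quick way to run both checks could look like this sketch. can_write_dir is a hypothetical helper, and the directory falls back to /tmp when TMPDIR is unset.

```shell
#!/bin/bash
# Hypothetical helper: report whether the current user can create files in a directory.
can_write_dir() {
    if [ -d "$1" ] && [ -w "$1" ]; then
        echo "writable"
    else
        echo "not writable"
    fi
}

echo "${TMPDIR:-/tmp}: $(can_write_dir "${TMPDIR:-/tmp}")"

# Show which spooler binaries are on PATH. The Debian/Ubuntu package installs
# the original version as tsp.
for bin in ts tsp; do
    command -v "$bin" || echo "$bin: not found"
done
```

The check has to be run as the user that actually queues the job (for example the web server's user), since that is the process that needs the write permission.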

justanhduc commented 1 year ago

Also, if you want a proper multi-user task spooler, the fork of @kylincaster is probably a better choice.

kylincaster commented 1 year ago

Dear all,

If anyone is looking for a multi-queue task manager, you are welcome to try my fork at kylincaster/task-spooler-PLUS. It has been enhanced with numerous useful features, including multiple user support, fatal crash recovery, and processor allocation and binding.

Best regards, Kylin