gaow opened 1 year ago

Currently, when I run `sos status` for a task on the cluster, I can find the queue it ran under, but I don't know which node it was submitted to. This is sometimes quite important: when I want to reproduce an erroneous behavior, I would like to send the task to the exact node on which it failed initially. (This was in fact the case for one of our recent mysterious failures, which took us a while to pin down to a hardware malfunction on a particular node.) Is there a way to retrieve this information from SoS task signatures?
There is no built-in support for this, since `sos status` depends on information that task instances save to `.task` files, and no information about the computing node is saved there. However, this could potentially be addressed by something like

```
task:
sh:
  echo $HOSTNAME
  ....
```

if the computing nodes have this environment variable.
I think at a high level the question is how to track which computing node a task was executed on, without making users worry about it. On Slurm, for example,

```
squeue -t R --format="%.20i %.20j %.5t %.5C %R"
```

shows the nodes of the running jobs.
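As a rough sketch of how this could be queried programmatically (not something SoS does today; it assumes the Slurm job ID of the task is known and that `squeue` is on the PATH):

```python
import subprocess

def node_of_job(job_id: str) -> str:
    """Return the node(s) a running Slurm job is on, e.g. 'node042'.

    Only works while the job is still in the queue; a finished job would
    need `sacct -j <id> --format=NodeList` instead.
    """
    result = subprocess.run(
        ["squeue", "-j", job_id, "-h", "-o", "%N"],  # -h: no header, %N: node list
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```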
In SoS, having the HOSTNAME information in the status output is what I can think of. It should be robust enough to implement as

```python
import platform
platform.node()
```

but when you said

> and no information about computing node is saved.

do you mean it cannot be saved, or that it is simply not saved but possible to save? If we don't have the information available when the task instance is written, one idea is to build this into the yml file for the remote host configuration, although other approaches would be equally fine as long as the node can be traced easily.
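As a purely hypothetical sketch of the "save it when the task instance is written" idea (the `.host` file name and location are my own assumptions, not anything SoS currently writes):

```python
import platform
from pathlib import Path

def record_host(task_id: str, task_dir: Path = Path.home() / ".sos" / "tasks") -> str:
    """Record the executing node's hostname next to the task files (hypothetical)."""
    host = platform.node()
    task_dir.mkdir(parents=True, exist_ok=True)
    (task_dir / f"{task_id}.host").write_text(host + "\n")
    return host
```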
The `echo` command, or anything that works for Slurm, could be inserted into the task template before the `sos execute` line. The output should go to `${task_id}.out` (defined in the task script with options such as `#PBS -o`).
Thank you @BoPeng. I'm fine with closing this ticket if there is nothing we can do at the time the task instance is saved.
In my template defined in `hosts.yml`, I have things like

```
#PBS -l
#PBS -v
#PBS -o /home/{user_name}/.sos/tasks/{task}.out
module load {' '.join(modules)}
{command}
```

so `module load` is something added by the template that is executed before `sos execute` (the `{command}` part).
If you add the `echo` line there, something like

```
#PBS -l
#PBS -v
#PBS -o /home/{user_name}/.sos/tasks/{task}.out
echo $HOSTNAME
{command}
```

the hostname of the computing node would be written to `/home/{user_name}/.sos/tasks/{task}.out`, which will be available in the output section of `sos status -f`.
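(Assuming the `echo` line is the first thing the template runs, the node name is simply the first line of that `.out` file, and a hypothetical helper could pull it back out:)

```python
from pathlib import Path

def task_host(task_id: str, user_name: str) -> str:
    """Read the hostname echoed by the job template from the task's .out file."""
    out_file = Path(f"/home/{user_name}/.sos/tasks/{task_id}.out")
    return out_file.read_text().splitlines()[0].strip()
```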
Is this what you were looking for?
Oh yes, sorry, I should have mentioned that I thought of this as well and have just put it into my template. As you can see, my template saves these files to the current directory where the `sos run` command is executed. Perhaps a better practice is to write them to `~/.sos`? Usually files under `~/.sos` are forgotten and the folder keeps growing, which is why I would like this information consolidated into the `sos status` command, so I don't have to manage the `.err` and `.out` files under `~/.sos` myself. Otherwise I would rather keep those files under the current directory -- but that is a separate topic. I understand that the only way to systematically collect the node information is through the job template.
The `.out` and `.err` files are "absorbed" into the `.task` files when the tasks are completed (and the `.task` files will be removed after a few days, or with `sos purge`). If you really want to write such information somewhere else, something like

```
echo $HOSTNAME > $HOME/{task}.host.id
```

in the template could do.
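(And a throwaway way to tabulate such hypothetical `*.host.id` files afterwards:)

```python
from pathlib import Path

# Print each task ID with the node recorded by the template's echo line.
for f in sorted(Path.home().glob("*.host.id")):
    task_id = f.name[: -len(".host.id")]
    print(task_id, f.read_text().strip())
```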