vatlab / sos

SoS workflow system for daily data analysis
http://vatlab.github.io/sos-docs
BSD 3-Clause "New" or "Revised" License

sos status should show the uname of the machine #1514

Open gaow opened 1 year ago

gaow commented 1 year ago

Currently, when I run sos status for a task on the cluster, I can find the queue it runs under. However, I don't know the node it was submitted to. This is sometimes quite important: when I want to reproduce erroneous behavior, I would like to send the task to the exact node on which it failed initially. (This was in fact the case for one of our recent mysterious failures, which took us a while to pin down to a hardware malfunction on a particular node.) Is there a way to retrieve this information from SoS task signatures?

BoPeng commented 1 year ago

There is no built-in support for this, since sos status depends on information saved by task instances to .task files, and no information about the computing node is saved.

However, this could potentially be fixed by something like

task:
sh:
    echo $HOSTNAME 
    ....

if the computing nodes have this environment variable.
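
For context, a minimal sketch of a complete step written this way (the step name analysis and the echoed message are only illustrative):

[analysis]
task:
sh:
    # print the node name into the task's stdout so it is captured alongside the task
    echo "task running on $HOSTNAME"
    # ... the actual analysis commands go here ...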

gaow commented 1 year ago

I think, at a high level, the question is how to track which computing node a task was executed on, without making users worry about it. On Slurm, for example,

squeue -t R --format="%.20i %.20j %.5t %.5C %R"

shows the list of nodes for the running jobs.

In SoS, having the HOSTNAME information in the status output is what I can think of. It should be robust enough to implement with something like

import platform
platform.node()

but when you said

and no information about the computing node is saved.

does that mean it couldn't be saved, or just that it is not saved now but could be?

If we don't have the information available when the task instance is written, then one idea is to build this into the YAML file for the remote host configuration, although other approaches would be equally fine as long as the node can be traced easily.

BoPeng commented 1 year ago

The echo command, or anything else that works for Slurm, could be inserted into the task template before the sos execute line. The output would go to ${task_id}.out (defined in the task script with options such as #PBS -o).
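
For a Slurm queue, a minimal sketch of such a template in ~/.sos/hosts.yml could look like the following (the host name my_cluster is hypothetical, and the exact keys, resource options, and #SBATCH directives depend on your cluster and SoS host configuration):

hosts:
    my_cluster:
        queue_type: pbs
        submit_cmd: sbatch {job_file}
        status_cmd: squeue --job {job_id}
        kill_cmd: scancel {job_id}
        task_template: |
            #!/bin/bash
            #SBATCH --job-name={task}
            #SBATCH --output=/home/{user_name}/.sos/tasks/{task}.out
            #SBATCH --error=/home/{user_name}/.sos/tasks/{task}.err
            # record the compute node right before the task starts
            echo "task {task} running on $HOSTNAME"
            {command}

Because the echo runs on the allocated compute node just before {command} (the sos execute call), the node name ends up in the task's .out file.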

gaow commented 1 year ago

Thank you @BoPeng. I'm fine with closing this ticket if there is nothing we can do at the time the task instance is saved.

BoPeng commented 1 year ago

In my template defined in hosts.yml, I have things like

#PBS -l
#PBS -v
#PBS -o /home/{user_name}/.sos/tasks/{task}.out

module load {' '.join(modules)}
{command}

so module load is something added by the template and executed before sos execute (the {command} part).

If you add the echo line there, something like

#PBS -l
#PBS -v
#PBS -o /home/{user_name}/.sos/tasks/{task}.out

echo $HOSTNAME
{command}

then the hostname of the computing node would be written to /home/{user_name}/.sos/tasks/{task}.out, which would then be available in the output section of sos status -f. Is this what you were looking for?

gaow commented 1 year ago

Oh yes, sorry, I should have mentioned that I thought of this as well and have just put it into my template. As you can see, my template saves these files to the current directory where the sos run command is executed. Perhaps a better practice is to write them to ~/.sos? Files under ~/.sos are usually forgotten and the folder keeps growing, which is why I would like to have this consolidated into the sos status command under ~/.sos, so I don't have to manage these .err and .out files myself. Otherwise I would rather keep those files under the current directory -- but that is a separate topic. I get that the only way to systematically collect the node information is through the job template.

BoPeng commented 1 year ago

The .out and .err files are "absorbed" into the .task files when the tasks are completed (and the .task files will be removed after a few days or with sos purge). If you really want to write such information somewhere else, something like echo $HOSTNAME > $HOME/{task}.host.id would do.
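
For example, following the line above (the .host.id suffix is just a convention from that example, and <task_id> is a placeholder for an actual task ID):

# inside the task template, before {command}: record the node for this task
echo $HOSTNAME > $HOME/{task}.host.id

# later, from a login node, look up where a given task ran
cat $HOME/<task_id>.host.id

This keeps one small file per task that is independent of the .out/.err absorption described above.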