Open · pietromarchesi opened this issue 6 years ago
I made a quick fix by changing line 147 of `slurm.py` from `self.log_slurm_info(stderr)` to `self.log_slurm_info(str(stderr))`. It works, but I have no idea whether it will break in other circumstances.
Oops. Hmm, I wonder if that is something where Python 3 makes a more proper distinction between strings and byte streams, or something. Could it be that the stderr in this case contains some extra (non-string) characters, like for colors? Anyhow, it seems to me that your fix should be safe.
That said, my job still runs on the login node and apparently does not reach the queue; if you know why that may be the case, let me know.
Do you have access to the `salloc` command on your login node? Do you see what exact shell command your script is trying to run (via the logs or so)? It might be useful to see what results you get if you run this command manually in bash, with and without `salloc [relevant slurm parameters]` prepended.
So, regarding the bytes/string issue, I think you are right, it's a Python 3 thing. In particular, I found this answer, which says:

> Reading stdout and stdin from subprocess changed in Python 3 from `str` to `bytes`. This is because Python can't be sure which encoding it uses. It probably uses the same as `sys.stdin.encoding` (the encoding of your system), but it can't be sure.
I have been looking at what's the best way to convert to string: some people suggest `.decode('utf-8')` (like in the quoted answer), others simply `.decode()` or `str()` without specifying the encoding. If you have any thoughts, let me know. I'll look a bit more into it and then I'll be happy to submit a PR.
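For what it's worth, the difference between the three options is easy to see in a small sketch (the invoked command below is just a placeholder that writes something to stderr):

```python
import subprocess

# Capture stderr from a subprocess; under Python 3 it comes back as bytes.
result = subprocess.run(
    ["python3", "-c", "import sys; sys.stderr.write('boom')"],
    stderr=subprocess.PIPE,
)
err = result.stderr
assert isinstance(err, bytes)

# str() does NOT decode -- it renders the repr, b'...' wrapper included:
print(str(err))                  # b'boom'

# .decode() assumes UTF-8 by default and returns a clean str:
print(err.decode())              # boom

# Being explicit (and tolerant of undecodable bytes) is arguably safest:
print(err.decode("utf-8", errors="replace"))
```

Note that `str(stderr)` keeps the `b'...'` wrapper in the logged text, so `.decode()` is probably the nicer fix. Alternatively, passing `encoding="utf-8"` (or `text=True`) to `subprocess.run`/`Popen` makes the pipes return `str` directly.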
Regarding the `salloc` part, I think it actually works fine, because when I look at the executed jobs using `sacct`, it shows that it ran on a compute node. It may be that the part where I'm messing up is when I run the `hostname` command, which, instead of giving me the name of the compute node, somehow returns the name of the login node (and writes `login1` to `bar.txt`). If I interactively log into a compute node, however, `hostname` does return the name of the compute node. I'm a bit confused by this.
> I have been looking at what's the best way to convert to string: some people suggest `.decode('utf-8')` (like in the quoted answer), others simply `.decode()` or `str()` without specifying the encoding. If you have any thoughts, let me know. I'll look a bit more into it and then I'll be happy to submit a PR.
Cool, TIA!
> Regarding the `salloc` part, I think it actually works fine, because when I look at the executed jobs using `sacct`, it shows that it ran on a compute node. It may be that the part where I'm messing up is when I run the `hostname` command, which, instead of giving me the name of the compute node, somehow returns the name of the login node (and writes `login1` to `bar.txt`). If I interactively log into a compute node, however, `hostname` does return the name of the compute node. I'm a bit confused by this.
Ah, yes, I think I see why: the shell expansion of the variable happens when the command is first issued, which is on the login node. SLURM will then take care of executing the command on a compute node (since it is prepended by `salloc`), but by that point the shell expansion will already have been done.
Not sure what the best way to fix that is... perhaps by putting your host lookup in a separate shell script. That should then only be executed once, on the compute node.
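A minimal sketch of that workaround (the script name and the `salloc` parameters are hypothetical):

```shell
# $(hostname) in a command like the one below is expanded by the
# *submitting* shell, i.e. on the login node, before SLURM sees it:
#   salloc -n 1 bash -c "echo $(hostname) > bar.txt"   # writes login1

# Putting the lookup in a script defers the expansion to wherever the
# script actually runs (the compute node, when prepended with salloc):
cat > print_host.sh <<'EOF'
#!/bin/bash
hostname > bar.txt
EOF
chmod +x print_host.sh

# Hypothetical submission; bar.txt now gets the compute node's name:
#   salloc -n 1 ./print_host.sh
```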
Thanks for the tip on the shell expansion. I was able to fix it as you suggested by putting the command in a bash script. I will write up a draft for a wiki page with the example.
PR submitted!
Hi,
Apologies for opening so many issues today. I have adapted the wiki example for SLURM, and have changed it such that it replaces `foo` with the hostname where the job is running. Code is available at this gist. I was testing it on a SLURM cluster, and got the error that the quick fix above (changing line 147 of `slurm.py`) works around.
Cheers,
Pietro