Open · pietromarchesi opened this issue 6 years ago
I made a quick fix by changing line 147 of `slurm.py` from `self.log_slurm_info(stderr)` to `self.log_slurm_info(str(stderr))`. It works, but I have no idea whether it will break in other circumstances.
Oops. Hmm, I wonder if that is something where Python 3 makes a more proper distinction between strings and byte streams, or something. Could it be that the stderr in this case contains some extra (non-string) characters, like for colors? Anyhow, it seems to me that your fix should be safe.
That said, my job still runs on the login node and apparently does not reach the queue; if you know why that may be the case, let me know.
Do you have access to the `salloc` command on your login node? Do you see what exact shell command your script is trying to run (via the logs or so)? It might be useful to see what results you get if you run this command manually in bash, with and without `salloc [relevant slurm parameters]` prepended.
So, regarding the bytes/string issue, I think you are right, it's a Python 3 thing. In particular, I found this answer, which says:

> Reading stdout and stdin from subprocess changed in Python 3 from `str` to `bytes`. This is because Python can't be sure which encoding it uses. It probably uses the same as `sys.stdin.encoding` (the encoding of your system), but it can't be sure.
I have been looking at what's the best way to convert to string: some people suggest `.decode('utf-8')` (like in the quoted answer), others simply `.decode()` or `str()` without specifying the encoding. If you have any thoughts, let me know. I'll look a bit more into it and then I'll be happy to submit a PR.
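For what it's worth, the difference between the three options is easy to see in a small sketch (the invoked command below is just a placeholder that writes something to stderr):

```python
import subprocess

# Capture stderr from a subprocess; under Python 3 it comes back as bytes.
result = subprocess.run(
    ["python3", "-c", "import sys; sys.stderr.write('boom')"],
    stderr=subprocess.PIPE,
)
err = result.stderr
assert isinstance(err, bytes)

# str() does NOT decode -- it renders the repr, b'...' wrapper included:
print(str(err))                  # b'boom'

# .decode() assumes UTF-8 by default and returns a clean str:
print(err.decode())              # boom

# Being explicit (and tolerant of undecodable bytes) is arguably safest:
print(err.decode("utf-8", errors="replace"))
```

Note that `str(stderr)` keeps the `b'...'` wrapper in the logged text, so `.decode()` is probably the nicer fix. Alternatively, passing `encoding="utf-8"` (or `text=True`) to `subprocess.run`/`Popen` makes the pipes return `str` directly.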
Regarding the `salloc` part, I think it actually works fine, because when I look at the executed jobs using `sacct`, it shows that it ran on a compute node. It may be that the part where I'm messing up is when I run the `hostname` command, which, instead of giving me the name of the compute node, somehow returns the name of the login node (and writes `login1` to `bar.txt`). If I interactively log into a compute node, however, `hostname` does return the name of the compute node. I'm a bit confused by this.
> I have been looking at what's the best way to convert to string: some people suggest `.decode('utf-8')` (like in the quoted answer), others simply `.decode()` or `str()` without specifying the encoding. If you have any thoughts, let me know. I'll look a bit more into it and then I'll be happy to submit a PR.
Cool, TIA!
> Regarding the `salloc` part, I think it actually works fine, because when I look at the executed jobs using `sacct`, it shows that it ran on a compute node. It may be that the part where I'm messing up is when I run the `hostname` command, which, instead of giving me the name of the compute node, somehow returns the name of the login node (and writes `login1` to `bar.txt`). If I interactively log into a compute node, however, `hostname` does return the name of the compute node. I'm a bit confused by this.
Ah, yes, I think I see why: the shell expansion of the variable happens when the command is first issued, which is on the login node. SLURM will then take care of executing the command on a compute node (since it is prepended by `salloc`), but by that point the shell expansion will already have been done.
Not sure what the best way to fix that is... perhaps by putting your host lookup in a separate shell script. That should then only be executed once, on the compute node.
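A minimal sketch of that workaround (the script name and the `salloc` parameters are hypothetical):

```shell
# $(hostname) in a command like the one below is expanded by the
# *submitting* shell, i.e. on the login node, before SLURM sees it:
#   salloc -n 1 bash -c "echo $(hostname) > bar.txt"   # writes login1

# Putting the lookup in a script defers the expansion to wherever the
# script actually runs (the compute node, when prepended with salloc):
cat > print_host.sh <<'EOF'
#!/bin/bash
hostname > bar.txt
EOF
chmod +x print_host.sh

# Hypothetical submission; bar.txt now gets the compute node's name:
#   salloc -n 1 ./print_host.sh
```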
Thanks for the tip on the shell expansion. I was able to fix it as you suggested by putting the command in a bash script. I will write up a draft for a wiki page with the example.
PR submitted!
Hi,
Apologies for opening so many issues today. I have adapted the wiki example for SLURM, and have changed it such that it replaces `foo` with the hostname where the job is running. Code is available at this gist. I was testing it on a SLURM cluster, and got the error that the quick fix above (changing line 147 of `slurm.py`) works around.
Cheers,
Pietro