norment / tsd_issues

Repo to track issues with TSD as tickets

Too old version of singularity on Colossus #22

Closed. Majolund closed this issue 3 years ago.

Majolund commented 4 years ago

It would be great if it were possible to run Singularity directly in a terminal window, as that would make it easier to run tests or use it for small jobs.

This was sent as a ticket to TSD in October 2019. The response from TSD was that the problem is due to the tl nodes running an old OS and kernel, and that the submit node should be used instead to avoid it. However, that did not work either, and I never heard back after sending the error messages, so I am not sure what the status is or whether the ticket has been closed.

This is a minor issue, but it would be good to solve.

ofrei commented 4 years ago

@Majolund Wow, I didn't know that was the case. To me it sounds like a real major issue, as it's difficult to develop & debug without running things interactively. It's really just my guess, but I would imagine this was one of the reasons to enable the modules environment on the login nodes, so that we could interactively develop our scripts locally before submitting large-scale jobs to the cluster.

scimerc commented 4 years ago

I think it may be a question of which VM you use. It worked for me on tl02.

ofrei commented 4 years ago

A bit of new info on this. The issue is related to the fact that the /cluster filesystem is exported with 'nosuid' set, while Singularity depends on its setuid binary.

https://singularity.lbl.gov/admin-guide

From the admin guide: "To ensure proper support, it is easiest to make sure you install Singularity to a local file system."
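As a quick sanity check (a minimal sketch, assuming findmnt is available on the node and that /cluster is the mount point in question), you can inspect the mount options to see whether 'nosuid' is set:

```bash
# Show the mount options for the filesystem backing /cluster.
# If "nosuid" appears, setuid binaries (including Singularity's) will not work there.
findmnt -T /cluster -o TARGET,FSTYPE,OPTIONS

# Alternative: grep the mount table directly.
mount | grep ' /cluster '
```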

For the app nodes, the recommendation is to install Singularity locally from EPEL. This way it should be ok to run Singularity containers on the app nodes.

Out of curiosity, I think we should try to copy the singularity module into a local folder on the p33-submit VM (if it has a local folder - not sure). In principle this should avoid the 'nosuid' issue. But I don't know whether singularity runs on VMs at all. I know that virtual machines within virtual machines don't work - e.g. you can't enable Hyper-V within a Windows Server that itself runs on Hyper-V - but I'm not sure whether the same applies to singularity.

Majolund commented 4 years ago

Thanks for the update and info! Looking forward to when the app nodes are available - they'll be a great resource. That's a good idea; running Singularity on a Mac is also done via a VM, so I think it's worth a try. Is Sabry the go-to person to ask when it comes to possible local folders?

ofrei commented 4 years ago

@Majolund Could you first try Singularity on our tl02 VM, as suggested by @scimerc? If you need help getting Singularity to work on p697-submit or on our RHEL login node, then yes, please ask Sabry.

Majolund commented 4 years ago

Yes, it seems to work on the tl02 VM - thanks for the tip, @scimerc!

ofrei commented 4 years ago

@Majolund Could you add some details, e.g. how is it that the Singularity software is available on tl02? I think tl02 doesn't have access to the cluster, and singularity is a module on the cluster. Does tl02 have its own singularity installed? I'm a bit confused here. Are there links to relevant TSD documentation?

Majolund commented 4 years ago

It seems like tl02 has its own singularity installed, and it may also be available through /cluster/etc/modulefiles. I haven't found any documentation for this so far.
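A minimal sketch of how one could check this (assuming the usual environment-modules tooling is present on the VM):

```bash
# Is a locally installed binary on the PATH, and which version is it?
which singularity
singularity --version

# Is a singularity module also visible through the module system?
# (module tools typically print to stderr, hence the redirect)
module avail singularity 2>&1 | grep -i singularity
```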

scimerc commented 4 years ago

tl02 runs version 3.5.3-1.1.el6. Modules aren't currently accessible, but I remember Adriano was complaining about older versions of singularity a while back.

ofrei commented 4 years ago

There are a few issues with Singularity on TSD, but I found it to be usable.

Issue 1: this happens on Colossus in a SLURM job where I "module load singularity/3.5.2" (error shown in the attached screenshots). If instead I use "module load singularity/2.6.1", then everything works fine.

Issue 2: there is a local installation of singularity in /usr/bin on our p33-submit machine, and it works well. However, out of curiosity, I've tried to "module load singularity/3.5.2" and run the same command as in (1) - this fails. If I use "module load singularity/2.6.1", it fails with another error (see attached screenshot).

Two tips to avoid these issues (a sketch of both patterns follows below):

  1. On Colossus (in your SLURM jobs), make sure to load singularity/2.6.1.
  2. On p33-submit[2,3] and p33-appn-norment01, make sure to "module purge" and use the locally installed singularity from /usr/bin.
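A minimal sketch of the two patterns (the image path, image name and tool command are hypothetical placeholders):

```bash
#!/bin/bash
# --- Pattern 1: on Colossus, inside a SLURM job script ---
#SBATCH --job-name=singularity-test
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=1G

module purge
module load singularity/2.6.1   # the version that worked in SLURM jobs at the time

# /cluster/projects/p33/containers/mytool.simg is a hypothetical image path.
singularity exec /cluster/projects/p33/containers/mytool.simg mytool --help
```

```bash
# --- Pattern 2: on p33-submit[2,3] or p33-appn-norment01 ---
module purge                    # make sure no singularity module shadows the local install
/usr/bin/singularity exec mytool.simg mytool --help
```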
ofrei commented 4 years ago

We've looked further into Singularity containers on TSD with Adriano and discovered that /tsd/p33/data/home is by default mounted within the container as the home folder. This is not TSD-specific, but rather the default behavior of Singularity - the home folder is, by default, mounted into the container. This may lead to weird behavior, e.g. software within your home folder may interfere with software installed within the Singularity image. For this reason I do recommend running Singularity with the --no-home argument.
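For example (a minimal sketch; the image name and command are hypothetical placeholders, while --no-home is a standard Singularity flag):

```bash
# Run without mounting the host home directory into the container,
# so software under /tsd/p33/data/home cannot shadow what is inside the image.
singularity exec --no-home mytool.simg mytool --help
```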

Majolund commented 4 years ago

Great, thanks for the information and looking into this!

ofrei commented 4 years ago

https://github.com/norment/tsd_issues/issues/47 has some further info about singularity containers that attempt to write data locally. By default this doesn't work - the singularity container is read-only. If the software writes anything (even a log file that is not important to you), it may still crash with an error saying something like "read-only file system".
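A common workaround (a sketch; the host path, image name and tool options are hypothetical) is to bind-mount a writable host directory into the container and point the software's output there:

```bash
# Bind a writable host directory into the container; anything the tool writes
# to /output inside the container actually lands in the host directory.
mkdir -p /tsd/p33/data/durable/scratch/myrun   # hypothetical writable host path
singularity exec --no-home \
    --bind /tsd/p33/data/durable/scratch/myrun:/output \
    mytool.simg mytool --out /output/results.log
```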

ofrei commented 4 years ago

We need Singularity 3.5 or 3.6 on Colossus. @Sandeek Could you please follow up with the TSD team about this issue?

Sandeek commented 4 years ago

yes @ofrei noted

ofrei commented 4 years ago

@Sandeek Thank you! I've submitted RT ticket 4147986 to track this request.

ofrei commented 4 years ago

singularity 3.6.4 will be installed directly on all Colossus nodes.

ofrei commented 3 years ago

This is solved - singularity 3.7 is installed on all app nodes & available as a module for SLURM jobs.