multiscale / muscle3

The third major version of the MUltiScale Coupling Library and Environment
Apache License 2.0

Execution in login shell (or not) #201

Open LourensVeen opened 1 year ago

LourensVeen commented 1 year ago

Model components are currently executed in a login shell. This is nice, because it means the environment is the same as what you have on the command line, so there are fewer unexpected differences. On the other hand, there may be cases where different models require different things, and you want a clean environment to which you explicitly add modules and variables. Also, the first case may create unexpected conflicts, because the startup scripts read by a login shell may fail in the presence of environment variables that QCG-PJ injects from the environment in which the manager was started.

QCG-PJ currently seems to copy various bits of the environment from that in which it is running to the jobs it runs, but as I recall it doesn't do the same thing locally as on a cluster. It also runs jobs in a login shell, at least on a cluster, but not when running locally. MUSCLE3 currently wraps local runs in bash -l -c manually to at least make things consistent.

Both of the above cases actually seem reasonable, so the solution is probably to add another key to the implementations section of the yMMSL file that specifies whether we want a login shell or a normal one, and/or a clean environment or one passed through from the host environment.
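As a sketch of what that could look like in the yMMSL implementations section (the two new keys below are invented for illustration, not existing yMMSL settings):

```yaml
implementations:
  my_model:
    executable: /home/user/models/my_model
    # Hypothetical new keys, names made up for this sketch:
    shell: login            # 'login' or 'plain'
    environment: clean      # 'clean' or 'passthrough' from the host environment
```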

Thanks to @peter-t-fox for the report and discussion.

LourensVeen commented 1 year ago

Some additional complications have come up. In some cases, bash -l causes the program to be run in the home directory, which is definitely not what we want. Also, QCG-PJ version 0.14.0 now uses exec -l for some reason.

bash has the following ways of reading startup files when it's running non-interactively:

- plain bash -c '...' or bash script.sh (non-interactive, non-login): reads only the file named by $BASH_ENV, if that variable is set;
- bash -l -c '...' (non-interactive login shell): reads /etc/profile, then the first of ~/.bash_profile, ~/.bash_login and ~/.profile that exists;
- started by the remote shell daemon (rshd or sshd) with its standard input connected to a network connection: reads ~/.bashrc, if it exists.

Note that on my Ubuntu system, the default ~/.profile includes ~/.bashrc, so bash -l ends up reading that too.
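For reference, the relevant part of the stock Debian/Ubuntu ~/.profile looks roughly like this (exact wording may differ between releases):

```bash
# if running bash
if [ -n "$BASH_VERSION" ]; then
    # include .bashrc if it exists
    if [ -f "$HOME/.bashrc" ]; then
        . "$HOME/.bashrc"
    fi
fi
```

Note that the stock ~/.bashrc in turn starts with a check that makes it return immediately when the shell is not interactive, so how much of it actually takes effect under a non-interactive bash -l -c depends on what the user has put before that check.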

It's not clear to me whether there's a difference between bash -c and bash -l --noprofile -c.
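One way to check empirically is to run both from the same minimal environment and compare what comes out (a quick sketch):

```bash
# Compare the environments produced by the two invocations.
env -i HOME="$HOME" PATH=/usr/bin:/bin bash -c 'env | sort' > plain.txt
env -i HOME="$HOME" PATH=/usr/bin:/bin bash -l --noprofile -c 'env | sort' > login_noprofile.txt
diff plain.txt login_noprofile.txt

# One difference that definitely remains is the login_shell shell option:
bash -c 'shopt -q login_shell && echo login || echo not-login'
bash -l --noprofile -c 'shopt -q login_shell && echo login || echo not-login'
```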

LourensVeen commented 4 weeks ago

Just encountered another issue: if we don't run in a login shell, then we'll inherit the environment from the parent, which is the manager or the node agent. The manager has in turn inherited the environment from the shell that started it, but not any functions defined in it. Subshells do "inherit" functions, because they're forked subprocesses of the parent shell, but when starting a Python interpreter those are lost because Python doesn't know anything about shell functions and there's no mechanism to pass them.
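A quick demonstration of that difference (a sketch; f is just a throwaway function name):

```bash
f() { echo "defined in the parent shell"; }

( f )               # subshell: a forked copy of this shell, so f exists here
bash -c 'f'         # separate bash process: "f: command not found"

# A bash started from Python doesn't have it either, for the same reason:
python3 -c 'import subprocess; subprocess.run(["bash", "-c", "f"])'
```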

The problem is that the module command that we need to load modules is a shell function, and if we launch a non-login shell from Python then it won't read the usual startup files, so it won't source the Environment Modules or lmod initialisation script, and it won't have the module function.

So we need a login shell, but the problem with that is that it can also contain other stuff that we don't want (giant active banners, commands to move to a different directory, anything really).

It seems that lmod defines a few environment variables that we may be able to use. Sourcing ${LMOD_PKG}/init/bash should define the required functions. But if we're using lmod with Spack, then we also need to source ${SPACK_ROOT}/share/spack/setup-env.sh after that to make the Spack-built modules available.

We could add those commands to the run script if we detect that LMOD_PKG and SPACK_ROOT are set, but what if we're using environment modules, and what if we're using EasyBuild or Nix with lmod? We'd have to figure out all those situations and add support for them one by one. And test them too...
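For the lmod-plus-Spack combination specifically, the run script prologue could look something like this (just a sketch of that one case, based on the variables mentioned above):

```bash
# Make the 'module' function and Spack-built modules available in a non-login shell.
if [ -n "${LMOD_PKG:-}" ] && [ -f "${LMOD_PKG}/init/bash" ]; then
    source "${LMOD_PKG}/init/bash"                      # defines the module function
fi
if [ -n "${SPACK_ROOT:-}" ] && [ -f "${SPACK_ROOT}/share/spack/setup-env.sh" ]; then
    source "${SPACK_ROOT}/share/spack/setup-env.sh"     # adds Spack's module tree
fi
```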

Question: how does SLURM do this? It starts the job script, and inside the job script you can do module load just fine. But I write those with #!/bin/bash, not #!/bin/bash -l. So how does the module function get defined?

LourensVeen commented 3 weeks ago

https://lists.schedmd.com/pipermail/slurm-users/2021-January/006675.html

According to the bash man page, login shells read /etc/profile if it exists, then the first one of ~/.bash_profile, ~/.bash_login, and ~/.profile that exists. An interactive non-login shell reads /etc/bash.bashrc and then ~/.bashrc, if they exist.

Of course, we have no idea where the cluster administrators source the module environment script, so that doesn't help.

The link above says that the module function is an exported function, and will therefore be inherited by subshells. The bash man page says that "Functions may be exported so that subshells automatically have them defined with the -f option to the export builtin." The man page doesn't describe the mechanism, but bash seems to use BASH_FUNC_<name> environment variables to pass exported functions to child processes, and those should propagate through Slurm, but also through our Python stuff.
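That mechanism is easy to check (a sketch; the exact encoding of the variable name may vary between bash versions):

```bash
f() { echo "hello from an exported function"; }
export -f f
env | grep '^BASH_FUNC_f'       # shows something like BASH_FUNC_f%%=() {  echo ...

# The variable survives a trip through a Python process, so the bash that
# Python starts gets the function back:
python3 -c 'import subprocess; subprocess.run(["bash", "-c", "f"])'
```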

So then why was module undefined in my test?