Closed andre-merzky closed 7 months ago
This is meant to be a general purpose solution for the special case addressed on Summit.
I looked into this a bit further. I think I see a solution but want to discuss.
The purpose of the env isolation was to allow one set of modules to be loaded for the launch method (env_lm
), and another set for the task executable (env_task
). Consider for example that the LM is compiled with the vendor C compiler, and the application needs gcc
. The current approach solves that problem consistently and isolates the LM and task env.
The problem this issue is about is that it does that a bit too well: some LMs add settings to the environment after they have been started resulting in a changed env_lm
(let's call it env_lm2
), and those env settings are important to the application - for example to correctly initialize MPI.
I think the solution is the following: we should compare env_lm
and env_lm2
. The diff we find (either variables were added, variables were removed, or variables were changed) need then to be transplanted into the lm_task
.
Added or removed variables are easy - changed variables are a bit tricky, but we should be able to handle all path-like variables (PATH
, LD_LIBRARY_PATH
etc.) which all have the format value_1:value_2:...
. It is still a bit tricky as we need to figure out if the changes should be pre-pended or appended to the env, but a heuristic should solve that in most cases.
So the proposal is to elevate the env_diff
method in radical utils (which at the moment only serves debugging purposes) and add an env_patch
method which takes a diff and applies it to a target env, in our case env_task
. We need to make this available on shell level (in radical-utils-env.sh
which is a bit cumbersome but doable.
For consistency I would call the env set by LM as env_lm_task
(not env_lm2
), since it is env which is set by LM for the provided executable, and which at the same level as our provided env env_task
Speaking about the difference between env variables set by LM and by RP, I would propose that we would extend certain env variables - for example, env_task
would contain export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:<paths_provided_by_RP>
.
ru.env_prep(..., preserved_env=['LD_LIBRARY_PATH'])
Probably we can pick certain env variables, or even just LD_LIBRARY_PATH
for now
@mtitov I think I'm running into an issue related to this for one of my studies on Summit for the ExaAM project (related to a UQ study I'm having a grad student do). I was using the develop version of the radical cybertools to try and run things, and I kept running into issues of a missing libgfortran.so
. I'm guessing it might be due to the fact I load up an older module in for GCC required for my executable as part of the pre-exec
portion of the task, but radical later on modules and appends things to the LD_LIBRARY_PATH
which I'm guessing is what's causing the issues.
I've reverted to v1.34.0
of saga, pilot, and entk
which seems to have fixed the issues I was running into.
I will say I haven't seen if this might be an issue on Frontier just as I have my student running on Summit for their studies.
@rcarson3 hi Robert, yeah, that fix is a temporary one (and it is for GCC 9.1.0, since that what we set by default)
thus to avoid LD_LIBRARY_PATH
prepend, you can do the following (since it is part of the platform/resource config and you can overwrite it - not setting anything or bring in your setup)
mkdir -p ~/.radical/pilot/configs
cat > ~/.radical/pilot/configs/resource_ornl.json <<EOF
{
"summit": {
"task_pre_exec" : []
},
"summit_jsrun": {
"task_pre_exec" : []
}
}
EOF
(*) Platform ID summit
uses MPIRUN
as a launch method, and summit_jsrun
uses JSRUN
Overall that is a manual fix, that's why it depends on GCC version, general approach (which this ticket targets), will not be version dependent. With Frontier we didn't encounter necessity to change LD_LIBRARY_PATH
Thanks and I'll take at your suggestion when I get the chance to try it out.
Closing because of lack of use case. Workaround for Summit works.
specifically,
LD_LIBRARY_PATH
needs to be merged, not overwritten. see #2980