radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html
Other
54 stars 23 forks source link

relax env isolation to fix RAPTOR #2988

Closed andre-merzky closed 7 months ago

andre-merzky commented 1 year ago

specifically, LD_LIBRARY_PATH needs to be merged, not overwritten. see #2980

mturilli commented 1 year ago

This is meant to be a general purpose solution for the special case addressed on Summit.

andre-merzky commented 1 year ago

I looked into this a bit further. I think I see a solution but want to discuss.

The purpose of the env isolation was to allow one set of modules to be loaded for the launch method (env_lm), and another set for the task executable (env_task). Consider for example that the LM is compiled with the vendor C compiler, and the application needs gcc. The current approach solves that problem consistently and isolates the LM and task env.

The problem this issue is about is that it does that a bit too well: some LMs add settings to the environment after they have been started resulting in a changed env_lm (let's call it env_lm2), and those env settings are important to the application - for example to correctly initialize MPI.

I think the solution is the following: we should compare env_lm and env_lm2. The diff we find (either variables were added, variables were removed, or variables were changed) need then to be transplanted into the lm_task.

Added or removed variables are easy - changed variables are a bit tricky, but we should be able to handle all path-like variables (PATH, LD_LIBRARY_PATH etc.) which all have the format value_1:value_2:.... It is still a bit tricky as we need to figure out if the changes should be pre-pended or appended to the env, but a heuristic should solve that in most cases.

So the proposal is to elevate the env_diff method in radical utils (which at the moment only serves debugging purposes) and add an env_patch method which takes a diff and applies it to a target env, in our case env_task. We need to make this available on shell level (in radical-utils-env.sh which is a bit cumbersome but doable.

mtitov commented 1 year ago

For consistency I would call the env set by LM as env_lm_task (not env_lm2), since it is env which is set by LM for the provided executable, and which at the same level as our provided env env_task

Speaking about the difference between env variables set by LM and by RP, I would propose that we would extend certain env variables - for example, env_task would contain export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:<paths_provided_by_RP>.

ru.env_prep(..., preserved_env=['LD_LIBRARY_PATH'])

Probably we can pick certain env variables, or even just LD_LIBRARY_PATH for now

rcarson3 commented 1 year ago

@mtitov I think I'm running into an issue related to this for one of my studies on Summit for the ExaAM project (related to a UQ study I'm having a grad student do). I was using the develop version of the radical cybertools to try and run things, and I kept running into issues of a missing libgfortran.so. I'm guessing it might be due to the fact I load up an older module in for GCC required for my executable as part of the pre-exec portion of the task, but radical later on modules and appends things to the LD_LIBRARY_PATH which I'm guessing is what's causing the issues.

I've reverted to v1.34.0 of saga, pilot, and entk which seems to have fixed the issues I was running into.

I will say I haven't seen if this might be an issue on Frontier just as I have my student running on Summit for their studies.

mtitov commented 1 year ago

@rcarson3 hi Robert, yeah, that fix is a temporary one (and it is for GCC 9.1.0, since that what we set by default)

https://github.com/radical-cybertools/radical.pilot/pull/2992/files#diff-56e7a2035c2b5030b5b9137091ce6dca815aecc5fa143e61ba0ab506b24cbfcd

thus to avoid LD_LIBRARY_PATH prepend, you can do the following (since it is part of the platform/resource config and you can overwrite it - not setting anything or bring in your setup)

mkdir -p ~/.radical/pilot/configs
cat > ~/.radical/pilot/configs/resource_ornl.json <<EOF
{
    "summit": {
        "task_pre_exec" : []
    },
    "summit_jsrun": {
        "task_pre_exec" : []
    }
}
EOF

(*) Platform ID summit uses MPIRUN as a launch method, and summit_jsrun uses JSRUN

Overall that is a manual fix, that's why it depends on GCC version, general approach (which this ticket targets), will not be version dependent. With Frontier we didn't encounter necessity to change LD_LIBRARY_PATH

rcarson3 commented 1 year ago

Thanks and I'll take at your suggestion when I get the chance to try it out.

mturilli commented 7 months ago

Closing because of lack of use case. Workaround for Summit works.