radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

RAPTOR master fails on frontera #2606

Closed AymenFJA closed 2 years ago

AymenFJA commented 2 years ago

Using an existing virtualenv fails with the current pilot _prep_env setup:

        pilot.prepare_env(env_name='ve_raptor',
                          env_spec={'type'   : 'virtualenv',
                                    'version': '3.7',
                                    'path'   : '/xxxx/rpilot/aymen/ve/raptor/',
                                    'pre_exec': [],
                                    'setup'  : []})

Error: python3: error while loading shared libraries: libpython3.7m.so.1.0: cannot open shared object file: No such file or directory.

I tried the following workaround, which did not help:

    "pilot_env": {
        "path": "/xxx/rpilot/aymen/ve/raptor",
        "pre_exec": [
            "export LD_LIBRARY_PATH=/xxxx/rpilot/aymen/ve/raptor/lib:$LD_LIBRARY_PATH"
        ],
        "setup": [],
        "type": "virtualenv",
        "version": "3.7"
    },

Pilot sandbox is available on Frontera under this path: /home1/xxx/xxx/aa/sandbox/rp.session.c207-020.frontera.tacc.utexas.edu.rpilot.019117.0019/pilot.0000/
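
To confirm which shared libraries the virtualenv's interpreter cannot resolve, the output of `ldd` can be scanned for `not found` entries. The following is a minimal sketch, not part of RADICAL-Pilot: the helper names `missing_shared_libs` and `run_ldd` are hypothetical, and it assumes a Linux host where `ldd` is available.

```python
import subprocess


def missing_shared_libs(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like: "libfoo.so.1 => not found"
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def run_ldd(binary):
    """Run ldd on a binary and return its stdout (assumes ldd is installed)."""
    result = subprocess.run(["ldd", binary],
                            capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # Illustrative ldd output; addresses and paths are made up.
    sample = ("\tlibpython3.7m.so.1.0 => not found\n"
              "\tlibc.so.6 => /lib64/libc.so.6 (0x00007f0000000000)\n")
    print(missing_shared_libs(sample))  # prints ['libpython3.7m.so.1.0']
```

On the compute node, `missing_shared_libs(run_ldd(path_to_python))` would be called with the path to the virtualenv's `bin/python3`, in the same environment the agent uses.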

andre-merzky commented 2 years ago

The VE setup script does not run in the RP virtualenv; it executes in the clean job environment (see here, where env/bs0_pre_0.sh is re-activated). The reasoning is that we could not otherwise create virtualenvs that use different system modules, for example. So in your case you probably want to use the same pre_exec commands that the resource config uses for pre_bootstrap_0. Can you give this a try, please?

AymenFJA commented 2 years ago

@andre-merzky, assuming I did this right:

        pilot.prepare_env(env_name='ve_raptor',
                          env_spec={'type'    : 'virtualenv',
                                    'version' : '3.9',
                                    'pre_exec': ["module unload intel",
                                                 "module unload impi",
                                                 "module load   intel",
                                                 "module load   impi",
                                                 "module load   python3/3.9.2"],
                                    'path'    : '/xxxx/rpilot/aymen/ve/raptor/',
                                    'pre_exec': [],
                                    'setup'   : []})

Doing this, I ended up with the same error: python3: error while loading shared libraries: libpython3.7m.so.1.0: cannot open shared object file: No such file or directory.
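
One detail worth checking in the call above: `'pre_exec'` appears twice in the `env_spec` literal. Python does not reject a repeated key in a dict literal; the last value silently wins, so the module commands would be replaced by the empty list. Whether this contributed to the failure is not established in this thread. The sketch below demonstrates the behaviour and shows one way to detect it with the standard `ast` module; `duplicate_dict_keys` is a hypothetical helper, not an RP function.

```python
import ast
from collections import Counter


def duplicate_dict_keys(source):
    """Return constant keys that occur more than once in any dict literal."""
    dups = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Dict):
            # Only literal keys (strings, numbers) are considered here.
            keys = [k.value for k in node.keys if isinstance(k, ast.Constant)]
            dups += [key for key, count in Counter(keys).items() if count > 1]
    return dups


if __name__ == "__main__":
    # The later 'pre_exec' overrides the earlier one without any warning.
    spec = {'pre_exec': ['module load python3/3.9.2'], 'pre_exec': []}
    print(spec['pre_exec'])  # prints []

    snippet = ("env_spec = {'type': 'virtualenv', "
               "'pre_exec': ['module load python3/3.9.2'], "
               "'pre_exec': [], 'setup': []}")
    print(duplicate_dict_keys(snippet))  # prints ['pre_exec']
```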

andre-merzky commented 2 years ago

Yes, that pre_exec looks right - let me try to reproduce this. Thanks.

andre-merzky commented 2 years ago

Oh wait - the module load is for python3.9, the missing library is for 3.7 - so it's using the wrong python binary.

And indeed, when you check /scratch1/07305/rpilot/aymen/ve/raptor/bin you'll find python3.7, which corresponds to what you specified in the original env_spec (`'version': '3.7'`). So I may have sent you the wrong way, apologies...

But, back to square one: did you let prepare_env create the VE, or did you create it manually? If the former, please remove the VE and let the agent try to recreate it. If the latter, you need to be very careful to specify the same set of modules in the env_spec as were active when you created the VE.
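
The check described above (looking into the VE's bin directory for the interpreter version) can be scripted before calling prepare_env. This is a minimal sketch under the assumption that the virtualenv contains a `bin/pythonX.Y` entry, as it does in this case; the helper names `ve_python_versions` and `version_matches` are hypothetical and not part of RP.

```python
import os
import re


def ve_python_versions(ve_path):
    """Return the sorted 'X.Y' versions for which bin/pythonX.Y exists."""
    bin_dir = os.path.join(ve_path, "bin")
    versions = set()
    for name in os.listdir(bin_dir):
        # Match names such as 'python3.7', but not 'python' or 'python3'.
        match = re.fullmatch(r"python(\d+\.\d+)", name)
        if match:
            versions.add(match.group(1))
    return sorted(versions)


def version_matches(ve_path, requested):
    """True if the VE ships an interpreter for the requested 'X.Y' version."""
    return requested in ve_python_versions(ve_path)
```

For the VE in this issue, `version_matches(path, '3.9')` would be expected to return False, since only python3.7 is present; the `'version'` entry and the loaded modules in env_spec should then be aligned with the interpreter the VE was built with.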

andre-merzky commented 2 years ago

This is fixed.