radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

Error on Stampede2 while loading shared libraries #110

Closed. Weiming-Hu closed this issue 4 years ago

Weiming-Hu commented 4 years ago

I get the following errors when submitting jobs to XSEDE Stampede2:

python3: error while loading shared libraries: libpython3.7m.so.1.0: cannot open shared object file: No such file or directory
python3: error while loading shared libraries: libpython3.7m.so.1.0: cannot open shared object file: No such file or directory
Resetting modules to system default. Reseting $MODULEPATH back to system default. All extra directories will be removed from $MODULEPATH.

The following have been reloaded with a version change:
  1) intel/18.0.0 => intel/18.0.2

python3: error while loading shared libraries: libpython3.7m.so.1.0: cannot open shared object file: No such file or directory

Lmod is automatically replacing "python2/2.7.15" with "python3/3.7.0".

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 5342592.0 ON c456-001 CANCELLED AT 2020-03-06T16:48:10 ***
srun: error: c456-001: task 0: Terminated

I load python3 in the job specification, as follows.

In workflow_cfg.yml:

stage-analogs:
  executable: "anen_grib"
  pre-exec: ["module load boost netcdf python3"]
  cpu:
    processes: 1
    process-type: 'MPI'
    threads-per-process: 1
    thread-type: 'OpenMP'

Then, in task_anen_gen.py:

t.pre_exec = stage_cfg['pre-exec']

Please let me know how I can debug this further. Thank you.
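As a possible first diagnostic step, the sketch below lists the shared libraries that the resolved python3 binary cannot find. It assumes `ldd` is available on the node and that the script runs in the same environment as the task, that is, after the pre-exec commands. The helper name `missing_shared_libs` is hypothetical and is not part of the workflow code.

```python
import shutil
import subprocess


def missing_shared_libs(ldd_output):
    """Return the names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # ldd prints unresolved entries as: "<name> => not found"
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


if __name__ == "__main__":
    # Inspect whichever python3 comes first on PATH in this environment.
    exe = shutil.which("python3")
    if exe is not None and shutil.which("ldd") is not None:
        result = subprocess.run(["ldd", exe], capture_output=True, text=True)
        print(exe, "is missing:", missing_shared_libs(result.stdout))
```

If libpython3.7m.so.1.0 shows up in the list, the library search path in the task environment probably does not include the directory provided by the python3 module. Comparing the output of `module list` and the value of `LD_LIBRARY_PATH` inside the job against a login shell is a reasonable next check.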

mturilli commented 4 years ago

Solved by using module purge
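The thread does not say where `module purge` was added. One plausible reading, shown here as a sketch only, is that it runs as the first pre-exec command so that the modules are loaded into a clean environment. The helper name `with_module_purge` is hypothetical; `stage_cfg` mirrors the `pre-exec` entry from workflow_cfg.yml above.

```python
def with_module_purge(pre_exec):
    """Return a copy of pre_exec with 'module purge' as the first command."""
    commands = list(pre_exec)
    # Avoid adding a second purge if the list already starts with one.
    if not commands or commands[0].strip() != "module purge":
        commands.insert(0, "module purge")
    return commands


# Value copied from the 'pre-exec' entry in workflow_cfg.yml.
stage_cfg = {"pre-exec": ["module load boost netcdf python3"]}

# Assumption: this list is what gets assigned to t.pre_exec.
pre_exec = with_module_purge(stage_cfg["pre-exec"])
print(pre_exec)
```

The same effect could be had by editing the YAML directly, with "module purge" as the first item of the `pre-exec` list.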