underworldcode / UWGeodynamics

Underworld Geodynamics

Issues when using Underworld2 on Gadi #238

Closed Peigen-L closed 3 years ago

Peigen-L commented 3 years ago

Hello Romain or Julian,

I am trying to use Underworld2 on Gadi. With kind help from the NCI experts, Underworld2 has been configured under this folder: /g/data/jq14/lp5029/codes/underworld/

With a well-tested 2D model, I got the following error message from Gadi:

Loading python3/3.7.4
  Loading requirement: intel-mkl/2019.3.199
Currently Loaded Modulefiles:
scons/3.1.1
pbs
openmpi/4.0.2
hdf5/1.10.5p
intel-mkl/2019.3.199
python3/3.7.4
petsc/3.12.2
/apps/python3/3.7.4/lib/python3.7/site-packages/pip/_internal/commands/install.py:283: UserWarning: Disabling all use of wheels due to the use of --build-options / --global-options / --install-options.
  cmdoptions.check_install_build_global(options)
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x1546d342ba50>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/h5py/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x1546d3486710>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/h5py/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x1546d3486f50>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/h5py/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x1546d3486dd0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/h5py/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x1546d3486790>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/h5py/
ERROR: Could not find a version that satisfies the requirement h5py (from versions: none)
ERROR: No matching distribution found for h5py

This is the PBS file I used for initializing Underworld:

#!/bin/bash

#PBS -P jq14
#PBS -q normal
#PBS -l walltime=1:00:00
#PBS -l mem=100GB
#PBS -l jobfs=10MB
#PBS -l ncpus=12
#PBS -l software=underworld
#PBS -l wd
#PBS -l storage=gdata/jq14
#PBS -N Model

source /g/data/jq14/lp5029/codes/underworld/docs/install_guides/nci_gadi.sh

export PYTHONPATH=/home/561/lp5029/.local/lib/python3.9/site-packages:/g/data/jq14/lp5029/codes/UWGeodynamics_2.9.6/lib/python3.9/site-packages

MODELNAME="testing_model"
OUTPUTPATH=`pwd`
SCRIPT="06_SlabSubduction.py"

# execution
mpiexec python3 ./$SCRIPT 1> $OUTPUTPATH/$MODELNAME.$PBS_JOBID.log 2> $OUTPUTPATH/$MODELNAME.$PBS_JOBID.err

This is the nci_gadi.sh file:

#!/bin/sh
# This script installs underworld on gadi.nci.org.au
# Note, swig will need to be installed and in your path.
# Also, swig4 doesn't seem to work, so use swig3. 
#
#
# Usage:
#  sh ./nci_gadi.sh <branch>
#
#  branch (optional): 
#     branch name to checkout, i.e. 'master'(default), 'development', 'x.y.z'

# exit when any command fails
set -e

UW_DIR=`pwd`/underworld
if [ ! -d "$UW_DIR" ] ; then
    git clone -q https://github.com/underworldcode/underworld2.git $UW_DIR
fi
cd $UW_DIR
git checkout $1  # checkout the requested version

# setup modules
module purge
RUN_MODS='pbs openmpi/4.0.2 hdf5/1.10.5p python3/3.7.4 petsc/3.12.2'
module load scons/3.1.1 $RUN_MODS
echo "*** The module list is: ***"
module list -t

# The following are probably necessary, as via the hdf5 module it is possible
# to suck in ompi3 libraries instead of the required ompi4 libs. 
export LD_PRELOAD=/apps/openmpi-mofed4.7-pbs19.2/4.0.2/lib/libmpi_usempif08_GNU.so.40:/apps/openmpi-mofed4.7-pbs19.2/4.0.2/lib/libmpi_usempi_ignore_tkr_GNU.so.40:/apps/openmpi-mofed4.7-pbs19.2/4.0.2/lib/libmpi_cxx.so.40

pip3 install --user mpi4py

export OMPI_MCA_io=ompio
export HDF5_VERSION=1.10.5
CC=h5pcc HDF5_MPI="ON" pip3 install --user --no-cache-dir --global-option=build_ext --global-option="-L/apps/hdf5/1.10.5p/lib/ompi3/" --no-binary=h5py h5py

# lavavu not supported on Gadi currently. 
#pip3 install --user lavavu

# build and install underworld
pip3 install --user -vvv .

# some messages
echo "#####################################################################"
echo "Underworld2 built                                                    "
echo "Remember to set the required environment before running Underworld2. "
echo "   module load $RUN_MODS                                             "
echo "You will also need to set the following environment variables:       "
echo "   export OMPI_MCA_io=ompio                                          "
echo "   export LD_PRELOAD=${LD_PRELOAD}:/apps/openmpi-mofed4.7-pbs19.2/4.0.2/lib/libmpi_usempif08_GNU.so.40"
echo "   export LD_PRELOAD=${LD_PRELOAD}:/apps/openmpi-mofed4.7-pbs19.2/4.0.2/lib/libmpi_usempi_ignore_tkr_GNU.so.40"
echo "   export LD_PRELOAD=${LD_PRELOAD}:/apps/openmpi-mofed4.7-pbs19.2/4.0.2/lib/libmpi_cxx.so.40"
echo "#####################################################################"

Is there anything I can do to get Underworld2 up and running on Gadi?

Kind regards, Peigen

julesghub commented 3 years ago

Hi Peigen, we have updated our Gadi install scripts for the upcoming 2.11 release of Underworld. Please see this example script: https://github.com/underworldcode/underworld2/blob/v2.11_release/docs/install_guides/nci_gadi/gadi.sh

We also have a machine accessible installation available at /g/data/m18/codes/ on Gadi.

Peigen-L commented 3 years ago

Thanks for this, Julian. Does "a machine accessible installation available" mean that I can use sample.pbs directly to run my model, changing only the PBS options, foodar.py, and the model name?

julesghub commented 3 years ago

@Peigen-L Yes, you should be able to use sample.pbs. Let me know if you have issues with it.

Please do not change any pip packages in the virtualenv. If you need custom Python packages, install them in your own directory space and use PYTHONPATH to reference them.
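As a rough illustration of the "own directory plus PYTHONPATH" approach: packages installed with `pip install --prefix=<dir>` on Linux normally land under `<dir>/lib/pythonX.Y/site-packages`, where X.Y is the Python that ran pip. The sketch below computes that path for a given interpreter version so the PYTHONPATH entry matches the virtualenv's Python. The helper name `user_site_packages` and the prefix `/g/data/jq14/lp5029/mypkgs` are made up for this example.

```python
import os
import sys


def user_site_packages(prefix, version_info=sys.version_info):
    """Return the site-packages path of a POSIX '--prefix' layout for one Python version."""
    # The pythonX.Y component must match the interpreter that will import the packages.
    py_dir = "python{}.{}".format(*version_info[:2])
    return os.path.join(prefix, "lib", py_dir, "site-packages")


if __name__ == "__main__":
    # Hypothetical prefix; replace with a directory you own.
    print(user_site_packages("/g/data/jq14/lp5029/mypkgs"))
```

Running this inside the activated virtualenv prints the directory to add to PYTHONPATH for that Python version.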

Peigen-L commented 3 years ago

@julesghub Hi Julian, thank you for sharing. I ran a small test model using sample.pbs and got this error message from Gadi:

/local/spool/pbs/mom_priv/jobs/25158339.gadi-pbs.SC: line 12: /g/data/m18/codes/UWGeodynamics_2.10.0.sh: No such file or directory

This is the PBS file I used:

#!/bin/bash
#PBS -P jq14
#PBS -q normal
#PBS -l walltime=48:00:00
#PBS -l mem=100GB
#PBS -l jobfs=100GB
#PBS -l ncpus=128
#PBS -l software=underworld
#PBS -l wd
#PBS -l storage=gdata/jq14

source /g/data/m18/codes/UWGeodynamics_2.10.0.sh
export PYTHONPATH=/home/561/lp5029/.local/lib/python3.9/site-packages:/g/data/jq14/lp5029/codes/UWGeodynamics_2.9.6/lib/python3.9/site-packages

MODELNAME="3D_testing"
OUTPUTPATH=`pwd`
SCRIPT="/scratch/jq14/lp5029/test/Upper_smaller_model.py"

export OPENBLAS_NUM_THREADS=1
# execution
mpiexec python3 ./$SCRIPT 1> $OUTPUTPATH/$MODELNAME.$PBS_JOBID.log 2> $OUTPUTPATH/$MODELNAME.$PBS_JOBID.err

Also, can I source files in the m18 folder under /g/data even though I am not in the group? Please help.

Regards, Peigen

julesghub commented 3 years ago

Change the line #PBS -l storage=gdata/jq14 to #PBS -l storage=gdata/m18.

That will give your job's compute nodes access to the /g/data/m18 directory. You should have read/execute access to it.
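If a job needs more than one project area (for example m18 for the shared install and jq14 for your own files), my understanding of the NCI convention is that the mounts are joined with `+` in a single storage directive. The sketch below derives such a line from the paths a job touches. Treat the `+` syntax as an assumption to check against the NCI documentation; `storage_directive` is a hypothetical helper, not an NCI or Underworld tool.

```python
def storage_directive(paths):
    """Build a '#PBS -l storage=...' line covering the /g/data and /scratch paths given."""
    mounts = []
    for path in paths:
        parts = path.strip("/").split("/")
        if parts[:2] == ["g", "data"] and len(parts) >= 3:
            mount = "gdata/" + parts[2]      # /g/data/<project>/...
        elif parts[0] == "scratch" and len(parts) >= 2:
            mount = "scratch/" + parts[1]    # /scratch/<project>/...
        else:
            continue                         # other locations are skipped in this sketch
        if mount not in mounts:
            mounts.append(mount)             # keep first-seen order, no duplicates
    return "#PBS -l storage=" + "+".join(mounts)


if __name__ == "__main__":
    print(storage_directive([
        "/g/data/m18/codes/UWGeodynamics_2.10.0.sh",
        "/g/data/jq14/lp5029/codes",
        "/scratch/jq14/lp5029/test/Upper_smaller_model.py",
    ]))
```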

Peigen-L commented 3 years ago

After changing the line to storage=gdata/m18, the error remains the same:

/local/spool/pbs/mom_priv/jobs/25162236.gadi-pbs.SC: line 12: /g/data/m18/codes/UWGeodynamics_2.10.0.sh: No such file or directory

I also can't find the folder m18 under /g/data/. Please help.

julesghub commented 3 years ago

Can you run source /g/data/m18/codes/UWGeodynamics_2.10.0.sh from the command line? Does the virtualenv start up correctly? You can check with pip list; it should contain underworld and UWGeodynamics.

Peigen-L commented 3 years ago

Sorry, I can't source this file from my side:

bash: /g/data/m18/codes/UWGeodynamics_2.10.0.sh: No such file or directory

julesghub commented 3 years ago

Oh, I recall now: we created a group on Gadi called underworld, and only users within that group have read/execute access. Either 1) we get you into that group, or 2) you use the gadi.sh install script from the GitHub link I sent you before.
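To see how far a path actually resolves for your account, something like the following can help. It walks up from the target until it finds a directory that exists, which shows whether the project folder itself is missing from your view. This is only a diagnostic sketch; the function names are made up here, and it says nothing about why a folder is hidden.

```python
import os


def deepest_existing(path):
    """Return the longest leading part of `path` that exists for the current user."""
    current = os.path.abspath(path)
    while not os.path.exists(current):
        parent = os.path.dirname(current)
        if parent == current:   # reached the filesystem root
            break
        current = parent
    return current


def report(path):
    """Summarise whether `path` is visible and, if it is, whether it is readable."""
    found = deepest_existing(path)
    if found == os.path.abspath(path):
        return "{}: exists, readable={}".format(found, os.access(found, os.R_OK))
    return "{}: not visible past {}".format(path, found)


if __name__ == "__main__":
    print(report("/g/data/m18/codes/UWGeodynamics_2.10.0.sh"))
```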

Peigen-L commented 3 years ago

Can you add me to the group?

julesghub commented 3 years ago

I'll need your gadi username.

Peigen-L commented 3 years ago

> I'll need your gadi username.

lp5029

julesghub commented 3 years ago

I have asked the NCI admin to add you to the underworld group.

Peigen-L commented 3 years ago

Thanks!

Peigen-L commented 3 years ago

I have requested to join the underworld group. Please approve it. Thank you.

Peigen-L commented 3 years ago

@julesghub Hi Julian, thanks for the invitation on Gadi. I have joined the underworld software group on NCI. However, I still cannot source the file under m18:

source /g/data/m18/codes/UWGeodynamics_2.10.0.sh
bash: /g/data/m18/codes/UWGeodynamics_2.10.0.sh: No such file or directory

This is the information I got from NCI experts:

hello luo,

i cannot see you are part of the m18 project group. please go here to join the group: https://my.nci.org.au/mancini/project/m18/join

and then let the CI of the project know so they can approve your request.

subsequently you will need to logout and login to see project m18 in your /g/data directory.

regards, javed

Should I submit another request, this time to join m18 (Instabilities in the convecting mantle and lithosphere)?

Peigen-L commented 3 years ago

@julesghub Hi Julian, thanks again for letting me into the m18 group. I can now source the files in m18, and the virtualenv starts up properly:

(UWGeodynamics_2.10.2) [lp5029@gadi-login-09 test]$

In the pip list output I can see underworld 2.10.1b0 and UWGeodynamics 2.10.2.

However, when I use sample.pbs to run the model, I get this error message from Gadi:

Traceback (most recent call last):
  File "/g/data/m18/codes/UWGeodynamics_2.10.2/lib/python3.7/site-packages/UWGeodynamics/__init__.py", line 5, in <module>
    import underworld
  File "/home/561/lp5029/.local/lib/python3.9/site-packages/underworld/__init__.py", line 52, in <module>
    import h5py as _h5py
  File "/home/561/lp5029/.local/lib/python3.9/site-packages/h5py/__init__.py", line 25, in <module>
    from . import _errors
ImportError: cannot import name '_errors' from 'h5py' (/home/561/lp5029/.local/lib/python3.9/site-packages/h5py/__init__.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./Upper_smaller_model.py", line 7, in <module>
    import UWGeodynamics as GEO
  File "/g/data/m18/codes/UWGeodynamics_2.10.2/lib/python3.7/site-packages/UWGeodynamics/__init__.py", line 8, in <module>
    raise ImportError("Can not find Underworld, please check your installation")
ImportError: Can not find Underworld, please check your installation

And this is the PBS I used:

#!/bin/bash
#PBS -P jq14
#PBS -q normal
#PBS -l walltime=48:00:00
#PBS -l mem=100GB
#PBS -l jobfs=100GB
#PBS -l ncpus=128
#PBS -l software=underworld
#PBS -l wd
#PBS -l storage=gdata/jq14

source /g/data/m18/codes/UWGeodynamics_2.10.0.sh
export PYTHONPATH=/home/561/lp5029/.local/lib/python3.9/site-packages:/g/data/jq14/lp5029/codes/UWGeodynamics_2.9.6/lib/python3.9/site-packages

MODELNAME="3D_testing"
OUTPUTPATH=`pwd`
SCRIPT="/scratch/jq14/lp5029/test/Upper_smaller_model.py"

export OPENBLAS_NUM_THREADS=1
# execution
mpiexec python3 ./$SCRIPT 1> $OUTPUTPATH/$MODELNAME.$PBS_JOBID.log 2> $OUTPUTPATH/$MODELNAME.$PBS_JOBID.err

I submitted the job with qsub sample.pbs and got this error message in the Gadi log. Please help.

Peigen

julesghub commented 3 years ago

Get rid of this line: export PYTHONPATH=/home/561/lp5029/.local/lib/python3.9/site-packages:/g/data/jq14/lp5029/codes/UWGeodynamics_2.9.6/lib/python

julesghub commented 3 years ago

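It's causing an issue because of the h5py install in your .local directory. PYTHONPATH entries are searched before the virtualenv's own site-packages, so the Python 3.7 virtualenv picks up the underworld and h5py copies under python3.9 in .local (as your traceback shows) instead of its own.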
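A quick way to spot this kind of clash is to look for search-path entries built for a different Python than the one running, and to check where a module would be loaded from. The sketch below is an illustration only and not part of Underworld; `mismatched_entries` and `module_origin` are names made up for it, and the version check simply looks for a `pythonX.Y` component in each path.

```python
import importlib.util
import re
import sys


def mismatched_entries(paths, version=None):
    """Return entries whose 'pythonX.Y' path component differs from `version`."""
    if version is None:
        # Default to the interpreter that is running this script.
        version = "{}.{}".format(*sys.version_info[:2])
    pattern = re.compile(r"python(\d+\.\d+)")
    bad = []
    for entry in paths:
        match = pattern.search(entry)
        if match and match.group(1) != version:
            bad.append(entry)
    return bad


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    for entry in mismatched_entries(sys.path):
        print("built for a different Python:", entry)
    for name in ("h5py", "underworld"):
        print(name, "->", module_origin(name))
```

Run with the virtualenv's python3 inside the job environment; any path printed by the first loop is a candidate to drop from PYTHONPATH.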

Peigen-L commented 3 years ago

Thanks! It is working now! The testing results come out correctly!