payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Error using intel-mpi #336

Open aidanheerdegen opened 1 year ago

aidanheerdegen commented 1 year ago

It seems that, by default, payu will not work with an executable compiled with intel-mpi on Gadi:

https://forum.access-hive.org.au/t/error-with-payu-and-loading-modules/679

This error is thrown:

Currently Loaded Modulefiles:
 1) pbs   2) openmpi/4.1.4(default)  
MODULE ERROR DETECTED: GLOBALERR intel-mpi/2019.5.281 cannot be loaded due to a conflict.
(Detailed error information and backtrace has been suppressed, set $MODULES_ERROR_BACKTRACE to unsuppress.)

Loading intel-mpi/2019.5.281
  ERROR: intel-mpi/2019.5.281 cannot be loaded due to a conflict.
    HINT: Might try "module unload openmpi" first.
payu: Model exited with error code 127; aborting.

It's because payu assumes the MPI library is openmpi and adds that to the list of automatically loaded modules:

https://github.com/payu-org/payu/blob/master/payu/experiment.py#L239

The workaround is to add this to config.yaml:

mpi:
   module: intel-mpi

but payu should probably do something a bit better than this by default.
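As a rough illustration of the behaviour described above, here is a minimal sketch of choosing the MPI module: an explicit `mpi: module:` entry in config.yaml wins, and openmpi is only used as a fallback. This is not payu's actual code. The names `select_mpi_module`, `module_name` and `KNOWN_MPI_MODULES` are hypothetical, and skipping the default when the user's own module list already names an MPI library is only one possible idea for a better default.

```python
# Hypothetical helper for illustration only; not taken from payu.
KNOWN_MPI_MODULES = ("openmpi", "intel-mpi")  # assumed list of MPI module names


def module_name(module):
    """Strip any version suffix, e.g. 'intel-mpi/2019.5.281' -> 'intel-mpi'."""
    return module.split("/", 1)[0]


def select_mpi_module(config, user_modules=()):
    """Return the MPI module to load, or None if the user already loads one."""
    # An explicit setting in config.yaml (mpi: module: ...) always wins.
    explicit = (config.get("mpi") or {}).get("module")
    if explicit:
        return explicit
    # If the user's module list already names an MPI library, do not add
    # a second one, since the two would conflict when loaded together.
    for module in user_modules:
        if module_name(module) in KNOWN_MPI_MODULES:
            return None
    # Otherwise fall back to the default described in this issue.
    return "openmpi"
```

With the workaround config, `select_mpi_module({"mpi": {"module": "intel-mpi"}})` returns `"intel-mpi"`, while an empty config still returns `"openmpi"`.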

aidanheerdegen commented 11 months ago

Follow-up investigation suggests that the mpirun wrapper from the intel-mpi package doesn't support -wdir type arguments, so the model just runs in the top-level directory.

https://forum.access-hive.org.au/t/error-with-payu-and-loading-modules/679/8
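One way to avoid relying on a -wdir style flag would be to set the working directory of the launcher process itself. The sketch below is an assumption-laden illustration, not payu's implementation: `run_in_directory` is a hypothetical name, and it has not been tested against Intel MPI. Whether ranks started on other nodes inherit the launcher's directory depends on the MPI launcher.

```python
import subprocess


def run_in_directory(cmd, work_path):
    """Run cmd with work_path as its working directory; return the exit code."""
    # cwd= starts the child process in work_path, so no -wdir flag is
    # passed to the launcher at all.
    return subprocess.run(cmd, cwd=work_path, check=False).returncode
```

A call might look like `run_in_directory(["mpirun", "-np", "4", "./model.exe"], "work")`, where the command and path are placeholders.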

It may be that there used to be an mpirun wrapper script for all MPI implementations, but now they are MPI library specific?

In any case, Intel MPI probably won't work out of the box at NCI.