Open EdwardSnyder-NOAA opened 10 months ago
The release container can run on Hercules and Orion with the changes listed below. These changes are needed to run the develop container on these platforms as well.
Hercules
Orion
Gaea is currently experiencing Lustre security issues, so container use on Gaea was disabled (i.e. singularity commands don't work). The containers will be enabled when the issue is patched, or when the new F5 file system is available. In addition, container use on Gaea is in a beta state.
Expected behavior
The previous release (v2.1.0) container was able to run on all T1 platforms using the instructions listed in the documentation.
Current behavior
The current release container (ubuntu20.04-intel-srwapp-release-public-v2.2.0.img) doesn't run on Cheyenne/Derecho, Gaea, or Orion/Hercules using the instructions mentioned in the previous section.
Machines affected
FATAL ERROR: ERROR IN NF90_CREATE: Permission denied, but after updating the srun command to srun --mpi=pmi2, the PET log files reveal that the process is hanging on FieldRegridStore. One way to remedy this is to reduce the number of cores or to run on a single node. Unfortunately, no combination worked for me. So I ran this step on one node with the time command, which worked. I then ran the make_ics and make_lbcs steps with the time command successfully on a single node. However, the run_fcst step fails with the time command.
mpiexec -np $nprocs command. I was able to get the remaining steps (make_ics, make_lbcs, run_fcst, and run_post) to pass by running them on a single node with the mpiexec executable path appended to the PATH var in the exregional script for each of these steps. With these changes, the community case passes on Orion.
I think all these issues are related and are all caused by conflicts between the native intel/mpi environment and the intel/mpi environment in the container. Also, Orion has some sort of access or permissions issue, because it has trouble seeing the platform variables like mpiexec and python.
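For anyone trying the PATH workaround, here is a rough sketch of what appending the mpiexec directory could look like near the top of an exregional script. The helper names (append_to_path, check_tools) and the MPI_BIN_DIR variable are hypothetical, and the directory shown is only a placeholder, not the actual path on Orion; substitute the location of the native mpiexec on your platform.

```shell
#!/usr/bin/env bash

# Append a directory to PATH only if it is not already listed,
# so sourcing the script repeatedly does not grow PATH.
append_to_path() {
  local dir="$1"
  case ":${PATH}:" in
    *":${dir}:"*) ;;                          # already present, nothing to do
    *) PATH="${PATH:+${PATH}:}${dir}" ;;      # append at the end
  esac
  export PATH
}

# Report whether each named tool is visible in the current environment.
# Returns non-zero if at least one tool cannot be found.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool -> $(command -v "$tool")"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Placeholder directory (assumption): replace with the real mpiexec location.
MPI_BIN_DIR="${MPI_BIN_DIR:-/path/to/native/mpi/bin}"
append_to_path "$MPI_BIN_DIR"

# Quick diagnostic for the "can't see mpiexec/python" symptom.
check_tools mpiexec python || echo "some tools are not visible in this environment"
```

Running check_tools before and after the PATH change, both on the host and from inside the container shell, may help narrow down whether the failure comes from the environment conflict or from a permissions problem.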
Steps To Reproduce
Run rocotorun until you encounter an error.