ufs-community / ufs-srweather-app

UFS Short-Range Weather Application

Trouble running release v2.2.0 container on some tier 1 platforms #961

Open EdwardSnyder-NOAA opened 10 months ago

EdwardSnyder-NOAA commented 10 months ago

Expected behavior

The previous release (v2.1.0) container was able to run on all T1 platforms using the instructions listed in the documentation.

Current behavior

The current release container (ubuntu20.04-intel-srwapp-release-public-v2.2.0.img) does not run on Cheyenne/Derecho, Gaea, or Orion/Hercules using the instructions mentioned in the previous section.

Machines affected

 - CALL FieldRegridStore.
 - CALL FieldRegridStore.
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
sfc_climo_gen      0000000001837F1B  Unknown               Unknown  Unknown

I think these issues are all related and are caused by conflicts between the native Intel/MPI environment on the host and the Intel/MPI environment inside the container. Orion also appears to have an access or permissions issue, because it has trouble finding platform executables such as mpiexec and python.
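
One way to check both suspicions quickly is sketched below. It is only an illustrative diagnostic, not part of the SRW App: the helper names `missing_tools` and `mpi_env` are hypothetical, and filtering on the `I_MPI_` prefix assumes the conflicting settings come from Intel MPI environment variables. Running it once on the host and once inside the container would show which executables fail to resolve and which Intel MPI variables differ.

```python
import os
import shutil


def missing_tools(tools, path=None):
    """Return the names in `tools` that do not resolve to an executable.

    `path` is a PATH-style string; None means use the current PATH.
    """
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


def mpi_env(environ=None):
    """Return the Intel MPI variables (assumed prefix: I_MPI_) that are set."""
    environ = os.environ if environ is None else environ
    return {key: value for key, value in environ.items() if key.startswith("I_MPI_")}


def report():
    """Print a short summary for the current environment."""
    print("Not found on PATH:", missing_tools(["mpiexec", "python"]))
    for key, value in sorted(mpi_env().items()):
        print(f"{key}={value}")
```

Calling `report()` in both environments and comparing the output should make a mismatch visible, if one exists.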

Steps To Reproduce

  1. Pick one of the platforms mentioned above.
  2. Follow the instructions mentioned here.
  3. Once the experiment is built, run tasks with rocotorun until you encounter an error.
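
Step 3 can be automated with a small polling loop. The sketch below is an assumption-laden illustration rather than a supported tool: it assumes the workflow files are named `FV3LAM_wflow.xml` and `FV3LAM_wflow.db`, that `rocotorun` and `rocotostat` are on the PATH, and that `rocotostat` prints whitespace-separated columns beginning with CYCLE, TASK, JOBID, STATE. The function names are hypothetical.

```python
import subprocess
import time

# Assumed file names; check the experiment directory before relying on them.
WFLOW_XML = "FV3LAM_wflow.xml"
WFLOW_DB = "FV3LAM_wflow.db"


def dead_tasks(rocotostat_output):
    """Return (cycle, task) pairs whose STATE column reads DEAD."""
    dead = []
    for line in rocotostat_output.splitlines():
        fields = line.split()
        # Assumed column order: CYCLE TASK JOBID STATE ...
        if len(fields) >= 4 and fields[3] == "DEAD":
            dead.append((fields[0], fields[1]))
    return dead


def run_until_error(expt_dir, interval_s=60, max_iters=1000):
    """Advance the workflow until a task is DEAD or max_iters is reached."""
    args = ["-w", WFLOW_XML, "-d", WFLOW_DB]
    for _ in range(max_iters):
        # Submit any tasks that are ready to run.
        subprocess.run(["rocotorun", *args, "-v", "10"], cwd=expt_dir, check=True)
        # Capture the current task table and look for failures.
        stat = subprocess.run(
            ["rocotostat", *args],
            cwd=expt_dir, check=True, capture_output=True, text=True,
        )
        failed = dead_tasks(stat.stdout)
        if failed:
            return failed
        time.sleep(interval_s)
    return []
```

`run_until_error("/path/to/expt_dir")` would return the first set of failed tasks, whose log files can then be inspected for errors like the one shown above.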

EdwardSnyder-NOAA commented 9 months ago

The release container can run on Hercules and Orion with the changes listed below. These changes are needed to run the develop container on these platforms as well.

Hercules

Orion

EdwardSnyder-NOAA commented 8 months ago

Gaea is currently experiencing Lustre security issues, so container use on Gaea has been disabled (i.e., singularity commands don't work). Containers will be re-enabled once the issue is patched or once the new F5 file system is available. In addition, container use on Gaea is in a beta state.