ufs-community / UFS_UTILS

Utilities for the NCEP models.

chgres_cube consistency test failures on Orion #609

Closed. GeorgeGayno-NOAA closed this issue 2 years ago.

GeorgeGayno-NOAA commented 2 years ago

Occasionally, some of the chgres_cube consistency tests fail on Orion with a 'bus error'. The failures are random. The system admins recommend explicitly requesting the memory each job needs in the driver script, for example with --mem=50G. Preliminary tests show this solves the problem. (By default, Slurm allocates 54GB on Orion.)
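
As a rough illustration of the admins' suggestion, a Slurm batch header with an explicit memory request might look like the sketch below. Only --mem=50G comes from this issue; the job name, node count, and wall time are hypothetical placeholders and are not taken from the actual driver script.

```shell
#!/bin/bash
# Minimal sketch of a Slurm job header with an explicit memory request.
# Everything except --mem=50G is a hypothetical placeholder.
#SBATCH --job-name=chgres_test   # hypothetical job name
#SBATCH --nodes=1                # hypothetical node count
#SBATCH --time=00:15:00          # hypothetical wall-clock limit
#SBATCH --mem=50G                # memory requested per node (value from this issue)

# The command that runs the test would follow here.
```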

GeorgeGayno-NOAA commented 2 years ago

The system admins said to use this command to determine how much memory a job is using.

sacct -j 3908276 --format=jobid,jobname,state,alloctres%35,maxrss

GeorgeGayno-NOAA commented 2 years ago

Using the sacct command, I adjusted the requested memory for each test (b5d6ab6). Then I tested the updated script on Orion.
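
One way to turn a MaxRSS value reported by sacct into a --mem request is sketched below. This is not the method used in b5d6ab6. The helper name suggest_mem and the 20% headroom are assumptions made for illustration. The sketch also assumes MaxRSS carries a K, M, or G suffix and treats a value with no suffix as bytes.

```shell
#!/bin/bash
# Hypothetical helper: convert a MaxRSS value (e.g. "4194304K") into a
# --mem request in whole gigabytes, adding 20% headroom (an assumed margin).
suggest_mem() {
  echo "$1" | awk '{
    value = $1 + 0                        # numeric part of the field
    unit = substr($1, length($1), 1)      # last character, used as the unit
    if (unit == "K")      kb = value
    else if (unit == "M") kb = value * 1024
    else if (unit == "G") kb = value * 1048576
    else                  kb = value / 1024   # no suffix: assume bytes
    kb = int(kb)
    need_kb = int((kb * 12 + 9) / 10)         # add 20%, rounding up
    gb = int((need_kb + 1048575) / 1048576)   # round up to whole GB
    if (gb < 1) gb = 1
    printf "--mem=%dG\n", gb
  }'
}

# Example: a job whose peak resident memory was 4 GB
suggest_mem 4194304K
```

The result is a starting point only; how much margin a given test needs is a judgment call.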

All tests ran successfully six times in a row. Previously, one of the 16 tests would always fail.