ufs-community / ufs-mrweather-app

UFS Medium-Range Weather Application

RT on cheyenne: setting stack size #195

Closed llpcarson closed 3 years ago

llpcarson commented 3 years ago

While attempting to track down intermittent crashes, I tried to set the stacksize for the RT to a fixed limit (vs unlimited). Doing so causes the compile step to fail: the compilers seg-fault (both icc and ifort)!

The original code runs OK (so not a system problem, it seems). Removing the RLIMIT_STACK section also crashes in the same way.

Details: I changed the stacksize limit in CIME, and found a very odd error.

/glade/scratch/carson/ufs/mrw.test/stack/ufs-mrweather-app/cime/scripts (and related). in cime/config/ufs/machines/config_machines.xml,

run-dirs are at:

/glade/scratch/carson/ufs/mrw.test/stack/*

Errors are at compile-time:

Case dir: /glade/scratch/carson/ufs/mrw.test/stack/SMS_Lh3.C96.GFSv15p2.cheyenne_intel.20200914_152815_3okd1u

Errors are:

Building test for SMS in directory /glade/scratch/carson/ufs/mrw.test/stack/SMS_Lh3.C96.GFSv15p2.cheyenne_intel.20200914_152815_3okd1u
ERROR: /glade/scratch/carson/ufs/mrw.test/stack/ufs-mrweather-app/cime/src/build_scripts/buildlib.cprnc FAILED, cat /glade/scratch/carson/ufs/mrw.test/stack/SMS_Lh3.C96.GFSv15p2.cheyenne_intel.20200914_152815_3okd1u/bld/cprnc.bldlog.200914-152846

and that log file has the compiler-seg-fault errors I thought were related to a machine problem...

/glade/u/apps/ch/opt/ncarcompilers/0.5.0/intel/19.0.5/ifort: line 116: 4103 Segmentation fault

llpcarson commented 3 years ago

Based on this: RLIMIT_STACK The maximum size of the process stack, in bytes. Upon reaching this limit, a SIGSEGV signal is generated.

I'm going to speculate that the stacksize number is not being used correctly, effectively setting a stacksize of zero (maybe?)... hence even the compiler is seg-faulting?

I tried a single-test case and ran case.build -v --- and didn't get much more useful information.

llpcarson commented 3 years ago

The correct value does get into:

env_mach_specific.xml

so - maybe my speculation is a false lead!

uturuncoglu commented 3 years ago

You could make your changes in env_mach_specific.xml (it is in case folder) and issue ./case.setup --reset --keep env_mach_specific.xml and try to build again.

climbfuji commented 3 years ago

> Based on this: RLIMIT_STACK The maximum size of the process stack, in bytes. Upon reaching this limit, a SIGSEGV signal is generated.
>
> I'm going to speculate that the stacksize number is not being used correctly, effectively setting a stacksize of zero (maybe?)... hence even the compiler is seg-faulting?
>
> I tried a single-test case and ran case.build -v --- and didn't get much more useful information.

According to your comment, RLIMIT_STACK is in bytes. However, the limit reported by bash via ulimit -s is in kbytes (example from stampede, values don't matter):

...
stack size              (kbytes, -s) unlimited
...

So it seems you should try 300000 (the value reported on cheyenne by ulimit -s, a little less than 300 MB) times 1024, if RLIMIT_STACK is really in bytes.
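The unit conversion can be checked with a short script. This is only a minimal sketch, assuming Python's standard `resource` module on a Unix-like system; the helper names `kbytes_to_bytes` and `format_limit` are made up for illustration and are not part of CIME.

```python
import resource


def kbytes_to_bytes(kbytes: int) -> int:
    """Convert a `ulimit -s` style value (kbytes) to the bytes used by RLIMIT_STACK."""
    return kbytes * 1024


def format_limit(value: int) -> str:
    """Render an rlimit value in bytes, or 'unlimited' for RLIM_INFINITY."""
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"


if __name__ == "__main__":
    # getrlimit returns the (soft, hard) pair for the current process, in bytes.
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    print("soft:", format_limit(soft), "hard:", format_limit(hard))
    # The 300000 kbytes reported by `ulimit -s`, expressed in bytes.
    print("300000 kbytes =", kbytes_to_bytes(300000), "bytes")
```

With the value discussed here, 300000 kbytes works out to 307200000 bytes.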

llpcarson commented 3 years ago

Using "bytes", the compile is running OK so far. I will start a full RT shortly to see if this fixes the other intermittent errors.
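For anyone wanting to see why a wrong unit would also affect the compilers: resource limits are inherited by child processes, so a limit set by the build driver applies to everything it launches. The sketch below is an illustration only, not CIME's actual code. It assumes a Unix-like system with `sh` on the PATH and a hard stack limit of at least 8 MiB; the function name `stack_kbytes_seen_by_child` is hypothetical.

```python
import resource
import subprocess


def stack_kbytes_seen_by_child(soft_bytes: int) -> str:
    """Run `ulimit -s` in a child shell whose soft stack limit is soft_bytes."""

    def lower_stack_limit() -> None:
        # Runs in the child between fork and exec; only the soft limit changes.
        _, hard = resource.getrlimit(resource.RLIMIT_STACK)
        resource.setrlimit(resource.RLIMIT_STACK, (soft_bytes, hard))

    result = subprocess.run(
        ["sh", "-c", "ulimit -s"],
        preexec_fn=lower_stack_limit,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # The limit is set in bytes; the shell reports the same limit in kbytes.
    print(stack_kbytes_seen_by_child(8 * 1024 * 1024))
```

The limit is passed to `setrlimit` in bytes, while the child shell reports it in kbytes, which is the same mismatch discussed above. By the same arithmetic, a value of 300000 taken as bytes would be roughly 293 kbytes of stack.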

uturuncoglu commented 3 years ago

@llpcarson that is great.

llpcarson commented 3 years ago

After fixing the units of the stacksize (!), the tests compile OK, and have the same 2-3 failures in chgres_cube as before.

So - using a set (vs unlimited) stacksize does NOT resolve the seg-faults. I used both 500MB and 1GB (converted to bytes), same behavior.

climbfuji commented 3 years ago

I got one failure out of 19, too. See here: https://github.com/ufs-community/ufs-mrweather-app/issues/190#issuecomment-692707076

So this doesn't seem to be the issue, but at least none of our model runs were hanging or experiencing the grib2 reading issue. I wonder what else it could be that causes chgres_cube.exe to fail.

One thing we could try is to replace MPT with Intel MPI. @uturuncoglu what do you think? I believe I know enough CIME to make this change locally (and compile the NCEPLIBS and ufs-weather-model with it).

uturuncoglu commented 3 years ago

@llpcarson could you share your last configuration and let me know if I need to make any mods to CIME?

uturuncoglu commented 3 years ago

@climbfuji There could be a memory leak on the chgres_cube side, but it might be hard to find. For cheyenne, do you want to support only Intel MPI? The GNU configuration is also using MPT, I think.

llpcarson commented 3 years ago

@uturuncoglu I would suggest no changes to the configs based on stacksize, since it didn't change the behavior. If Dom's tests with impi work, then that would be a change.