Closed (llpcarson closed this issue 4 years ago)
Based on this (the RLIMIT_STACK description from the setrlimit man page): "The maximum size of the process stack, in bytes. Upon reaching this limit, a SIGSEGV signal is generated."
I'm going to speculate that the stacksize number is not being used correctly and is effectively setting a stacksize of zero (maybe?), hence even the compiler is seg-faulting?
I tried a single-test case and ran case.build -v, but didn't get much more useful information.
The correct value does get into env_mach_specific.xml, so maybe my speculation is a false lead!
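A minimal sketch of how to check that speculation (assuming Python is available on the build node; this is not from the thread itself) is to print the stack limit a process actually inherits, run from the same shell or job environment that launches the compilers:

```python
# Hypothetical sanity check: print the stack limit this process inherited.
# Run it from the same environment that launches icc/ifort.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_STACK)

def fmt(val):
    return "unlimited" if val == resource.RLIM_INFINITY else f"{val} bytes ({val / 2**20:.1f} MiB)"

print("RLIMIT_STACK soft limit:", fmt(soft))
print("RLIMIT_STACK hard limit:", fmt(hard))
```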
You could make your changes in env_mach_specific.xml (it is in the case folder), issue ./case.setup --reset --keep env_mach_specific.xml,
and try to build again.
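If you go that route, one rough way to confirm the new value survived the reset is to read it back out of the case's env_mach_specific.xml. The sketch below assumes a CIME-style <resource_limits> block containing <resource name="RLIMIT_STACK">; adjust the element names if your file differs:

```python
# Sketch (assumed element names): print the RLIMIT_STACK value recorded in
# the case's env_mach_specific.xml after ./case.setup --reset --keep ...
import xml.etree.ElementTree as ET

tree = ET.parse("env_mach_specific.xml")  # run from the case directory
for res in tree.getroot().iter("resource"):
    if res.get("name") == "RLIMIT_STACK":
        print("RLIMIT_STACK in env_mach_specific.xml:", res.text)
```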
According to your comment RLIMIT_STACK is in bytes. However, the limit reported by bash via ulimit -s
is in kbytes (example from stampede, values don't matter):
...
stack size (kbytes, -s) unlimited
...
So it seems you should try 300000 (the value reported on cheyenne by ulimit -s, which corresponds to a little less than 300 MB)
times 1024, if RLIMIT_STACK is really in bytes.
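Put differently, the ulimit -s value has to be multiplied by 1024 before it is handed to RLIMIT_STACK, which expects bytes. A minimal sketch of that conversion, using the cheyenne number quoted above as an illustrative value:

```python
# Illustrative conversion: `ulimit -s` reports kilobytes, RLIMIT_STACK wants bytes.
import resource

ulimit_s_kbytes = 300000               # value reported by `ulimit -s` on cheyenne
stack_bytes = ulimit_s_kbytes * 1024   # 307,200,000 bytes, roughly 293 MiB

# Keep the existing hard limit; the soft limit must not exceed it.
_, hard = resource.getrlimit(resource.RLIMIT_STACK)
resource.setrlimit(resource.RLIMIT_STACK, (stack_bytes, hard))
print("RLIMIT_STACK soft limit set to", stack_bytes, "bytes")
```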
Using "bytes", the compile is running OK so far. I will start a full RT shortly to see if this fixes the other intermittent errors.
@llpcarson that is great.
After fixing the units of the stacksize (!), the tests compile OK, and have the same 2-3 failures in chgres_cube as before.
So, using a set (vs. unlimited) stacksize does NOT resolve the seg-faults. I used both 500 MB and 1 GB (converted to bytes); same behavior.
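(For reference, and assuming binary units were used: 500 MB = 500 × 1024 × 1024 = 524,288,000 bytes, and 1 GB = 1024 × 1024 × 1024 = 1,073,741,824 bytes.)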
I got one failure out of 19, too. See here: https://github.com/ufs-community/ufs-mrweather-app/issues/190#issuecomment-692707076
So this doesn't seem to be the issue, but at least none of our model runs were hanging or experiencing the grib2 reading issue. I wonder what else it could be that causes chgres_cube.exe to fail.
One thing we could try is to replace MPT with Intel MPI. @uturuncoglu what do you think? I believe I know enough CIME to make this change locally (and compile the NCEPLIBS and ufs-weather-model with it).
@llpcarson could you share your last configuration and let me know if I need to put any mods into CIME?
@climbfuji There could be a memory leak on the chgres_cube side, but it might be hard to find. For cheyenne, do you want to support only Intel MPI? The GNU configuration is also using MPT, I think.
@uturuncoglu I would suggest no changes to the configs based on stacksize, since it didn't change the behavior. If Dom's tests with impi work, then that would be a change.
While attempting to track down intermittent crashes, I tried to set the stacksize for the RT to a fixed limit (vs unlimited). Doing so causes the compile step to fail: the compilers seg-fault (both icc and ifort)!
The original code runs OK (so it's not a system problem, it seems). Removing the RLIMIT_STACK section also crashes in the same way.
Details: I changed the stacksize limit in CIME (in cime/config/ufs/machines/config_machines.xml) and found a very odd error.
The working copy is under /glade/scratch/carson/ufs/mrw.test/stack/ufs-mrweather-app/cime/scripts (and related).
Run-dirs are at:
/glade/scratch/carson/ufs/mrw.test/stack/*
Errors are at compile-time:
Case dir: /glade/scratch/carson/ufs/mrw.test/stack/SMS_Lh3.C96.GFSv15p2.cheyenne_intel.20200914_152815_3okd1u
Errors are:
Building test for SMS in directory /glade/scratch/carson/ufs/mrw.test/stack/SMS_Lh3.C96.GFSv15p2.cheyenne_intel.20200914_152815_3okd1u
ERROR: /glade/scratch/carson/ufs/mrw.test/stack/ufs-mrweather-app/cime/src/build_scripts/buildlib.cprnc FAILED, cat /glade/scratch/carson/ufs/mrw.test/stack/SMS_Lh3.C96.GFSv15p2.cheyenne_intel.20200914_152815_3okd1u/bld/cprnc.bldlog.200914-152846
and that log file has the compiler-seg-fault errors I thought were related to a machine problem...
/glade/u/apps/ch/opt/ncarcompilers/0.5.0/intel/19.0.5/ifort: line 116: 4103 Segmentation fault
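For completeness, a standalone check of whether the compilers alone seg-fault under a reduced stack limit, outside of CIME, could look roughly like the sketch below; the compiler invocation and source file are placeholders:

```python
# Hypothetical reproducer: lower RLIMIT_STACK in a child process and invoke the
# compiler on a trivial source file, independent of the CIME build machinery.
import resource
import subprocess

STACK_BYTES = 300000 * 1024   # ~293 MiB, the cheyenne `ulimit -s` value in bytes

def limit_stack():
    # Applied in the child between fork and exec.
    _, hard = resource.getrlimit(resource.RLIMIT_STACK)
    resource.setrlimit(resource.RLIMIT_STACK, (STACK_BYTES, hard))

result = subprocess.run(
    ["ifort", "-c", "hello.f90"],          # any trivial Fortran file
    preexec_fn=limit_stack,
    capture_output=True, text=True,
)
print("return code:", result.returncode)   # -11 would indicate SIGSEGV
print(result.stderr)
```

A negative return code from the child (e.g. -11) would mean the compiler died on a signal, matching the segmentation faults reported in the build log above.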