Closed climbfuji closed 3 years ago
@climbfuji can you confirm the compiler and mpi used for the hpc-stack in this case? I tried to install the hpc-stack on our macos server, but it failed with Intel. Before fighting that battle, I'd rather just reproduce your exact build of the stack as closely as possible.
For reference, the ESMF internal ticket number for this is 3615089.
I am using gcc+gfortran 9.2.0 installed via homebrew (roughly following the installation guide in lines 5-50 in https://github.com/NOAA-EMC/NCEPLIBS-external/blob/develop/doc/README_macos_gccgfortran.txt).
I haven't tried clang-9.0.0+gfortran-9.2.0 yet (similarly, following lines 5-58 in https://github.com/NOAA-EMC/NCEPLIBS-external/blob/develop/doc/README_macos_clanggfortran.txt).
MPI is mpich 3.3.1-3.3.2.
Both work with bs21 (and anything earlier than that back to 7.1.0r). Since I can't get a proper stack trace with gcc+gfortran, I'll try clang+gfortran next, and then fall back on the good old strategy of adding print statements to see where it fails.
If you know of a tool for macOS that works similarly to addr2line on Linux, please share that knowledge with me. My Google search hasn't been successful thus far.
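For what it's worth, the closest macOS counterpart to addr2line that I am aware of is atos, which ships with Apple's developer tools. Below is a minimal sketch of picking the tool by platform. The helper name symbolizer_for is made up for illustration, and the commented atos call assumes the binary was built with debug symbols and that you know the load address of the image that contains the frame, neither of which is confirmed here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick an address-to-symbol tool for the given OS name.
# With no argument it falls back to the output of `uname -s`.
symbolizer_for() {
  local os="${1:-$(uname -s)}"
  case "$os" in
    Darwin) echo "atos" ;;       # Apple developer tools
    *)      echo "addr2line" ;;  # GNU binutils
  esac
}

# Example on macOS (commented out; <load-address> is a placeholder that has
# to be taken from the crashing process, it is not known from the log above):
#   atos -o ./fv3.exe -l <load-address> 0x118eb0f3d
```

Frames that live in a shared library rather than in fv3.exe itself would need that library passed to -o instead.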
Ok, with LLVM clang + GNU gfortran I get a little more information, maybe this is enough for the ESMF developers to see what is going on (unfortunately, still no information on source file or line numbers):
+ mpiexec.hydra -prepend-rank -n 6 ./fv3.exe
[0]
[0]
[0] * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
[0] PROGRAM nems HAS BEGUN. COMPILED 0.00 ORG: np23
[0] STARTING DATE-TIME NOV 30,2020 13:53:57.558 335 MON 2459184
[0]
[0]
[0] terminate called after throwing an instance of 'std::out_of_range'
[1] terminate called after throwing an instance of 'std::out_of_range'
[2] terminate called after throwing an instance of 'std::out_of_range'
[4] terminate called after throwing an instance of 'std::out_of_range'
[0] what(): map::at: key not found
[0]
[0] Program received signal SIGABRT: Process abort signal.
[0]
[0] Backtrace for this error:
[1] what(): map::at: key not found
[1]
[1] Program received signal SIGABRT: Process abort signal.
[1]
[1] Backtrace for this error:
[2] what(): map::at: key not found
[2]
[2] Program received signal SIGABRT: Process abort signal.
[2]
[2] Backtrace for this error:
[4] what(): map::at: key not found
[4]
[4] Program received signal SIGABRT: Process abort signal.
[4]
[4] Backtrace for this error:
[3] terminate called after throwing an instance of 'std::out_of_range'
[5] terminate called after throwing an instance of 'std::out_of_range'
[3] what(): map::at: key not found
[3]
[3] Program received signal SIGABRT: Process abort signal.
[3]
[3] Backtrace for this error:
[5] what(): map::at: key not found
[5]
[5] Program received signal SIGABRT: Process abort signal.
[5]
[5] Backtrace for this error:
[1] #0 0x11714cf3d
[1] #1 0x11714c34d
[1] #2 0x7fff6230eb5c
[2] #0 0x1192eff3d
[2] #1 0x1192ef34d
[2] #2 0x7fff6230eb5c
[3] #0 0x11fb77f3d
[3] #1 0x11fb7734d
[3] #2 0x7fff6230eb5c
[4] #0 0x11df0df3d
[4] #1 0x11df0d34d
[4] #2 0x7fff6230eb5c
[0] #0 0x118eb0f3d
[0] #1 0x118eb034d
[0] #2 0x7fff6230eb5c
[5] #0 0x1159f0f3d
[5] #1 0x1159f034d
[5] #2 0x7fff6230eb5c
Here is some more information. I enabled the output of all MESSAGE_CHECK lines. The last message written is "Create the NEMS Import/Export States". Looking at NEMS/src/MAIN_NEMS.F90, it fails between that message and the next one, "Execute the NEMS Component Initialize Step". This should narrow down our search.
MESSAGE_CHECK="Create the NEMS Import/Export States"
CALL ESMF_LogWrite(MESSAGE_CHECK,ESMF_LOGMSG_INFO,rc=RC)
! ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
!
NEMS_IMP_STATE=ESMF_StateCreate(name='NEMS Import State' &
,rc =RC)
ESMF_ERR_ABORT(RC)
!
NEMS_EXP_STATE=ESMF_StateCreate(name='NEMS Export State' &
,rc =RC)
ESMF_ERR_ABORT(RC)
!
! ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
!
!-----------------------------------------------------------------------
!*** Execute the INITIALIZE step for the NEMS component.
!*** The Initialize routine that is called here as well as the
!*** Run and Finalize routines invoked below are those specified
!*** in the Register routine called in ESMF_GridCompSetServices above.
!-----------------------------------------------------------------------
!
! ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
MESSAGE_CHECK="Execute the NEMS Component Initialize Step"
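To see quickly how far each task got, one can print the last line of every PET log in the run directory. This is only a sketch: the function name last_log_lines is hypothetical, it assumes the per-task logs follow the PET<n>.ESMF_LogFile naming, and if log output is buffered the last line on disk may lag behind the actual point of failure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<file>: <last line>" for each ESMF PET log
# found in the given directory (default: current directory).
last_log_lines() {
  local dir="${1:-.}" f
  for f in "$dir"/PET*.ESMF_LogFile; do
    [ -e "$f" ] || continue   # skip the unexpanded glob when no logs exist
    printf '%s: %s\n' "$(basename "$f")" "$(tail -n 1 "$f")"
  done
}

# Usage: last_log_lines /path/to/rundir
```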
@climbfuji thanks for helping to narrow this down.
I am trying to build the hpc-stack on our mac server. I am using gnu 8.4.0 and I get a failure building the PIO package (see below).
I am building with:
./build_stack.sh -p $PWD/install-gnu -c config/config_mac.sh -y config/stack_mac.yaml
The error:
[ 77%] Linking C executable darray_no_async
cd /project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/build/examples/c && /Volumes/esmf/rocky/ufs/cmake-3.19.1-Darwin-x86_64/CMake.app/Contents/bin/cmake -E cmake_link_script CMakeFiles/darray_no_async.dir/link.txt --verbose=1
[ 83%] Building Fortran object src/flib/CMakeFiles/piof.dir/pio_support.F90.o
[ 83%] Building Fortran object src/flib/CMakeFiles/piof.dir/pio_types.F90.o
cd /project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/build/src/flib && /project/esmf/rocky/ufs/hpc-stack/install-gnu/bin/mpifort -DCPRGNU -DDARWIN -DLOGGING -DNETCDF_C_LOGGING_ENABLED -D_NETCDF -D_NOPNETCDF -I/project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/build -I/project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/src/flib -I/project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/build/src/flib -I/usr/local/include -I/project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/src/clib -fPIC -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk -ffree-line-length-none -c /project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/src/flib/pio_support.F90 -o CMakeFiles/piof.dir/pio_support.F90.o
cd /project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/build/src/flib && /project/esmf/rocky/ufs/hpc-stack/install-gnu/bin/mpifort -DCPRGNU -DDARWIN -DLOGGING -DNETCDF_C_LOGGING_ENABLED -D_NETCDF -D_NOPNETCDF -I/project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/build -I/project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/src/flib -I/project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/build/src/flib -I/usr/local/include -I/project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/src/clib -fPIC -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk -ffree-line-length-none -c /project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/src/flib/pio_types.F90 -o CMakeFiles/piof.dir/pio_types.F90.o
/project/esmf/rocky/ufs/hpc-stack/install-gnu/bin/mpicc -fPIC -std=c99 -g -O0 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk -Wl,-search_paths_first -Wl,-headerpad_max_install_names -L/project/esmf/rocky/ufs/hpc-stack/install-gnu/lib -lz -ldl -lm CMakeFiles/example1.dir/example1.c.o -o example1 ../../src/clib/libpioc.a /usr/local/lib/libnetcdf.dylib
/project/esmf/rocky/ufs/hpc-stack/install-gnu/bin/mpicc -fPIC -std=c99 -g -O0 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk -Wl,-search_paths_first -Wl,-headerpad_max_install_names -L/project/esmf/rocky/ufs/hpc-stack/install-gnu/lib -lz -ldl -lm CMakeFiles/darray_no_async.dir/darray_no_async.c.o -o darray_no_async ../../src/clib/libpioc.a /usr/local/lib/libnetcdf.dylib
Undefined symbols for architecture x86_64:
"_MPI_Comm_f2c", referenced from:
_PIOc_Init_Intracomm_from_F90 in libpioc.a(pioc.c.o)
_PIOc_readmap_from_f90 in libpioc.a(pioc_support.c.o)
_PIOc_writemap_from_f90 in libpioc.a(pioc_support.c.o)
"_ompi_mpi_byte", referenced from:
_find_mpi_type in libpioc.a(pioc_support.c.o)
_PIOc_put_att_tc in libpioc.a(pio_getput_int.c.o)
_PIOc_get_att_tc in libpioc.a(pio_getput_int.c.o)
_PIOc_get_vars_tc in libpioc.a(pio_getput_int.c.o)
_PIOc_put_vars_tc in libpioc.a(pio_getput_int.c.o)
_att_put_handler in libpioc.a(pio_msg.c.o)
_put_vars_handler in libpioc.a(pio_msg.c.o)
...
"_ompi_mpi_char", referenced from:
_PIOc_set_iosystem_error_handling in libpioc.a(pioc.c.o)
_PIOc_InitDecomp in libpioc.a(pioc.c.o)
_PIOc_deletefile in libpioc.a(pio_file.c.o)
_PIOc_inq in libpioc.a(pio_nc.c.o)
_PIOc_inq_unlimdims in libpioc.a(pio_nc.c.o)
_PIOc_inq_type in libpioc.a(pio_nc.c.o)
_PIOc_inq_format in libpioc.a(pio_nc.c.o)
...
"_ompi_mpi_comm_null", referenced from:
_PIOc_iosystem_is_active in libpioc.a(pioc.c.o)
_PIOc_Init_Intracomm in libpioc.a(pioc.c.o)
_PIOc_free_iosystem in libpioc.a(pioc.c.o)
_PIOc_init_async in libpioc.a(pioc.c.o)
"_ompi_mpi_comm_world", referenced from:
_main in darray_no_async.c.o
_piodie in libpioc.a(pioc_support.c.o)
_pio_err in libpioc.a(pioc_support.c.o)
"_ompi_mpi_datatype_null", referenced from:
_PIOc_def_var in libpioc.a(pio_nc.c.o)
_malloc_iodesc in libpioc.a(pioc_support.c.o)
_inq_file_metadata in libpioc.a(pioc_support.c.o)
"_ompi_mpi_double", referenced from:
_find_mpi_type in libpioc.a(pioc_support.c.o)
"_ompi_mpi_errors_return", referenced from:
_main in darray_no_async.c.o
"_ompi_mpi_float", referenced from:
_find_mpi_type in libpioc.a(pioc_support.c.o)
_set_var_chunk_cache_handler in libpioc.a(pio_msg.c.o)
_set_chunk_cache_handler in libpioc.a(pio_msg.c.o)
_PIOc_set_chunk_cache in libpioc.a(pio_nc4.c.o)
_PIOc_get_chunk_cache in libpioc.a(pio_nc4.c.o)
_PIOc_set_var_chunk_cache in libpioc.a(pio_nc4.c.o)
_PIOc_get_var_chunk_cache in libpioc.a(pio_nc4.c.o)
...
"_ompi_mpi_group_null", referenced from:
_PIOc_Init_Intracomm in libpioc.a(pioc.c.o)
"_ompi_mpi_info_null", referenced from:
_PIOc_Init_Intracomm in libpioc.a(pioc.c.o)
_PIOc_set_hint in libpioc.a(pioc.c.o)
_PIOc_free_iosystem in libpioc.a(pioc.c.o)
_PIOc_init_async in libpioc.a(pioc.c.o)
"_ompi_mpi_int", referenced from:
_PIOc_advanceframe in libpioc.a(pioc.c.o)
_PIOc_setframe in libpioc.a(pioc.c.o)
_PIOc_set_iosystem_error_handling in libpioc.a(pioc.c.o)
_PIOc_InitDecomp in libpioc.a(pioc.c.o)
_PIOc_free_iosystem in libpioc.a(pioc.c.o)
_PIOc_closefile in libpioc.a(pio_file.c.o)
_PIOc_deletefile in libpioc.a(pio_file.c.o)
...
"_ompi_mpi_long", referenced from:
_cn_buffer_report in libpioc.a(pio_darray_int.c.o)
"_ompi_mpi_offset", referenced from:
_PIOc_InitDecomp in libpioc.a(pioc.c.o)
_PIOc_inq_type in libpioc.a(pio_nc.c.o)
_PIOc_inq_dim in libpioc.a(pio_nc.c.o)
_PIOc_inq_att_eh in libpioc.a(pio_nc.c.o)
_PIOc_def_var in libpioc.a(pio_nc.c.o)
_PIOc_def_var_fill in libpioc.a(pio_nc.c.o)
_PIOc_inq_var_fill in libpioc.a(pio_nc.c.o)
...
"_ompi_mpi_op_max", referenced from:
_PIOc_write_nc_decomp in libpioc.a(pioc_support.c.o)
_PIOc_write_darray in libpioc.a(pio_darray.c.o)
_compute_maxIObuffersize in libpioc.a(pio_rearrange.c.o)
_subset_rearrange_create in libpioc.a(pio_rearrange.c.o)
_cn_buffer_report in libpioc.a(pio_darray_int.c.o)
"_ompi_mpi_op_min", referenced from:
_check_netcdf2 in libpioc.a(pioc_support.c.o)
_cn_buffer_report in libpioc.a(pio_darray_int.c.o)
_compute_maxaggregate_bytes in libpioc.a(pio_darray_int.c.o)
"_ompi_mpi_op_sum", referenced from:
_determine_fill in libpioc.a(pio_rearrange.c.o)
"_ompi_mpi_short", referenced from:
_find_mpi_type in libpioc.a(pioc_support.c.o)
"_ompi_request_null", referenced from:
_pio_swapm in libpioc.a(pio_spmd.c.o)
ld: symbol(s) not found for architecture x86_64
collect2: error: ld returned 1 exit status
make[2]: *** [examples/c/darray_no_async] Error 1
make[2]: *** [examples/c/example1] Error 1
make[1]: *** [examples/c/CMakeFiles/example1.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
make[1]: *** [examples/c/CMakeFiles/darray_no_async.dir/all] Error 2
/Volumes/esmf/rocky/ufs/cmake-3.19.1-Darwin-x86_64/CMake.app/Contents/bin/cmake -E touch src/flib/CMakeFiles/piof.dir/pio_kinds.F90.o.provides.build
@jedwards4b do you have any insights?
Hey Rocky,
I just built with GCC 8.4 and it worked, mostly.
I had to disable ECKIT, FCKIT, and ATLAS, which are unimportant. Don't know why you ran into PIO issues.
Here's my config_mac.sh
#!/bin/bash
# Compiler/MPI combination
export HPC_COMPILER="gnu/8.4.0"
export HPC_MPI="mpich/3.3.1"
# Build options
export USE_SUDO=N
export PKGDIR=pkg
export LOGDIR=log
export OVERWRITE=Y
export NTHREADS=20
export MAKE_CHECK=N
export MAKE_VERBOSE=N
export MAKE_CLEAN=N
export DOWNLOAD_ONLY=N
export STACK_EXIT_ON_FAIL=Y
export WGET="wget -nv"
export SERIAL_FC=gfortran-mp-8
Then, I ran ../build_stack.sh -c config/config_mac.sh -p /Users/KIG/Desktop/hpc-stack/install -y config/stack_mac.yaml
Looks like you're building with OpenMPI? I used MPICH and built it from source using hpc-stack.
Those do appear to be OpenMPI symbols. However, I asked the hpc-stack to build mpich itself - and it appears that it did, and put the binaries (e.g., mpicxx) under install-gnu/bin and those appear to be used in the compile of PIO. Not sure where/why it is picking up OpenMPI...
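A quick way to confirm which MPI flavor the missing symbols belong to is to scan the linker output for the ompi_ prefix that Open MPI uses for its predefined handles. The sketch below does only that; the function name mpi_flavor_from_symbols is hypothetical. The commented commands are the usual ways to inspect a wrapper or a library, and the guess that -I/usr/local/include or /usr/local/lib/libnetcdf.dylib (both visible in the compile and link lines above) is where Open MPI sneaks in is an assumption, not something verified in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read linker output on stdin and report "openmpi"
# if any Open MPI style symbol (containing "_ompi_") is mentioned.
mpi_flavor_from_symbols() {
  if grep -q '_ompi_'; then
    echo "openmpi"
  else
    echo "unknown"
  fi
}

# Related checks (commented out; they need the tools to be installed):
#   mpicc -show                                # MPICH wrapper: print underlying compile line
#   mpicc --showme                             # Open MPI wrapper: same idea
#   otool -L /usr/local/lib/libnetcdf.dylib    # macOS: list linked libraries
```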
I am curious about what this -isysroot is doing here:
/project/esmf/rocky/ufs/hpc-stack/install-gnu/bin/mpicc -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk
You don't need PIO at all, just remove it from the stack_mac.yaml file (or set build to NO). Also remove everything starting from boost to the end of the file.
I got more information as well. The model run crashes in the import state, doesn't even get to the export state.
@climbfuji okay, the stack built with PIO and the other libs removed. I'll try the model build next. Do you have a run directory handy?
I can give you one, certainly. But I doubt the model will build with gnu 8.x.y - I remember that we had to make it a requirement to use gnu 9.m.n because of some Fortran 2008 features in the code that gnu 8 does not (!) support. But you can remove the guard if you come across it and see if it does build with your particular version.
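If you want to check a compiler against that requirement before starting a long build, a small major-version test is enough. This is a sketch only: gnu_major_at_least is a hypothetical helper and is not the guard used in the ufs-weather-model build system.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the major part of a version string
# (e.g. "9.3.0" or just "9") is at least the given minimum.
gnu_major_at_least() {
  local version="$1" min="$2"
  local major="${version%%.*}"   # strip everything from the first dot on
  [ "$major" -ge "$min" ]
}

# Usage (commented out; requires gfortran on PATH):
#   gnu_major_at_least "$(gfortran -dumpversion)" 9 || echo "gfortran too old"
```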
I don't have gcc9 on our server, but I do have gcc10. Is 10 expected to work?
That's a good one ;-) It will compile the stack and the model. The model will (would, better to say) crash at some point in the physics, but given that you won't get there it may do the job.
I'll try one other thing today, and that is using the native clang + gfortran instead of LLVM clang + gfortran. Maybe I can get a stack trace this way.
To answer @rsdunlapiv regarding the pio internal to ESMF - this pio version is older than openmpi and cannot be expected to support that library - what is the timeline to bring a modern version of pio into the esmf library?
@jedwards4b this is an externally built PIO, version 2.5.1. Since I don't need it for the immediate problem (the weather model is not using PIO in this config), there is no need to look into this right now.
Thanks for the clarification - so the problem is that the build is mixing openmpi and mpich libraries somehow.
@climbfuji you are right about gnu 8.x not being supported by the ufs-weather-model. I am setting up gcc@9 and rebuilding the stack and model.
@climbfuji I now have an hpc-stack and ufs_model executable for gcc9.3. Can you please let me know where that run directory is located?
Uploading one for you to Cheyenne, will let you know when it's up there. It's a fully self-contained test case running GFS v16beta using 6 MPI tasks (one per tile). If your machine has 16GB of memory, you can easily run this. It runs to completion with bs21 on my Mac. In that directory, all you need to do is edit run_macosx.sh and set the path
FV3_BUILD_DIR=/Users/dom.heinzeller/scratch/ufs-weather-model/ufs-weather-model-timestep-init-finalize/llvm
to point to your top-level ufs-weather-model directory (so that $BUILD_DIR/tests/fv3.exe can be found).
The script doesn't set any environment variables, so make sure that any PATH, LD_LIBRARY_PATH etc. environment variables are set correctly in your shell (if applicable).
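Before launching the run it may help to verify that the executable is where the script expects it. A minimal sketch follows, assuming the tests/fv3.exe layout described above; check_fv3_exe is a hypothetical helper and not part of run_macosx.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check that <build_dir>/tests/fv3.exe exists and is
# executable. Prints the outcome and returns non-zero when it is missing.
check_fv3_exe() {
  local build_dir="$1"
  local exe="$build_dir/tests/fv3.exe"
  if [ -x "$exe" ]; then
    echo "found: $exe"
  else
    echo "missing: $exe"
    return 1
  fi
}

# Usage: check_fv3_exe /path/to/ufs-weather-model
```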
@kgerheiser @aerorahul FYI, you wanted something like that, too.
Here it is: /glade/work/heinzell/rundir_fv3_ccpp_gfsv16beta_20201203
We don't have access to Cheyenne :)
Us lowly humans are stuck on NOAA machines.
I would love the regression test framework to be able to pull down that configured run directory from cloud storage anywhere with internet access. We are living in the future, so I don't see why that would be so hard. ;)
It's coming, actually. I do have the s3 bucket set up and ready to go, just need time to implement something like that.
I consider you as the privileged ones having access to wcoss! Here you go:
/scratch1/BMC/gmtb/Dom.Heinzeller/rundir_fv3_ccpp_gfsv16beta_20201203
I used ufs-weather-model CMake followed by make install. I do not have a $BUILD_DIR/tests/fv3.exe but I do have a $BUILD_DIR/install/bin/ufs_model. Is that okay or should I build this a different way?
Yes, should do. The question is whether you have the correct suite compiled into the executable. Compile like this:
cd tests
./compile.sh macosx.gnu 'CCPP=Y DEBUG=Y' '' NO NO 2>&1 | tee compile.log
That is giving me a CMake error:
cgdm-catania:tests dunlap$ pwd
/project/esmf/rocky/ufs/ufs-weather-model/tests
cgdm-catania:tests dunlap$ ./compile.sh macosx.gnu 'CCPP=Y DEBUG=Y' '' NO NO 2>&1 | tee compile.log
+ SECONDS=0
++ uname -s
+ [[ Darwin == Darwin ]]
++++ greadlink -f -n ./compile.sh
./compile.sh: line 16: greadlink: command not found
+++ dirname ''
++ cd .
++ pwd -P
+ readonly MYDIR=/Volumes/esmf/rocky/ufs/ufs-weather-model/tests
+ MYDIR=/Volumes/esmf/rocky/ufs/ufs-weather-model/tests
+ readonly ARGC=5
+ ARGC=5
+ [[ 5 -lt 2 ]]
+ MACHINE_ID=macosx.gnu
+ MAKE_OPT='CCPP=Y DEBUG=Y'
+ BUILD_NAME=fv3
+ clean_before=NO
+ clean_after=NO
++ cd /Volumes/esmf/rocky/ufs/ufs-weather-model/tests/..
++ pwd
+ PATHTR=/Volumes/esmf/rocky/ufs/ufs-weather-model
++ pwd
+ BUILD_DIR=/project/esmf/rocky/ufs/ufs-weather-model/tests/build_fv3
+ [[ macosx.gnu == cheyenne.* ]]
+ [[ macosx.gnu == wcoss_dell_p3 ]]
+ BUILD_JOBS=8
+ hostname
cgdm-catania
+ set +x
Setting environment variables for NEMSfv3gfs on MACOSX with gcc/gfortran or clang/gfortran
+ echo 'Compiling CCPP=Y DEBUG=Y into fv3.exe on macosx.gnu'
Compiling CCPP=Y DEBUG=Y into fv3.exe on macosx.gnu
+ CMAKE_FLAGS=
+ [[ CCPP=Y DEBUG=Y == *\D\E\B\U\G\=\Y* ]]
+ CMAKE_FLAGS=' -DDEBUG=Y'
+ [[ CCPP=Y DEBUG=Y == *\3\2\B\I\T\=\Y* ]]
+ [[ CCPP=Y DEBUG=Y == *\O\P\E\N\M\P\=\N* ]]
+ [[ CCPP=Y DEBUG=Y == *\M\U\L\T\I\_\G\A\S\E\S\=\Y* ]]
+ CMAKE_FLAGS=' -DDEBUG=Y -DMULTI_GASES=OFF'
+ [[ CCPP=Y DEBUG=Y == *\C\C\P\P\=\Y* ]]
+ mkdir -p /Volumes/esmf/rocky/ufs/ufs-weather-model/FV3/ccpp/include
+ mkdir -p /Volumes/esmf/rocky/ufs/ufs-weather-model/FMS/fms2_io/include
+ CMAKE_FLAGS=' -DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON'
+ [[ CCPP=Y DEBUG=Y == *\D\E\B\U\G\=\Y* ]]
+ CMAKE_FLAGS=' -DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug'
+ [[ CCPP=Y DEBUG=Y == *\3\2\B\I\T\=\Y* ]]
+ CMAKE_FLAGS=' -DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
+ set +ex
+ [[ CCPP=Y DEBUG=Y == *\W\W\3\=\Y* ]]
+ [[ CCPP=Y DEBUG=Y == *\S\2\S\=\Y* ]]
+ [[ CCPP=Y DEBUG=Y == *\D\A\T\M\=\Y* ]]
++ trim ' -DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
++ local 'var= -DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
++ var='-DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
++ var='-DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
++ echo -n '-DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
+ CMAKE_FLAGS='-DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
+ '[' NO = YES ']'
+ export BUILD_VERBOSE=1
+ BUILD_VERBOSE=1
+ export BUILD_DIR
+ export BUILD_JOBS
+ export CCPP_SUITES
+ export CMAKE_FLAGS
+ bash -x /Volumes/esmf/rocky/ufs/ufs-weather-model/build.sh
+ set -eu
++ uname -s
+ [[ Darwin == Darwin ]]
++++ greadlink -f -n /Volumes/esmf/rocky/ufs/ufs-weather-model/build.sh
/Volumes/esmf/rocky/ufs/ufs-weather-model/build.sh: line 5: greadlink: command not found
+++ dirname ''
++ cd .
++ pwd -P
+ readonly UFS_MODEL_DIR=/Volumes/esmf/rocky/ufs/ufs-weather-model/tests
+ UFS_MODEL_DIR=/Volumes/esmf/rocky/ufs/ufs-weather-model/tests
+ export CMAKE_C_COMPILER=mpicc
+ CMAKE_C_COMPILER=mpicc
+ export CMAKE_CXX_COMPILER=mpicxx
+ CMAKE_CXX_COMPILER=mpicxx
+ export CMAKE_Fortran_COMPILER=mpifort
+ CMAKE_Fortran_COMPILER=mpifort
+ export NETCDF=/project/esmf/rocky/ufs/hpc-stack/install-gnu9
+ NETCDF=/project/esmf/rocky/ufs/hpc-stack/install-gnu9
+ export ESMFMKFILE=/project/esmf/rocky/ufs/hpc-stack/install-gnu9/lib/esmf.mk
+ ESMFMKFILE=/project/esmf/rocky/ufs/hpc-stack/install-gnu9/lib/esmf.mk
+ BUILD_DIR=/project/esmf/rocky/ufs/ufs-weather-model/tests/build_fv3
+ mkdir -p /project/esmf/rocky/ufs/ufs-weather-model/tests/build_fv3
+ [[ -n '' ]]
+ CMAKE_FLAGS+=' -DNETCDF_DIR=/project/esmf/rocky/ufs/hpc-stack/install-gnu9'
+ cd /project/esmf/rocky/ufs/ufs-weather-model/tests/build_fv3
+ cmake /Volumes/esmf/rocky/ufs/ufs-weather-model/tests -DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF -DNETCDF_DIR=/project/esmf/rocky/ufs/hpc-stack/install-gnu9
CMake Error: The source directory "/project/esmf/rocky/ufs/ufs-weather-model/tests" does not appear to contain CMakeLists.txt.
This is the error:
/Volumes/esmf/rocky/ufs/ufs-weather-model/build.sh: line 5: greadlink: command not found
I believe this can be resolved by
brew install coreutils
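The scripts call greadlink, which is the name Homebrew's coreutils gives to GNU readlink. A script that has to run on both macOS and Linux could fall back to plain readlink when greadlink is absent. This is only a sketch: resolve_path is a hypothetical helper, not what compile.sh or build.sh actually do, and the fallback assumes a readlink that understands -f, which the stock readlink on macOS 10.14 does not.

```shell
#!/usr/bin/env bash
# Hypothetical helper: canonicalize a path with GNU readlink, preferring
# Homebrew's greadlink when it is installed.
resolve_path() {
  if command -v greadlink >/dev/null 2>&1; then
    greadlink -f -n "$1"
  else
    readlink -f -n "$1"   # GNU readlink on Linux
  fi
}

# Usage: MYDIR=$(cd "$(dirname "$(resolve_path "$0")")" && pwd -P)
```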
That worked. The model finished fine with ESMF8.1.0bs21 as expected. I switched to ESMF8.1.0bs27 where I expect to see the same failure as @climbfuji did.
@climbfuji unfortunately, my run with ESMF 8.1.0bs27 worked! From PET0.ESMF_LogFile:
20201203 144112.539 INFO PET0 Running with ESMF Version : ESMF_8_1_0_beta_snapshot_27
20201203 144112.539 INFO PET0 ESMF library build date/time: "Dec 3 2020" "14:14:39"
20201203 144112.539 INFO PET0 ESMF library build location : /project/esmf/rocky/ufs/hpc-stack/pkg/ESMF_8_1_0_beta_snapshot_27
It terminated normally:
[0] ENDING DATE-TIME DEC 03,2020 14:47:58.918 338 THU 2459187
[0] PROGRAM nems HAS ENDED.
[0] * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
[0] *****************RESOURCE STATISTICS*******************************
[0] The total amount of wall time = 0.000000
[0] The total amount of time in user mode = 389.377709
[0] The total amount of time in sys mode = 11.639710
[0] *****************END OF RESOURCE STATISTICS*************************
[0]
This is with gnu/9.3.0 and mpich/3.3.1 with ESMF 8.1.0bs27 built in debug mode. I think you were using gnu/9.2.0, but I'm not convinced that that is the most likely difference. Not sure what to do next - should I try a later version of ESMF?
Or do we need to look at something else OS-level?
I'll give it a try and see what I get
Thanks for testing this Rocky. Too bad. I'll go and try my other laptop next.
What macOS version is yours?
gdm-catania:rundir_fv3_ccpp_gfsv16beta_20201203 dunlap$ system_profiler SPSoftwareDataType
Software:
System Software Overview:
System Version: macOS 10.14.6 (18G103)
Kernel Version: Darwin 18.7.0
Since I'm already set up, I'll try the latest ESMF snapshot as well to see if anything comes up....
Mine is almost the same
System Software Overview:
System Version: macOS 10.14.6 (18G4032)
Kernel Version: Darwin 18.7.0
And I had tried the native AppleClang 10.0.1 with gfortran 9.2.0, LLVM Clang 9.0.0 with gfortran 9.2.0, and GNU gcc 9.2.0 with gfortran 9.2.0.
Since I can't get a stacktrace on my mac, is there a way to crank up the verbosity of ESMF, similar to what I found in MAIN_NEMS.F90 (remove the ! in front of all CALL ESMF_LogWrite calls)? Or some debug flag/parameter that can be turned on?
@climbfuji I don't think there is anything already in ESMF that would produce extra output within ESMF_StateCreate, which seems to be where things are dying. Maybe @theurich has a suggestion? One option would be for us to provide a branch of ESMF instrumented with some debug messages deeper into ESMF_StateCreate to see if we could track it down.
Another option would be to see if you can get one of the processors into a debugger. Since it is failing SO early, you might even be able to just run it on ONE process in GDB and it might expose this error before it complains about not running on enough PETs.
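The single-process GDB idea could look roughly like the sketch below. It assumes gdb is installed and allowed to control other processes (on macOS it usually has to be code-signed first); gdb_catch_throw is a hypothetical helper name. The catch throw command asks GDB to stop at the point where a C++ exception is raised, so the backtrace should show what led to the std::out_of_range reported earlier in the thread.

```shell
# Hypothetical helper: run a program under GDB in batch mode, stop where a
# C++ exception is thrown, and print a backtrace from that point.
gdb_catch_throw() {
  gdb -batch -ex "catch throw" -ex "run" -ex "bt" --args "$@"
}
```

From the run directory this would be invoked as gdb_catch_throw ./fv3.exe, without mpiexec, so that only one process runs.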
@rsdunlapiv adding print statements is tedious, but works ... I know by now that it fails in this block of code (between debug statement "E" and "F"):
! DH*
CALL ESMF_LogWrite("DH DEBUG inside actual flag E",ESMF_LOGMSG_INFO,rc=RC)
! *DH
if (present(nestedStateList)) then
do i=1,size(nestedStateList)
ESMF_INIT_CHECK_DEEP(ESMF_StateGetInit,nestedStateList(i),rc)
enddo
endif
! DH*
CALL ESMF_LogWrite("DH DEBUG inside actual flag F",ESMF_LOGMSG_INFO,rc=RC)
! *DH
Next is to check what is in this nestedStateList ...
I finally got around to running it with ESMF beta 21 and 27. GCC-9, macOS 11.0, mpich 3.3.1 (built with hpc-stack).
They both fail, but I think it's unrelated to ESMF, and it gets farther than you.
FATAL from PE 0: MPP_OPEN: error in OPEN for RESTART/file.
FATAL from PE 0: MPP_OPEN: error in OPEN for RESTART/file.
Looks like it's looking for a file in RESTART which is empty. Is there some option I need to change?
I just didn't have a RESTART folder at all. Making an empty one fixed it.
It works with ESMF beta 27 and 21.
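The RESTART fix amounts to making sure the directory exists before the model starts. A small sketch, where ensure_restart_dir is a hypothetical helper name and the run directory defaults to the current one:

```shell
# Hypothetical helper: create an empty RESTART directory inside the run
# directory given as $1 (default: current directory) if it is missing.
ensure_restart_dir() {
  mkdir -p "${1:-.}/RESTART"
}
```

Calling ensure_restart_dir before mpiexec is harmless when the directory is already there.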
No, the case that I gave you runs out of the box.
I had no RESTART folder. So, I just created an empty one and that fixed it.
Thanks for figuring that out. I am still debugging bs27 on my (weird ?) Mac ... if it hadn't been working just fine for two years up to bs21, I wouldn't be that worried.
Alright, adding more print statements around that present(nestedStateList) test makes the code pass the Create the NEMS Import State, but then it crashes around line endif ! - actualFlag in Create the NEMS Export State.
This doesn't make any sense. We are either facing a memory corruption (in ESMF) or a bug in the compiler or one of the libraries that gets used. Will try a brew update next.
@climbfuji I agree it sounds like a memory corruption. It will be interesting to see if @kgerheiser can get it to run. Not sure if we could also get one other test on another mac machine to see how isolated the issue is.
@rsdunlapiv I was able to run it without issue
Since I can't get a stack trace on macOS (and even if I could it may not be helpful), we could try valgrind if I get it running on macOS (or valgrind or DDT on cheyenne with GNU 9.3, for example). Since it crashes so early on, we should have a fair chance to detect a possible memory corruption.
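A valgrind run under MPI might look like the following sketch. It assumes valgrind is available on the machine in question (whether it works on this macOS version is not confirmed here) and that mpiexec accepts -n as in the runs shown earlier; valgrind_mpi is a hypothetical helper name. The %p in the log file name is expanded by valgrind to the process ID, so each rank writes its own log.

```shell
# Hypothetical helper: run an MPI program with every rank under valgrind.
# $1 is the number of ranks; the remaining arguments are the program and
# its arguments.
valgrind_mpi() {
  np=$1
  shift
  mpiexec -n "$np" valgrind --track-origins=yes --error-limit=no \
    --log-file=valgrind.%p.log "$@"
}
```

For the case in this thread that would be valgrind_mpi 6 ./fv3.exe; because the crash happens so early, the first reports in each log are the ones to look at.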
Did you try running the whole thing on just one process through GDB?
Description
Updating to ESMF 8.1.0 beta snapshot 27 or higher leads to model crashes on macOS, right at the beginning:
ESMF 8.1.0 beta snapshot 21 works just fine. I use the same compile options for bs 21, 27, 38 and tested both in optimized mode (on macOS this is ESMF_BOPT=O and ESMF_OPTLEVEL="0", any value higher than 0 has always led to crashes) and in debug mode (ESMF_BOPT=g and ESMF_OPTLEVEL="0").
I am in contact with the ESMF developers to identify the source of this problem. Without a solution, we cannot update to beta snapshot 38 and merge https://github.com/NOAA-EMC/fv3atm/pull/180.
I understand the ESMF group runs their ESMF tests on macOS routinely without issues. Since the problem occurs with the ufs-weather-model, I am raising the issue here, although it impacts a number of GitHub repositories such as hpc-stack and esmf.
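For reference, the debug-mode settings from the description can be written as environment variables set before building ESMF. This is a sketch of those two settings only; any other ESMF build variables (compiler, MPI flavor, install paths) are not covered here.

```shell
# Debug-mode ESMF build settings as given in the description.
# For the optimized variant described there, use ESMF_BOPT=O and keep
# ESMF_OPTLEVEL at 0.
export ESMF_BOPT=g
export ESMF_OPTLEVEL="0"
```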