ufs-community / ufs-weather-model

UFS Weather Model

Updating to ESMF 8.1.0 beta snapshot 27 or higher crashes on macOS #303

Closed climbfuji closed 3 years ago

climbfuji commented 3 years ago

Description

Updating to ESMF 8.1.0 beta snapshot 27 or higher leads to model crashes on macOS, right at startup:

+ mpiexec.hydra -prepend-rank -n 6 ./fv3.exe
[0]
[0]
[0] * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
[0]      PROGRAM nems      HAS BEGUN. COMPILED       0.00     ORG: np23
[0]      STARTING DATE-TIME  NOV 26,2020  20:14:35.294  331  THU   2459180
[0]
[0]
[0]
[0] Program received signal SIGABRT: Process abort signal.
[0]
[0] Backtrace for this error:
[2]
[2] Program received signal SIGABRT: Process abort signal.
[2]
[2] Backtrace for this error:
[5]
[5] Program received signal SIGABRT: Process abort signal.
[5]
[5] Backtrace for this error:
[1]
[1] Program received signal SIGABRT: Process abort signal.
[1]
[1] Backtrace for this error:
[4]
[4] Program received signal SIGABRT: Process abort signal.
[4]
[4] Backtrace for this error:
[3]
[3] Program received signal SIGABRT: Process abort signal.
[3]
[3] Backtrace for this error:
[0] #0  0x120ff6f3d
[0] #1  0x120ff634d
[0] #2  0x7fff6230eb5c
[1] #0  0x115bdbf3d
[1] #1  0x115bdb34d
[1] #2  0x7fff6230eb5c
[2] #0  0x118ff7f3d
[2] #1  0x118ff734d
[2] #2  0x7fff6230eb5c
[3] #0  0x11d7aff3d
[3] #1  0x11d7af34d
[3] #2  0x7fff6230eb5c
[4] #0  0x115f62f3d
[4] #1  0x115f6234d
[4] #2  0x7fff6230eb5c
[5] #0  0x1160bff3d
[5] #1  0x1160bf34d
[5] #2  0x7fff6230eb5c

ESMF 8.1.0 beta snapshot 21 works just fine. I use the same compile options for beta snapshots (bs) 21, 27 and 38, and I tested both optimized mode (on macOS this is ESMF_BOPT=O and ESMF_OPTLEVEL="0"; any value higher than 0 has always led to crashes) and debug mode (ESMF_BOPT=g and ESMF_OPTLEVEL="0").
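The two build modes described above differ only in ESMF_BOPT. As a minimal sketch, the helper below prints the settings for either mode. It uses only the two variables named in this comment; the function name `esmf_build_settings` is illustrative, and a real ESMF build needs further settings (compiler, MPI flavor, install paths) that are not shown here.

```shell
#!/bin/sh
# Print the ESMF build settings named in the comment above for "opt" or "debug" mode.
# Only ESMF_BOPT and ESMF_OPTLEVEL are covered; everything else is out of scope.
esmf_build_settings() {
    case "$1" in
        opt)   echo "export ESMF_BOPT=O"; echo "export ESMF_OPTLEVEL=0" ;;
        debug) echo "export ESMF_BOPT=g"; echo "export ESMF_OPTLEVEL=0" ;;
        *)     echo "usage: esmf_build_settings opt|debug" >&2; return 1 ;;
    esac
}

# Apply the debug settings to the current shell and show them.
eval "$(esmf_build_settings debug)"
echo "ESMF_BOPT=$ESMF_BOPT ESMF_OPTLEVEL=$ESMF_OPTLEVEL"
```

Keeping the optimization level at 0 in both modes mirrors the observation above that any higher value has led to crashes on macOS.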

I am in contact with the ESMF developers to identify the source of this problem. Without a solution, we cannot update to beta snapshot 38 and merge https://github.com/NOAA-EMC/fv3atm/pull/180.

I understand the ESMF group runs their ESMF tests on macOS routinely without issues. Since the problem occurs with the ufs-weather-model, I am raising the issue here, although it impacts a number of GitHub repositories such as hpc-stack and esmf.

rsdunlapiv commented 3 years ago

@climbfuji can you confirm the compiler and MPI used for the hpc-stack in this case? I tried to install the hpc-stack on our macOS server, but it failed with Intel. Before fighting that battle, I'd rather reproduce your exact build of the stack as closely as possible.

For reference, the ESMF internal ticket number for this is 3615089.

climbfuji commented 3 years ago

I am using gcc+gfortran 9.2.0 installed via homebrew (roughly following the installation guide in lines 5-50 in https://github.com/NOAA-EMC/NCEPLIBS-external/blob/develop/doc/README_macos_gccgfortran.txt).

I haven't tried clang-9.0.0+gfortran-9.2.0 yet (similarly, following lines 5-58 in https://github.com/NOAA-EMC/NCEPLIBS-external/blob/develop/doc/README_macos_clanggfortran.txt).

MPI is mpich 3.3.1-3.3.2.

Both work with bs21 (and anything earlier than that back to 7.1.0r). Since I can't get a proper stack trace with gcc+gfortran, I'll try clang+gfortran next, and then the good old strategy to add print statements to see where it fails.

If you know of a tool for macOS that works similarly to addr2line on Linux, please share it with me. My Google search hasn't been successful so far.

climbfuji commented 3 years ago

Ok, with LLVM clang + GNU gfortran I get a little more information, maybe this is enough for the ESMF developers to see what is going on (unfortunately, still no information on source file or line numbers):

+ mpiexec.hydra -prepend-rank -n 6 ./fv3.exe
[0]
[0]
[0] * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
[0]      PROGRAM nems      HAS BEGUN. COMPILED       0.00     ORG: np23
[0]      STARTING DATE-TIME  NOV 30,2020  13:53:57.558  335  MON   2459184
[0]
[0]
[0] terminate called after throwing an instance of 'std::out_of_range'
[1] terminate called after throwing an instance of 'std::out_of_range'
[2] terminate called after throwing an instance of 'std::out_of_range'
[4] terminate called after throwing an instance of 'std::out_of_range'
[0]   what():  map::at:  key not found
[0]
[0] Program received signal SIGABRT: Process abort signal.
[0]
[0] Backtrace for this error:
[1]   what():  map::at:  key not found
[1]
[1] Program received signal SIGABRT: Process abort signal.
[1]
[1] Backtrace for this error:
[2]   what():  map::at:  key not found
[2]
[2] Program received signal SIGABRT: Process abort signal.
[2]
[2] Backtrace for this error:
[4]   what():  map::at:  key not found
[4]
[4] Program received signal SIGABRT: Process abort signal.
[4]
[4] Backtrace for this error:
[3] terminate called after throwing an instance of 'std::out_of_range'
[5] terminate called after throwing an instance of 'std::out_of_range'
[3]   what():  map::at:  key not found
[3]
[3] Program received signal SIGABRT: Process abort signal.
[3]
[3] Backtrace for this error:
[5]   what():  map::at:  key not found
[5]
[5] Program received signal SIGABRT: Process abort signal.
[5]
[5] Backtrace for this error:
[1] #0  0x11714cf3d
[1] #1  0x11714c34d
[1] #2  0x7fff6230eb5c
[2] #0  0x1192eff3d
[2] #1  0x1192ef34d
[2] #2  0x7fff6230eb5c
[3] #0  0x11fb77f3d
[3] #1  0x11fb7734d
[3] #2  0x7fff6230eb5c
[4] #0  0x11df0df3d
[4] #1  0x11df0d34d
[4] #2  0x7fff6230eb5c
[0] #0  0x118eb0f3d
[0] #1  0x118eb034d
[0] #2  0x7fff6230eb5c
[5] #0  0x1159f0f3d
[5] #1  0x1159f034d
[5] #2  0x7fff6230eb5c
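
The -prepend-rank output above interleaves all six ranks, which makes it hard to see whether every rank fails the same way. Here is a minimal sketch for pulling out the lines of a single rank, assuming the output was captured to a file. The file name `run.log` and the function name `lines_for_rank` are illustrative and not part of the model scripts.

```shell
#!/bin/sh
# Print only the lines that mpiexec.hydra -prepend-rank tagged with the given rank,
# with the "[N]" tag (and the single space after it) removed. Reads the log from stdin.
lines_for_rank() {
    rank="$1"
    awk -v tag="[$rank]" '
        index($0, tag) == 1 {
            line = substr($0, length(tag) + 1)
            sub(/^ /, "", line)   # drop the one separator space, keep other indentation
            print line
        }'
}

# Small demonstration on inline sample lines in the same format as the log above.
printf '[0] first\n[3] terminate called\n[1] other\n' | lines_for_rank 3

# With a captured log (hypothetical file name):
# lines_for_rank 3 < run.log
```

Running it once per rank and comparing the results shows quickly whether all ranks report the same exception.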
climbfuji commented 3 years ago

Here is some more information. I enabled the output of all MESSAGE_CHECK lines. The last message written is "Create the NEMS Import/Export States". Looking at NEMS/src/MAIN_NEMS.F90, the model fails between that message and the next one, "Execute the NEMS Component Initialize Step". This should narrow down our search.

      MESSAGE_CHECK="Create the NEMS Import/Export States"
     CALL ESMF_LogWrite(MESSAGE_CHECK,ESMF_LOGMSG_INFO,rc=RC)
! ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
!
      NEMS_IMP_STATE=ESMF_StateCreate(name='NEMS Import State'     &
                                     ,rc       =RC)
      ESMF_ERR_ABORT(RC)
!
      NEMS_EXP_STATE=ESMF_StateCreate(name='NEMS Export State'     &
                                     ,rc       =RC)
      ESMF_ERR_ABORT(RC)
!      
! ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
!
!-----------------------------------------------------------------------
!***  Execute the INITIALIZE step for the NEMS component.
!***  The Initialize routine that is called here as well as the
!***  Run and Finalize routines invoked below are those specified
!***  in the Register routine called in ESMF_GridCompSetServices above.
!-----------------------------------------------------------------------
!
! ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
      MESSAGE_CHECK="Execute the NEMS Component Initialize Step"
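
The narrowing-down step above relies on finding the last message each process logged before the abort. Below is a minimal sketch of that lookup. It assumes the messages written by ESMF_LogWrite end up in per-PET log files named `PET<n>.ESMF_LogFile` in the run directory (ESMF's usual naming, not confirmed in this thread). The function name `last_log_lines` is illustrative.

```shell
#!/bin/sh
# For every per-PET ESMF log file in a directory, print "<file>: <last line>".
# The PET*.ESMF_LogFile naming is an assumption; adjust the pattern if your logs differ.
last_log_lines() {
    dir="${1:-.}"
    for f in "$dir"/PET*.ESMF_LogFile; do
        [ -f "$f" ] || continue    # the glob matched nothing
        printf '%s: %s\n' "$(basename "$f")" "$(tail -n 1 "$f")"
    done
}

# Example: inspect the current run directory.
# last_log_lines .
```

If all files end with the same message, the failure is most likely in the code between that message and the next MESSAGE_CHECK.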
rsdunlapiv commented 3 years ago

@climbfuji thanks for helping to narrow this down.

I am trying to build the hpc-stack on our mac server. I am using gnu 8.4.0 and I get a failure building the PIO package (see below).

I am building with:

./build_stack.sh -p $PWD/install-gnu -c config/config_mac.sh -y config/stack_mac.yaml

The error:

[ 77%] Linking C executable darray_no_async
cd /project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/build/examples/c && /Volumes/esmf/rocky/ufs/cmake-3.19.1-Darwin-x86_64/CMake.app/Contents/bin/cmake -E cmake_link_script CMakeFiles/darray_no_async.dir/link.txt --verbose=1
[ 83%] Building Fortran object src/flib/CMakeFiles/piof.dir/pio_support.F90.o
[ 83%] Building Fortran object src/flib/CMakeFiles/piof.dir/pio_types.F90.o
cd /project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/build/src/flib && /project/esmf/rocky/ufs/hpc-stack/install-gnu/bin/mpifort -DCPRGNU -DDARWIN -DLOGGING -DNETCDF_C_LOGGING_ENABLED -D_NETCDF -D_NOPNETCDF -I/project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/build -I/project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/src/flib -I/project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/build/src/flib -I/usr/local/include -I/project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/src/clib -fPIC -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk -ffree-line-length-none -c /project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/src/flib/pio_support.F90 -o CMakeFiles/piof.dir/pio_support.F90.o
cd /project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/build/src/flib && /project/esmf/rocky/ufs/hpc-stack/install-gnu/bin/mpifort -DCPRGNU -DDARWIN -DLOGGING -DNETCDF_C_LOGGING_ENABLED -D_NETCDF -D_NOPNETCDF -I/project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/build -I/project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/src/flib -I/project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/build/src/flib -I/usr/local/include -I/project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/src/clib -fPIC -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk -ffree-line-length-none -c /project/esmf/rocky/ufs/hpc-stack/pkg/pio-2.5.1/src/flib/pio_types.F90 -o CMakeFiles/piof.dir/pio_types.F90.o
/project/esmf/rocky/ufs/hpc-stack/install-gnu/bin/mpicc -fPIC -std=c99 -g -O0 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk -Wl,-search_paths_first -Wl,-headerpad_max_install_names -L/project/esmf/rocky/ufs/hpc-stack/install-gnu/lib  -lz -ldl -lm CMakeFiles/example1.dir/example1.c.o -o example1  ../../src/clib/libpioc.a /usr/local/lib/libnetcdf.dylib 
/project/esmf/rocky/ufs/hpc-stack/install-gnu/bin/mpicc -fPIC -std=c99 -g -O0 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk -Wl,-search_paths_first -Wl,-headerpad_max_install_names -L/project/esmf/rocky/ufs/hpc-stack/install-gnu/lib  -lz -ldl -lm CMakeFiles/darray_no_async.dir/darray_no_async.c.o -o darray_no_async  ../../src/clib/libpioc.a /usr/local/lib/libnetcdf.dylib 
Undefined symbols for architecture x86_64:
  "_MPI_Comm_f2c", referenced from:
      _PIOc_Init_Intracomm_from_F90 in libpioc.a(pioc.c.o)
      _PIOc_readmap_from_f90 in libpioc.a(pioc_support.c.o)
      _PIOc_writemap_from_f90 in libpioc.a(pioc_support.c.o)
  "_ompi_mpi_byte", referenced from:
      _find_mpi_type in libpioc.a(pioc_support.c.o)
      _PIOc_put_att_tc in libpioc.a(pio_getput_int.c.o)
      _PIOc_get_att_tc in libpioc.a(pio_getput_int.c.o)
      _PIOc_get_vars_tc in libpioc.a(pio_getput_int.c.o)
      _PIOc_put_vars_tc in libpioc.a(pio_getput_int.c.o)
      _att_put_handler in libpioc.a(pio_msg.c.o)
      _put_vars_handler in libpioc.a(pio_msg.c.o)
      ...
  [The output of the two parallel link steps (example1 and darray_no_async) is interleaved at this point. Both report undefined MPI symbols referenced from libpioc.a and the example objects. Symbols recoverable from the interleaved text: _MPI_Comm_f2c, _ompi_mpi_byte, _ompi_mpi_char, _ompi_mpi_comm_null, _ompi_mpi_comm_world, _ompi_mpi_datatype_null, _ompi_mpi_double, _ompi_mpi_errors_return, _ompi_mpi_float, _ompi_mpi_group_null, _ompi_mpi_info_null, _ompi_mpi_int, _ompi_mpi_long, _ompi_mpi_offset, _ompi_mpi_op_max, _ompi_mpi_op_min, _ompi_mpi_op_sum, _ompi_mpi_short, _ompi_request_null. The first link step ends with "ld: symbol(s) not found for architecture x86_64"; the output of the second continues below.]
  "_ompi_mpi_op_max", referenced from:
      _PIOc_write_nc_decomp in libpioc.a(pioc_support.c.o)
      _PIOc_write_darray in libpioc.a(pio_darray.c.o)
      _compute_maxIObuffersize in libpioc.a(pio_rearrange.c.o)
      _subset_rearrange_create in libpioc.a(pio_rearrange.c.o)
      _cn_buffer_report in libpioc.a(pio_darray_int.c.o)
  "_ompi_mpi_op_min", referenced from:
      _check_netcdf2 in libpioc.a(pioc_support.c.o)
      _cn_buffer_report in libpioc.a(pio_darray_int.c.o)
      _compute_maxaggregate_bytes in libpioc.a(pio_darray_int.c.o)
collect2: error: ld returned 1 exit status
  "_ompi_mpi_op_sum", referenced from:
      _determine_fill in libpioc.a(pio_rearrange.c.o)
  "_ompi_mpi_short", referenced from:
      _find_mpi_type in libpioc.a(pioc_support.c.o)
  "_ompi_request_null", referenced from:
      _pio_swapm in libpioc.a(pio_spmd.c.o)
ld: symbol(s) not found for architecture x86_64
collect2: error: ld returned 1 exit status
make[2]: *** [examples/c/darray_no_async] Error 1
make[2]: *** [examples/c/example1] Error 1
make[1]: *** [examples/c/CMakeFiles/example1.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
/Volumes/esmf/rocky/ufs/cmake-3.19.1-Darwin-x86_64/CMake.app/Contents/bin/cmake -E touch src/flib/CMakeFiles/piof.dir/pio_kinds.F90.o.provides.build
make[1]: *** [examples/c/CMakeFiles/darray_no_async.dir/all] Error 2

@jedwards4b do you have any insights?

kgerheiser commented 3 years ago

Hey Rocky,

I just built with GCC 8.4 and it worked, mostly.

I had to disable ECKIT, FCKIT, and ATLAS, which are unimportant. Don't know why you ran into PIO issues.

Here's my config_mac.sh

#!/bin/bash

# Compiler/MPI combination
export HPC_COMPILER="gnu/8.4.0"
export HPC_MPI="mpich/3.3.1"

# Build options
export USE_SUDO=N
export PKGDIR=pkg
export LOGDIR=log
export OVERWRITE=Y
export NTHREADS=20
export   MAKE_CHECK=N
export MAKE_VERBOSE=N
export   MAKE_CLEAN=N
export DOWNLOAD_ONLY=N
export STACK_EXIT_ON_FAIL=Y
export WGET="wget -nv"

export SERIAL_FC=gfortran-mp-8

Then, I ran ../build_stack.sh -c config/config_mac.sh -p /Users/KIG/Desktop/hpc-stack/install -y config/stack_mac.yaml

Looks like you're building with OpenMPI? I used MPICH and built it from source using hpc-stack.

rsdunlapiv commented 3 years ago

Those do appear to be OpenMPI symbols. However, I asked the hpc-stack to build mpich itself - and it appears that it did, and put the binaries (e.g., mpicxx) under install-gnu/bin and those appear to be used in the compile of PIO. Not sure where/why it is picking up OpenMPI...

rsdunlapiv commented 3 years ago

I am curious about what this -isysroot is doing here: /project/esmf/rocky/ufs/hpc-stack/install-gnu/bin/mpicc -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk

climbfuji commented 3 years ago

> I am curious about what this -isysroot is doing here: /project/esmf/rocky/ufs/hpc-stack/install-gnu/bin/mpicc -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk

You don't need PIO at all, just remove it from the stack_mac.yaml file (or set build to NO). Also remove everything starting from boost to the end of the file.

I got more information as well. The model run crashes in the import state, doesn't even get to the export state.

rsdunlapiv commented 3 years ago

@climbfuji okay, the stack built with PIO and the other libs removed. I'll try the model build next. Do you have a run directory handy?

climbfuji commented 3 years ago

I can give you one, certainly. But I doubt the model will build with gnu 8.x.y - I remember that we had to make it a requirement to use gnu 9.m.n because of some Fortran 2008 features in the code that gnu 8 does not (!) support. But you can remove the guard if you come across it and see if it does build with your particular version.

rsdunlapiv commented 3 years ago

I don't have gcc9 on our server, but I do have gcc10. Is 10 expected to work?

climbfuji commented 3 years ago

> I don't have gcc9 on our server, but I do have gcc10. Is 10 expected to work?

That's a good one ;-) It will compile the stack and the model. The model will (would, better to say) crash at some point in the physics, but given that you won't get there it may do the job.

I'll try one other thing today, and that is using the native clang + gfortran instead of LLVM clang + gfortran. Maybe I can get a stack trace this way.

jedwards4b commented 3 years ago

To answer @rsdunlapiv regarding the PIO internal to ESMF: this PIO version is older than OpenMPI and cannot be expected to support that library. What is the timeline to bring a modern version of PIO into the ESMF library?

rsdunlapiv commented 3 years ago

@jedwards4b this is an externally built PIO, version 2.5.1. Since I don't need it for the immediate problem (the weather model is not using PIO in this config), there is no need to look into this right now.

jedwards4b commented 3 years ago

Thanks for the clarification - so the problem is that the build is mixing openmpi and mpich libraries somehow.
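
One way to confirm such a mix-up from a build log is to look for Open MPI-specific symbol names. The undefined symbols in the linker output above carry an `_ompi_` prefix, which points to objects compiled against Open MPI's mpi.h. The sketch below extracts those names from a log read on stdin. The function names and the log file name in the usage comment are illustrative.

```shell
#!/bin/sh
# List the distinct Open MPI-style symbol names (prefix _ompi_) found in a log on stdin.
ompi_symbols() {
    grep -o '_ompi_[A-Za-z0-9_]*' | sort -u
}

# Report whether a log on stdin mentions Open MPI symbols.
# Returns 1 if any are found, so it can be used as a check in a build script.
check_mpi_mix() {
    syms=$(ompi_symbols)
    if [ -n "$syms" ]; then
        echo "Open MPI symbols found:"
        echo "$syms"
        return 1
    fi
    echo "no Open MPI symbols found"
}

# Example with a captured build log (hypothetical file name):
# check_mpi_mix < build_pio.log
```

If the check fires even though the MPICH wrappers are used for compiling, the include paths on the compile line (here -I/usr/local/include) are a natural place to look for a second mpi.h.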

rsdunlapiv commented 3 years ago

@climbfuji you are right about gnu 8.x not being supported by the ufs-weather-model. I am setting up gcc@9 and rebuilding the stack and model.

rsdunlapiv commented 3 years ago

@climbfuji I now have an hpc-stack and ufs_model executable for gcc9.3. Can you please let me know where that run directory is located?

climbfuji commented 3 years ago

> @climbfuji I now have an hpc-stack and ufs_model executable for gcc9.3. Can you please let me know where that run directory is located?

Uploading one for you to Cheyenne, will let you know when it's up there. It's a fully self-contained test case running GFS v16beta using 6 MPI tasks (one per tile). If your machine has 16GB of memory, you can easily run this. It runs to completion with bs21 on my Mac. In that directory, all you need to do is edit run_macosx.sh and set the path

FV3_BUILD_DIR=/Users/dom.heinzeller/scratch/ufs-weather-model/ufs-weather-model-timestep-init-finalize/llvm

to point to your top-level ufs-weather-model directory (so that $BUILD_DIR/tests/fv3.exe can be found).

The script doesn't set any environment variables, so make sure that any PATH, LD_LIBRARY_PATH etc. environment variables are set correctly in your shell (if applicable).
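
Before launching the run script, it can help to check that the executable is where the script expects it. This is a minimal sketch, assuming the layout described above (`<top-level directory>/tests/fv3.exe`). The function name `check_model_exe` is illustrative and not part of run_macosx.sh.

```shell
#!/bin/sh
# Return 0 and report the path if <build_dir>/tests/fv3.exe exists and is executable,
# otherwise print a message on stderr and return 1.
check_model_exe() {
    build_dir="$1"
    exe="$build_dir/tests/fv3.exe"
    if [ -x "$exe" ]; then
        echo "found $exe"
    else
        echo "missing or not executable: $exe" >&2
        return 1
    fi
}

# Example (replace the argument with your own top-level ufs-weather-model directory):
# check_model_exe "$HOME/ufs-weather-model"
```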

@kgerheiser @aerorahul FYI, you wanted something like that, too.

climbfuji commented 3 years ago

Here it is: /glade/work/heinzell/rundir_fv3_ccpp_gfsv16beta_20201203

kgerheiser commented 3 years ago

We don't have access to Cheyenne :)

aerorahul commented 3 years ago

us lowly humans are stuck to NOAA machines.

rsdunlapiv commented 3 years ago

I would love the regression test framework to be able to pull down that configured run directory from cloud storage anywhere with internet access. We are living in the future, so I don't see why that would be so hard. ;)

climbfuji commented 3 years ago

> I would love the regression test framework to be able to pull down that configured run directory from cloud storage anywhere with internet access. We are living in the future, so I don't see why that would be so hard. ;)

It's coming, actually. I do have the s3 bucket set up and ready to go, just need time to implement something like that.

climbfuji commented 3 years ago

> us lowly humans are stuck to NOAA machines.

I consider you as the privileged ones having access to wcoss! Here you go:

/scratch1/BMC/gmtb/Dom.Heinzeller/rundir_fv3_ccpp_gfsv16beta_20201203
rsdunlapiv commented 3 years ago

I used ufs-weather-model CMake followed by make install. I do not have a $BUILD_DIR/tests/fv3.exe but I do have a $BUILD_DIR/install/bin/ufs_model. Is that okay or should I build this a different way?

climbfuji commented 3 years ago

> I used ufs-weather-model CMake followed by make install. I do not have a $BUILD_DIR/tests/fv3.exe but I do have a $BUILD_DIR/install/bin/ufs_model. Is that okay or should I build this a different way?

Yes, that should do. The question is whether you have the correct suite compiled into the executable. Compile like this:

cd tests
./compile.sh macosx.gnu 'CCPP=Y DEBUG=Y' '' NO NO 2>&1 | tee compile.log
rsdunlapiv commented 3 years ago

That is giving me a CMake error:

cgdm-catania:tests dunlap$ pwd
/project/esmf/rocky/ufs/ufs-weather-model/tests
cgdm-catania:tests dunlap$ ./compile.sh macosx.gnu 'CCPP=Y DEBUG=Y' '' NO NO 2>&1 | tee compile.log
+ SECONDS=0
++ uname -s
+ [[ Darwin == Darwin ]]
++++ greadlink -f -n ./compile.sh
./compile.sh: line 16: greadlink: command not found
+++ dirname ''
++ cd .
++ pwd -P
+ readonly MYDIR=/Volumes/esmf/rocky/ufs/ufs-weather-model/tests
+ MYDIR=/Volumes/esmf/rocky/ufs/ufs-weather-model/tests
+ readonly ARGC=5
+ ARGC=5
+ [[ 5 -lt 2 ]]
+ MACHINE_ID=macosx.gnu
+ MAKE_OPT='CCPP=Y DEBUG=Y'
+ BUILD_NAME=fv3
+ clean_before=NO
+ clean_after=NO
++ cd /Volumes/esmf/rocky/ufs/ufs-weather-model/tests/..
++ pwd
+ PATHTR=/Volumes/esmf/rocky/ufs/ufs-weather-model
++ pwd
+ BUILD_DIR=/project/esmf/rocky/ufs/ufs-weather-model/tests/build_fv3
+ [[ macosx.gnu == cheyenne.* ]]
+ [[ macosx.gnu == wcoss_dell_p3 ]]
+ BUILD_JOBS=8
+ hostname
cgdm-catania
+ set +x
Setting environment variables for NEMSfv3gfs on MACOSX with gcc/gfortran or clang/gfortran
+ echo 'Compiling CCPP=Y DEBUG=Y into fv3.exe on macosx.gnu'
Compiling CCPP=Y DEBUG=Y into fv3.exe on macosx.gnu
+ CMAKE_FLAGS=
+ [[ CCPP=Y DEBUG=Y == *\D\E\B\U\G\=\Y* ]]
+ CMAKE_FLAGS=' -DDEBUG=Y'
+ [[ CCPP=Y DEBUG=Y == *\3\2\B\I\T\=\Y* ]]
+ [[ CCPP=Y DEBUG=Y == *\O\P\E\N\M\P\=\N* ]]
+ [[ CCPP=Y DEBUG=Y == *\M\U\L\T\I\_\G\A\S\E\S\=\Y* ]]
+ CMAKE_FLAGS=' -DDEBUG=Y -DMULTI_GASES=OFF'
+ [[ CCPP=Y DEBUG=Y == *\C\C\P\P\=\Y* ]]
+ mkdir -p /Volumes/esmf/rocky/ufs/ufs-weather-model/FV3/ccpp/include
+ mkdir -p /Volumes/esmf/rocky/ufs/ufs-weather-model/FMS/fms2_io/include
+ CMAKE_FLAGS=' -DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON'
+ [[ CCPP=Y DEBUG=Y == *\D\E\B\U\G\=\Y* ]]
+ CMAKE_FLAGS=' -DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug'
+ [[ CCPP=Y DEBUG=Y == *\3\2\B\I\T\=\Y* ]]
+ CMAKE_FLAGS=' -DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
+ set +ex
+ [[ CCPP=Y DEBUG=Y == *\W\W\3\=\Y* ]]
+ [[ CCPP=Y DEBUG=Y == *\S\2\S\=\Y* ]]
+ [[ CCPP=Y DEBUG=Y == *\D\A\T\M\=\Y* ]]
++ trim ' -DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
++ local 'var= -DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
++ var='-DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
++ var='-DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
++ echo -n '-DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
+ CMAKE_FLAGS='-DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
+ '[' NO = YES ']'
+ export BUILD_VERBOSE=1
+ BUILD_VERBOSE=1
+ export BUILD_DIR
+ export BUILD_JOBS
+ export CCPP_SUITES
+ export CMAKE_FLAGS
+ bash -x /Volumes/esmf/rocky/ufs/ufs-weather-model/build.sh
+ set -eu
++ uname -s
+ [[ Darwin == Darwin ]]
++++ greadlink -f -n /Volumes/esmf/rocky/ufs/ufs-weather-model/build.sh
/Volumes/esmf/rocky/ufs/ufs-weather-model/build.sh: line 5: greadlink: command not found
+++ dirname ''
++ cd .
++ pwd -P
+ readonly UFS_MODEL_DIR=/Volumes/esmf/rocky/ufs/ufs-weather-model/tests
+ UFS_MODEL_DIR=/Volumes/esmf/rocky/ufs/ufs-weather-model/tests
+ export CMAKE_C_COMPILER=mpicc
+ CMAKE_C_COMPILER=mpicc
+ export CMAKE_CXX_COMPILER=mpicxx
+ CMAKE_CXX_COMPILER=mpicxx
+ export CMAKE_Fortran_COMPILER=mpifort
+ CMAKE_Fortran_COMPILER=mpifort
+ export NETCDF=/project/esmf/rocky/ufs/hpc-stack/install-gnu9
+ NETCDF=/project/esmf/rocky/ufs/hpc-stack/install-gnu9
+ export ESMFMKFILE=/project/esmf/rocky/ufs/hpc-stack/install-gnu9/lib/esmf.mk
+ ESMFMKFILE=/project/esmf/rocky/ufs/hpc-stack/install-gnu9/lib/esmf.mk
+ BUILD_DIR=/project/esmf/rocky/ufs/ufs-weather-model/tests/build_fv3
+ mkdir -p /project/esmf/rocky/ufs/ufs-weather-model/tests/build_fv3
+ [[ -n '' ]]
+ CMAKE_FLAGS+=' -DNETCDF_DIR=/project/esmf/rocky/ufs/hpc-stack/install-gnu9'
+ cd /project/esmf/rocky/ufs/ufs-weather-model/tests/build_fv3
+ cmake /Volumes/esmf/rocky/ufs/ufs-weather-model/tests -DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF -DNETCDF_DIR=/project/esmf/rocky/ufs/hpc-stack/install-gnu9
CMake Error: The source directory "/project/esmf/rocky/ufs/ufs-weather-model/tests" does not appear to contain CMakeLists.txt.
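
The trace above shows both scripts calling greadlink, which is not found. `dirname ''` then yields `.`, so UFS_MODEL_DIR in build.sh resolves to the current directory (tests) rather than the repository root, which matches CMake not finding a CMakeLists.txt there. On macOS, greadlink usually comes from GNU coreutils (for example the Homebrew coreutils package). The sketch below is not the project's code: it shows one way to resolve a script's directory that degrades gracefully when greadlink is missing. The function name `script_dir` is illustrative.

```shell
#!/bin/sh
# Print the absolute directory that contains the given file.
# Tries greadlink first, then readlink -f, then a plain cd/pwd fallback
# (the fallback makes the path absolute but does not resolve a symlinked file).
script_dir() {
    target="$1"
    if command -v greadlink >/dev/null 2>&1; then
        resolved=$(greadlink -f "$target")
    elif resolved=$(readlink -f "$target" 2>/dev/null) && [ -n "$resolved" ]; then
        :   # readlink supports -f on this system
    else
        resolved="$(cd "$(dirname "$target")" && pwd -P)/$(basename "$target")"
    fi
    dirname "$resolved"
}

# Example: directory of the script currently running.
# script_dir "$0"
```

With a resolver like this the result no longer depends on the directory the script is started from, which is what went wrong in the trace above.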
climbfuji commented 3 years ago

That is giving me a CMake error:

cgdm-catania:tests dunlap$ pwd
/project/esmf/rocky/ufs/ufs-weather-model/tests
cgdm-catania:tests dunlap$ ./compile.sh macosx.gnu 'CCPP=Y DEBUG=Y' '' NO NO 2>&1 | tee compile.log
+ SECONDS=0
++ uname -s
+ [[ Darwin == Darwin ]]
++++ greadlink -f -n ./compile.sh
./compile.sh: line 16: greadlink: command not found
+++ dirname ''
++ cd .
++ pwd -P
+ readonly MYDIR=/Volumes/esmf/rocky/ufs/ufs-weather-model/tests
+ MYDIR=/Volumes/esmf/rocky/ufs/ufs-weather-model/tests
+ readonly ARGC=5
+ ARGC=5
+ [[ 5 -lt 2 ]]
+ MACHINE_ID=macosx.gnu
+ MAKE_OPT='CCPP=Y DEBUG=Y'
+ BUILD_NAME=fv3
+ clean_before=NO
+ clean_after=NO
++ cd /Volumes/esmf/rocky/ufs/ufs-weather-model/tests/..
++ pwd
+ PATHTR=/Volumes/esmf/rocky/ufs/ufs-weather-model
++ pwd
+ BUILD_DIR=/project/esmf/rocky/ufs/ufs-weather-model/tests/build_fv3
+ [[ macosx.gnu == cheyenne.* ]]
+ [[ macosx.gnu == wcoss_dell_p3 ]]
+ BUILD_JOBS=8
+ hostname
cgdm-catania
+ set +x
Setting environment variables for NEMSfv3gfs on MACOSX with gcc/gfortran or clang/gfortran
+ echo 'Compiling CCPP=Y DEBUG=Y into fv3.exe on macosx.gnu'
Compiling CCPP=Y DEBUG=Y into fv3.exe on macosx.gnu
+ CMAKE_FLAGS=
+ [[ CCPP=Y DEBUG=Y == *\D\E\B\U\G\=\Y* ]]
+ CMAKE_FLAGS=' -DDEBUG=Y'
+ [[ CCPP=Y DEBUG=Y == *\3\2\B\I\T\=\Y* ]]
+ [[ CCPP=Y DEBUG=Y == *\O\P\E\N\M\P\=\N* ]]
+ [[ CCPP=Y DEBUG=Y == *\M\U\L\T\I\_\G\A\S\E\S\=\Y* ]]
+ CMAKE_FLAGS=' -DDEBUG=Y -DMULTI_GASES=OFF'
+ [[ CCPP=Y DEBUG=Y == *\C\C\P\P\=\Y* ]]
+ mkdir -p /Volumes/esmf/rocky/ufs/ufs-weather-model/FV3/ccpp/include
+ mkdir -p /Volumes/esmf/rocky/ufs/ufs-weather-model/FMS/fms2_io/include
+ CMAKE_FLAGS=' -DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON'
+ [[ CCPP=Y DEBUG=Y == *\D\E\B\U\G\=\Y* ]]
+ CMAKE_FLAGS=' -DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug'
+ [[ CCPP=Y DEBUG=Y == *\3\2\B\I\T\=\Y* ]]
+ CMAKE_FLAGS=' -DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
+ set +ex
+ [[ CCPP=Y DEBUG=Y == *\W\W\3\=\Y* ]]
+ [[ CCPP=Y DEBUG=Y == *\S\2\S\=\Y* ]]
+ [[ CCPP=Y DEBUG=Y == *\D\A\T\M\=\Y* ]]
++ trim ' -DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
++ local 'var= -DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
++ var='-DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
++ var='-DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
++ echo -n '-DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
+ CMAKE_FLAGS='-DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF'
+ '[' NO = YES ']'
+ export BUILD_VERBOSE=1
+ BUILD_VERBOSE=1
+ export BUILD_DIR
+ export BUILD_JOBS
+ export CCPP_SUITES
+ export CMAKE_FLAGS
+ bash -x /Volumes/esmf/rocky/ufs/ufs-weather-model/build.sh
+ set -eu
++ uname -s
+ [[ Darwin == Darwin ]]
++++ greadlink -f -n /Volumes/esmf/rocky/ufs/ufs-weather-model/build.sh
/Volumes/esmf/rocky/ufs/ufs-weather-model/build.sh: line 5: greadlink: command not found
+++ dirname ''
++ cd .
++ pwd -P
+ readonly UFS_MODEL_DIR=/Volumes/esmf/rocky/ufs/ufs-weather-model/tests
+ UFS_MODEL_DIR=/Volumes/esmf/rocky/ufs/ufs-weather-model/tests
+ export CMAKE_C_COMPILER=mpicc
+ CMAKE_C_COMPILER=mpicc
+ export CMAKE_CXX_COMPILER=mpicxx
+ CMAKE_CXX_COMPILER=mpicxx
+ export CMAKE_Fortran_COMPILER=mpifort
+ CMAKE_Fortran_COMPILER=mpifort
+ export NETCDF=/project/esmf/rocky/ufs/hpc-stack/install-gnu9
+ NETCDF=/project/esmf/rocky/ufs/hpc-stack/install-gnu9
+ export ESMFMKFILE=/project/esmf/rocky/ufs/hpc-stack/install-gnu9/lib/esmf.mk
+ ESMFMKFILE=/project/esmf/rocky/ufs/hpc-stack/install-gnu9/lib/esmf.mk
+ BUILD_DIR=/project/esmf/rocky/ufs/ufs-weather-model/tests/build_fv3
+ mkdir -p /project/esmf/rocky/ufs/ufs-weather-model/tests/build_fv3
+ [[ -n '' ]]
+ CMAKE_FLAGS+=' -DNETCDF_DIR=/project/esmf/rocky/ufs/hpc-stack/install-gnu9'
+ cd /project/esmf/rocky/ufs/ufs-weather-model/tests/build_fv3
+ cmake /Volumes/esmf/rocky/ufs/ufs-weather-model/tests -DDEBUG=Y -DMULTI_GASES=OFF -DCCPP=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Debug -DDYN32=OFF -DNETCDF_DIR=/project/esmf/rocky/ufs/hpc-stack/install-gnu9
CMake Error: The source directory "/project/esmf/rocky/ufs/ufs-weather-model/tests" does not appear to contain CMakeLists.txt.

This is the error:

/Volumes/esmf/rocky/ufs/ufs-weather-model/build.sh: line 5: greadlink: command not found

I believe this can be resolved by

brew install coreutils
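
For context, the trace shows why the missing greadlink ends in a CMake error: the failed call returns an empty string, dirname '' yields ".", and UFS_MODEL_DIR ends up as the current directory (tests), which has no CMakeLists.txt. A minimal sketch of a more defensive lookup, assuming a hypothetical helper named resolve_script_dir that is not part of compile.sh or build.sh:

```shell
# Hypothetical helper, not taken from compile.sh or build.sh.
# Prints the physical directory that contains the given script path.
resolve_script_dir() {
  script="$1"
  if command -v greadlink >/dev/null 2>&1; then
    # GNU readlink from Homebrew coreutils (brew install coreutils)
    dirname "$(greadlink -f -n "$script")"
  else
    # Fallback without GNU readlink: resolves symlinked directories,
    # but does not follow a symlink that points at the script itself.
    (cd "$(dirname "$script")" && pwd -P)
  fi
}
```

Calling resolve_script_dir /Volumes/esmf/rocky/ufs/ufs-weather-model/build.sh would then print the directory of build.sh whether or not coreutils is installed.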
climbfuji commented 3 years ago

See https://github.com/NOAA-EMC/NCEPLIBS-external/blob/develop/doc/README_macos_clanggfortran.txt or https://github.com/NOAA-EMC/NCEPLIBS-external/blob/develop/doc/README_macos_gccgfortran.txt

rsdunlapiv commented 3 years ago

That worked. The model finished fine with ESMF8.1.0bs21 as expected. I switched to ESMF8.1.0bs27 where I expect to see the same failure as @climbfuji did.

rsdunlapiv commented 3 years ago

@climbfuji unfortunately, my run with ESMF 8.1.0bs27 worked! From PET0.ESMF_LogFile:

20201203 144112.539 INFO             PET0 Running with ESMF Version   : ESMF_8_1_0_beta_snapshot_27
20201203 144112.539 INFO             PET0 ESMF library build date/time: "Dec  3 2020" "14:14:39"
20201203 144112.539 INFO             PET0 ESMF library build location : /project/esmf/rocky/ufs/hpc-stack/pkg/ESMF_8_1_0_beta_snapshot_27

It terminated normally:

[0]      ENDING DATE-TIME    DEC 03,2020  14:47:58.918  338  THU   2459187
[0]      PROGRAM nems      HAS ENDED.
[0] * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . 
[0] *****************RESOURCE STATISTICS*******************************
[0] The total amount of wall time                        = 0.000000
[0] The total amount of time in user mode                = 389.377709
[0] The total amount of time in sys mode                 = 11.639710
[0] *****************END OF RESOURCE STATISTICS*************************
[0] 

This is with gnu/9.3.0 and mpich/3.3.1 with ESMF 8.1.0bs27 built in debug mode. I think you were using gnu/9.2.0, but I'm not convinced that that is the most likely difference. Not sure what to do next - should I try a later version of ESMF?
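
When comparing setups across machines, it may help to confirm which ESMF build each run actually picked up. A small sketch, assuming the "Running with ESMF Version" line format shown in the log excerpt above; the helper name esmf_version_from_log is made up for illustration:

```shell
# Hypothetical helper: print the ESMF version string recorded in a PET log.
# Assumes the "Running with ESMF Version   : <tag>" line format shown above.
esmf_version_from_log() {
  sed -n 's/.*Running with ESMF Version[[:space:]]*:[[:space:]]*//p' "$1" | head -n 1
}
```

For the run above, esmf_version_from_log PET0.ESMF_LogFile should print ESMF_8_1_0_beta_snapshot_27. The compiler side can be compared the same way with gfortran --version.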

rsdunlapiv commented 3 years ago

Or do we need to look at something else OS-level?

kgerheiser commented 3 years ago

I'll give it a try and see what I get

climbfuji commented 3 years ago

Thanks for testing this Rocky. Too bad. I'll go and try my other laptop next.

What macOS version is yours?

rsdunlapiv commented 3 years ago
gdm-catania:rundir_fv3_ccpp_gfsv16beta_20201203 dunlap$ system_profiler SPSoftwareDataType
Software:

    System Software Overview:

      System Version: macOS 10.14.6 (18G103)
      Kernel Version: Darwin 18.7.0
rsdunlapiv commented 3 years ago

Since I'm already set up, I'll try the latest ESMF snapshot as well to see if anything comes up....

climbfuji commented 3 years ago

gdm-catania:rundir_fv3_ccpp_gfsv16beta_20201203 dunlap$ system_profiler SPSoftwareDataType
Software:

    System Software Overview:

      System Version: macOS 10.14.6 (18G103)
      Kernel Version: Darwin 18.7.0

Mine is almost the same

    System Software Overview:

      System Version: macOS 10.14.6 (18G4032)
      Kernel Version: Darwin 18.7.0

And I had tried the native AppleClang 10.0.1 with gfortran 9.2.0, LLVM Clang 9.0.0 with gfortran 9.2.0, and GNU gcc 9.2.0 with gfortran 9.2.0.

Since I can't get a stacktrace on my mac, is there a way to crank up the verbosity of ESMF, similar to what I found in MAIN_NEMS.F90 (remove the ! in front of all CALL ESMF_LogWrite calls)? Or some debug flag/parameter that can be turned on?

rsdunlapiv commented 3 years ago

@climbfuji I don't think there is anything already in ESMF that would produce extra output within ESMF_StateCreate, which seems to be where things are dying. Maybe @theurich has a suggestion? One option would be for us to provide a branch of ESMF instrumented with some debug messages deeper into ESMF_StateCreate to see if we could track it down.

Another option would be to see if you can get one of the processes into a debugger. Since it is failing SO early, you might even be able to just run it on ONE process in GDB and it might expose this error before it complains about not running on enough PETs.
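
A rough sketch of the single-process GDB run, assuming gdb is installed (on macOS it also has to be code-signed before it can control a process; lldb would be the native alternative). The wrapper name run_in_gdb and the DRY_RUN switch are made up for illustration:

```shell
# Hypothetical wrapper: run one process under gdb in batch mode and print
# a backtrace once the program stops (for example on SIGABRT).
run_in_gdb() {
  exe="$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only show the command that would be executed.
    echo "gdb -batch -ex run -ex bt --args $exe"
  else
    gdb -batch -ex run -ex bt --args "$exe"
  fi
}
```

From the run directory, run_in_gdb ./fv3.exe starts the executable without mpiexec, so it may stop with a complaint about the PET count if the crash does not come first.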

climbfuji commented 3 years ago

@rsdunlapiv adding print statements is tedious, but works ... I know by now that it fails in this block of code (between debug statement "E" and "F"):

        ! DH*
        CALL ESMF_LogWrite("DH DEBUG inside actual flag E",ESMF_LOGMSG_INFO,rc=RC)
        ! *DH
        if (present(nestedStateList)) then
           do i=1,size(nestedStateList)
ESMF_INIT_CHECK_DEEP(ESMF_StateGetInit,nestedStateList(i),rc)
           enddo
        endif
        ! DH*
        CALL ESMF_LogWrite("DH DEBUG inside actual flag F",ESMF_LOGMSG_INFO,rc=RC)
        ! *DH

Next is to check what is in this nestedStateList ...
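
With markers like these in place, the last checkpoint reached before the abort can be pulled from each PET log. A small sketch, assuming the "DH DEBUG" prefix used above; the helper name last_debug_marker is hypothetical:

```shell
# Hypothetical helper: show the last "DH DEBUG" marker written to a PET log,
# i.e. the last checkpoint that was logged before the abort.
last_debug_marker() {
  grep 'DH DEBUG' "$1" | tail -n 1
}
```

Running last_debug_marker PET0.ESMF_LogFile would then report flag E for the failure described here. One caveat: if log output is buffered, the final messages before an abort may not have been written yet, so the true crash location could be later than the last marker in the file.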

kgerheiser commented 3 years ago

I finally got around to running it with ESMF beta 21 and 27. GCC-9, macOS 11.0, mpich 3.3.1 (built with hpc-stack).

They both fail, but I think it's unrelated to ESMF, and it gets farther than you.


FATAL from PE     0: MPP_OPEN: error in OPEN for RESTART/file.

FATAL from PE     0: MPP_OPEN: error in OPEN for RESTART/file.

Looks like it's looking for a file in RESTART which is empty. Is there some option I need to change?

I just didn't have a RESTART folder at all. Making an empty one fixed it.

It works with ESMF beta 27 and 21.

climbfuji commented 3 years ago

No, the case that I gave you runs out of the box.

kgerheiser commented 3 years ago

I had no RESTART folder. So, I just created an empty one and that fixed it.
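
For anyone else hitting the MPP_OPEN error, the workaround amounts to creating the folder before launching. A minimal sketch; the helper name ensure_restart_dir is made up, and the run directory argument is whatever directory the model is started from:

```shell
# Hypothetical helper: make sure the run directory has a RESTART folder.
# mkdir -p is a no-op if the folder already exists.
ensure_restart_dir() {
  rundir="${1:-.}"
  mkdir -p "$rundir/RESTART"
}
```

Calling ensure_restart_dir with no argument creates RESTART in the current directory.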

climbfuji commented 3 years ago

I had no RESTART folder. So, I just created an empty one and that fixed it.

Thanks for figuring that out. I am still debugging bs27 on my (weird ?) Mac ... if it hadn't been working just fine for two years up to bs21, I wouldn't be that worried.

climbfuji commented 3 years ago

Alright, adding more print statements around that present(nestedStateList) test lets the code get past the "Create the NEMS Import State" step, but then it crashes around the line "endif ! - actualFlag" in the "Create the NEMS Export State" step.

This doesn't make any sense. We are either facing a memory corruption (in ESMF) or a bug in the compiler or one of the libraries that gets used. Will try a brew update next.

rsdunlapiv commented 3 years ago

@climbfuji I agree it sounds like a memory corruption. It will be interesting to see if @kgerheiser can get it to run. Not sure if we could also get one other test on another mac machine to see how isolated the issue is.

kgerheiser commented 3 years ago

@rsdunlapiv I was able to run it without issue

climbfuji commented 3 years ago

@climbfuji I agree it sounds like a memory corruption. It will be interesting to see if @kgerheiser can get it to run. Not sure if we could also get one other test on another mac machine to see how isolated the issue is.

Since I can't get a stack trace on macOS (and even if I could it may not be helpful), we could try valgrind if I get it running on macOS (or valgrind or DDT on cheyenne with GNU 9.3, for example). Since it crashes so early on, we should have a fair chance to detect a possible memory corruption.
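
A possible shape for the valgrind attempt, assuming valgrind is available on the machine in question and works with the MPI stack there (untested here). The wrapper name run_in_valgrind and the DRY_RUN switch are made up for illustration:

```shell
# Hypothetical wrapper: launch a single rank under valgrind and write
# one log file per process (%p expands to the process ID).
run_in_valgrind() {
  exe="$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only show the command that would be executed.
    echo "mpiexec -n 1 valgrind --track-origins=yes --log-file=valgrind.%p.log $exe"
  else
    mpiexec -n 1 valgrind --track-origins=yes --log-file=valgrind.%p.log "$exe"
  fi
}
```

The --track-origins=yes option slows the run down further, but it reports where uninitialised values came from, which is useful when the suspicion is memory corruption.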

rsdunlapiv commented 3 years ago

Did you try running the whole thing on just one process through GDB?