ufs-community / ufs-srweather-app

UFS Short-Range Weather Application

Runtime error in make_grid task on Gaea-c5 after the SRW-installed conda build, likely due to library conflict #991

Closed natalie-perlin closed 5 months ago

natalie-perlin commented 7 months ago

Expected behavior

Building the SRW on Gaea-c5 completes successfully, including the conda Python base packages, the conda runtime environments, and the SRW binaries. A test case that includes preprocessing tasks such as make_grid also completes successfully.

Current behavior

The SRW build launched with "./devbuild.sh -p=gaea-c5" completes successfully. A WE2E test, grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP, fails in the very first preprocessing task, make_grid, with the following error:

/lustre/f2/scratch/ncep/Natalie.Perlin/C5/SRW/srw-ss150/exec/regional_esg_grid: /opt/cray/pe/gcc/10.3.0/snos/lib/../lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /lustre/f2/scratch/ncep/Natalie.Perlin/C5/SRW/srw-ss150/conda/lib/./libicuuc.so.73)

The code tested included changes to the Gaea-c5 software stack to use spack-stack-1.5.0 (https://github.com/ufs-community/ufs-srweather-app/pull/969). The runtime error, however, does not appear to be related to the new spack-stack being tested, but to the Cray modules and the installed conda.

Machines affected

Gaea-c5

Steps To Reproduce

git clone -b ss150 git@github.com:RatkoVasic-NOAA/ufs-srweather-app.git srw-ss150
cd srw-ss150/
module load python3
./manage_externals/checkout_externals
./devbuild.sh -p=gaea-c5
module use $PWD/modulefiles
module load wflow_gaea-c5
conda activate srw_app
./run_WE2E_tests.py -m=gaea-c5 --launch python -d -a epic -t grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP

The test eventually dies. The error can be found in the log file from the make_grid task:

cd ../../../expt_dirs/grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP/log/
view make_grid_2019061518.log

Detailed Description of Fix (optional)

Cray modules such as craype and the installed conda both use one of the standard C++ libraries, libstdc++.so.6. This library can be found in several locations on the filesystem, such as

  1. /usr/lib64/libstdc++.so.6 (system default)
  2. /opt/cray/pe/gcc/10.3.0/snos/lib64/libstdc++.so.6 (from craype module)
  3. /conda/lib/libstdc++.so.6 (conda libraries from the SRW install)

Numbers 1 and 3 include the GLIBCXX_3.4.30 version, but number 2 has only GLIBCXX_3.4.28 as its highest version. The library from location 2 is used when building the regional_esg_grid executable run by the make_grid task. When the conda environment is active at runtime, the error results from the conda library not finding a high enough GLIBCXX version.
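A quick way to confirm which GLIBCXX versions each candidate library provides is to grep its embedded version strings. A minimal sketch; the paths are the three locations listed above, and the conda path is relative to the SRW install directory:

```shell
# Print the highest GLIBCXX symbol version provided by each candidate
# libstdc++.so.6; missing paths are reported rather than skipped silently.
for lib in /usr/lib64/libstdc++.so.6 \
           /opt/cray/pe/gcc/10.3.0/snos/lib64/libstdc++.so.6 \
           ./conda/lib/libstdc++.so.6; do
  if [ -e "$lib" ]; then
    echo "$lib: $(strings "$lib" | grep -o 'GLIBCXX_[0-9.]*' | sort -V | tail -n 1)"
  else
    echo "$lib: not found on this system"
  fi
done
```

`sort -V` orders version strings numerically, so the `tail -n 1` line is the highest GLIBCXX version the library exports.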

Things tested: explicitly adding paths such as /usr/lib64 or ./conda/lib to the LD_LIBRARY_PATH or LIBRARY_PATH variables, both during the build and link stage for the executable and in the runtime environment. Not solved so far. A few more tests are planned to get a clearer picture of what has been tried before filing a ticket with the Gaea-c5 help desk.
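For reference, the attempted (and so far unsuccessful) workaround looks roughly like the following sketch; SRW_DIR is a placeholder for the SRW install directory:

```shell
# Put the newer-GLIBCXX libstdc++ locations ahead of the Cray GCC one,
# for the link step (LIBRARY_PATH) and for the runtime (LD_LIBRARY_PATH).
SRW_DIR=${SRW_DIR:-$PWD}
export LIBRARY_PATH="/usr/lib64:${LIBRARY_PATH:-}"
export LD_LIBRARY_PATH="${SRW_DIR}/conda/lib:/usr/lib64:${LD_LIBRARY_PATH:-}"
echo "$LD_LIBRARY_PATH"
```

As noted above, this did not resolve the error, likely because the Cray GCC library path is injected by the compiler wrappers ahead of these variables.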


natalie-perlin commented 7 months ago

The following approach is suggested for Gaea-c5 while the F2 filesystem is in use: do not build/install the SRW conda environment; instead, use the earlier miniconda3 and conda environments. A PR has been made (https://github.com/RatkoVasic-NOAA/ufs-srweather-app/tree/ss150) that contributes to https://github.com/ufs-community/ufs-srweather-app/pull/969. When the F5 filesystem is ready, this needs to be re-evaluated and a different solution found so that the SRW-built conda packages can be used.

The reasons are as follows.

Diagnostics:

With the current SRW conda installation (version 23.3.1-1), the library libstdc++.so.6 is installed in ./conda/lib/; the version string GLIBCXX_3.4.30 from that library is also required by another shared library in the conda location, ./conda/lib/libicuuc.so.73. At install and runtime, another libstdc++.so.6 happens to be in the path because of one of the mandatory Cray modules (craype): /opt/cray/pe/gcc/10.3.0/snos/lib/../lib64/libstdc++.so.6. The highest GLIBCXX string in that library is only 3.4.28 (a lower version). With the way the SRW submodules are configured, several binaries are built with that library (highest GLIBCXX is 3.4.28), while other binaries are built with /usr/lib64/libstdc++.so.6 (highest GLIBCXX is 3.4.30). See the attached screenshot, which checks the libstdc++.so.6 library used by each of the SRW binaries:

(Screenshot: SRW_libstdcxx_1)
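The screenshot's check can be reproduced in text form with ldd. A sketch, assuming it is run from the SRW install directory, where devbuild.sh places the binaries under exec/:

```shell
# For every SRW executable, print which libstdc++.so.6 the dynamic loader
# resolves; the third ldd field is the resolved library path.
for exe in exec/*; do
  [ -x "$exe" ] || continue
  printf '%-30s %s\n' "$(basename "$exe")" \
    "$(ldd "$exe" 2>/dev/null | awk '/libstdc\+\+\.so\.6/ {print $3}')"
done
```

Binaries resolving to the /opt/cray/pe/gcc path are the ones at risk of the GLIBCXX_3.4.30 error once the conda environment is active.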

The runtime error is as follows:

/lustre/f2/scratch/ncep/Natalie.Perlin/C5/SRW/srw-ss150/exec/regional_esg_grid: /opt/cray/pe/gcc/10.3.0/snos/lib/../lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /lustre/f2/scratch/ncep/Natalie.Perlin/C5/SRW/srw-ss150/conda/lib/./libicuuc.so.73)

Possible solution #1

In order to have GLIBCXX_3.4.30 available at runtime, the CMake configuration (CMakeLists.txt) of the SRW submodules and subdirectories could be explicitly set to use a libstdc++.so.6 with the higher GLIBCXX_3.4.30 version, for example /usr/lib64/libstdc++.so.6. The following snapshot shows all the SRW binaries using /usr/lib64/libstdc++.so.6:

(Screenshot: SRW_libstdcxx_2)
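One way to pin the build to a specific libstdc++ is to pass an explicit library search path and rpath through the CMake configure step. This is an illustrative sketch, not the exact change that was made:

```shell
# Hypothetical configure invocation: link against /usr/lib64's libstdc++ and
# bake that directory into the runtime search path of the executables.
cmake -DCMAKE_EXE_LINKER_FLAGS="-L/usr/lib64 -Wl,-rpath,/usr/lib64" ..
```

An rpath makes the choice stick at runtime regardless of which Cray modules are loaded, whereas relying on LD_LIBRARY_PATH alone can still be overridden by module-injected paths.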

Issues with solution #1

While the make_grid task could proceed further after applying solution #1, it turns out that other binaries from the spack-stack modules were still built against /opt/cray/pe/gcc/10.3.0/snos/lib/../lib64/libstdc++.so.6, whose highest GLIBCXX is 3.4.28. That causes an error similar to the original one when the ncdump binary is used. Reinstalling and relinking all spack-stack binaries against /usr/lib64/libstdc++.so.6 is not a viable option.

Possible solution #2

Use an SRW conda installation of a lower version than the one currently given in the devbuild.sh script, i.e. one whose libstdc++.so.6 has a lower GLIBCXX version. This could be tested; however, the current Gaea-c5 F2 filesystem is going to be decommissioned soon.

Current solution #3

For the time being, use the older miniconda3/4.12.0 module and the conda environments (regional_workflow, workflow_tools) already available on Gaea-c5, and bypass building conda and its environments in the SRW devbuild.sh, similar to the solution for the wcoss2 platform.

Future solution #4

When the F5 filesystem is ready and in production, newer system libraries and modules will be built. At that point, re-evaluate installing conda as part of the SRW.

MichaelLueken commented 6 months ago

@natalie-perlin - Following the transition from the F2 to the F5 filesystem, I have successfully compiled the SRW App using the CONDA_BUILD methodology. However, both the make_grid and the make_ics/make_lbcs tasks fail with one of the following errors:

/gpfs/f5/epic/scratch/Michael.Lueken/ufs-srweather-app/gaeac5/install_intel/exec/regional_esg_grid: symbol lookup error: /usr/lib64/libssh.so.4: undefined symbol: EVP_KDF_CTX_new_id, version OPENSSL_1_1_1d

/gpfs/f5/epic/scratch/Michael.Lueken/ufs-srweather-app/gaeac5/install_intel/exec/chgres_cube: symbol lookup error: /usr/lib64/libssh.so.4: undefined symbol: EVP_KDF_CTX_new_id, version OPENSSL_1_1_1d

Using ldd on regional_esg_grid and chgres_cube, I don't even see libssh.so.4 being linked.
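Since ldd only lists direct link-time dependencies, a library such as libssh.so.4 can still enter the process through a transitively loaded or dlopen'ed object (curl-based I/O layers are a common source). The glibc loader's LD_DEBUG facility can reveal which object requests it. A sketch, assuming the failing executable sits in exec/:

```shell
# Trace the dynamic loader's library searches and keep only the lines
# mentioning libssh; the "needed by" context shows which object pulls it in.
if [ -x exec/regional_esg_grid ]; then
  LD_DEBUG=libs exec/regional_esg_grid 2>&1 | grep 'libssh'
fi
```

The trace output would also show the full search order, which helps when a system /usr/lib64 library shadows the one a module expects.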

gaea57 install_intel/exec> ldd regional_esg_grid 
    linux-vdso.so.1 (0x00007ffe23b33000)
    libnetcdff.so.7 => not found
    libnetcdf.so.19 => not found
    libm.so.6 => /lib64/libm.so.6 (0x00007feee7761000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007feee775a000)
    libxpmem.so.0 => /opt/cray/xpmem/default/lib64/libxpmem.so.0 (0x00007feee7757000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007feee7733000)
    libc.so.6 => /lib64/libc.so.6 (0x00007feee753c000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007feee7518000)
    /lib64/ld-linux-x86-64.so.2 (0x00007feee78ce000)
gaea57 install_intel/exec> ldd chgres_cube 
    linux-vdso.so.1 (0x00007ffd0bbe7000)
    libpng16.so.16 => /usr/lib64/libpng16.so.16 (0x00007f8b5e600000)
    libz.so.1 => /lib64/libz.so.1 (0x00007f8b5e8ae000)
    libjasper.so.4 => /usr/lib64/libjasper.so.4 (0x00007f8b5e200000)
    libjpeg.so.62 => /usr/lib64/libjpeg.so.62 (0x00007f8b5de00000)
    libsci_intel_mpi_mp.so.5 => /opt/cray/pe/lib64/libsci_intel_mpi_mp.so.5 (0x00007f8b5d400000)
    libsci_intel_mp.so.5 => /opt/cray/pe/lib64/libsci_intel_mp.so.5 (0x00007f8b59a00000)
    libiomp5.so => /opt/intel/oneapi/compiler/2023.1.0/linux/compiler/lib/intel64/libiomp5.so (0x00007f8b59400000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f8b5e8a2000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f8b5e89d000)
    libnetcdf.so.19 => not found
    libnetcdff.so.7 => not found
    libm.so.6 => /lib64/libm.so.6 (0x00007f8b5e4b4000)
    libpioc.so => not found
    libirng.so => /opt/intel/oneapi/compiler/2023.1.0/linux/compiler/lib/intel64/libirng.so (0x00007f8b59000000)
    libcilkrts.so.5 => /opt/intel/oneapi/compiler/2023.1.0/linux/compiler/lib/intel64/libcilkrts.so.5 (0x00007f8b58c00000)
    libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f8b589bb000)
    libmpifort_intel.so.12 => /opt/cray/pe/lib64/libmpifort_intel.so.12 (0x00007f8b58600000)
    libmpi_intel.so.12 => /opt/cray/pe/lib64/libmpi_intel.so.12 (0x00007f8b55a00000)
    libxpmem.so.0 => /opt/cray/xpmem/default/lib64/libxpmem.so.0 (0x00007f8b5e896000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f8b5e872000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f8b5e84e000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f8b55809000)
    libjpeg.so.8 => /usr/lib64/libjpeg.so.8 (0x00007f8b55400000)
    libifcoremt.so.5 => /opt/intel/oneapi/compiler/2023.1.0/linux/compiler/lib/intel64/libifcoremt.so.5 (0x00007f8b5dc86000)
    libsvml.so => /opt/intel/oneapi/compiler/2023.1.0/linux/compiler/lib/intel64/libsvml.so (0x00007f8b53dd7000)
    libimf.so => /opt/intel/oneapi/compiler/2023.1.0/linux/compiler/lib/intel64/libimf.so (0x00007f8b539ed000)
    libintlc.so.5 => /opt/intel/oneapi/compiler/2023.1.0/linux/compiler/lib/intel64/libintlc.so.5 (0x00007f8b5e189000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f8b5e8e7000)
    libifcore.so.5 => /opt/intel/oneapi/compiler/2023.1.0/linux/compiler/lib/intel64/libifcore.so.5 (0x00007f8b598be000)
    libfabric.so.1 => /usr/lib64/libfabric.so.1 (0x00007f8b53400000)
    libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x00007f8b5e4aa000)
    libpmi.so.0 => /opt/cray/pe/lib64/libpmi.so.0 (0x00007f8b53000000)
    libpmi2.so.0 => /opt/cray/pe/lib64/libpmi2.so.0 (0x00007f8b52c00000)
    libifport.so.5 => /opt/intel/oneapi/compiler/2023.1.0/linux/compiler/lib/intel64/libifport.so.5 (0x00007f8b5e481000)
    librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00007f8b52800000)
    libefa.so.1 => /usr/lib64/libefa.so.1 (0x00007f8b5e476000)
    libnl-3.so.200 => /usr/lib64/libnl-3.so.200 (0x00007f8b52400000)
    libpsm_infinipath.so.1 => /usr/lib64/libpsm_infinipath.so.1 (0x00007f8b52000000)
    libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007f8b51c00000)
    libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00007f8b51800000)
    libuuid.so.1 => /usr/lib64/libuuid.so.1 (0x00007f8b5e46d000)
    libpals.so.0 => /opt/cray/pe/lib64/libpals.so.0 (0x00007f8b51400000)
    libinfinipath.so.4 => /usr/lib64/libinfinipath.so.4 (0x00007f8b51000000)
    libnl-route-3.so.200 => /usr/lib64/libnl-route-3.so.200 (0x00007f8b50c00000)
    libjansson.so.4 => /usr/lib64/libjansson.so.4 (0x00007f8b50800000)

It's not clear to me what the issue is. I've reached out to the Gaea system administrators to see if they may be able to shed some light on what is happening.