ufs-community / ufs-weather-model

UFS Weather Model

MPI_Type_contiguous Encounters Invalid Count #2227

Closed spanNOAA closed 1 month ago

spanNOAA commented 1 month ago

Description

An MPI-related fatal error occurred during the execution of the code, leading to job cancellation.

To Reproduce:

Compilers: intel/2022.1.2, impi/2022.1.2, stack-intel/2021.5.0, stack-intel-oneapi-mpi/2021.5.1
Platform: Hera (Rocky 8)

  1. Copy the canned test case from /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/comroot/fcst.test
  2. Submit the Slurm job submit_ufs_model.sh.
  3. Check the output file slurm-${jobid}.out

Additional context

The problem specifically arises on MPI rank 2304.

Output

ufs_model_crash.log

jkbk2004 commented 1 month ago
2304: Abort(805961730) on node 2304 (rank 2304 in comm 0): Fatal error in PMPI_Type_contiguous: Invalid count, error stack:
2304: PMPI_Type_contiguous(271): MPI_Type_contiguous(count=-2056576882, MPI_BYTE, new_type_p=0x7ffd536cb594) failed
2304: PMPI_Type_contiguous(238): Negative count, value is -2056576882

The MPI_Type_contiguous count cannot be negative, and there is no direct call to MPI_Type_contiguous in the model code base. @spanNOAA can you run exactly the same canned case on another machine, such as Orion or Hercules, so we can see whether this is a root cause in the code or an MPI package installation issue?
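
For scale: interpreted as an unsigned 32-bit value, the reported count of -2056576882 corresponds to 2,238,390,414 bytes (about 2.1 GiB), i.e. a byte count just over the 2^31-1 limit of the 32-bit int count argument in the classic MPI C bindings. A minimal sketch (not from the model code; the byte count below is simply the value back-computed above) of how such a size goes negative:

```c
/* Sketch only: shows how a >2 GiB byte count wraps negative when narrowed to
 * the 32-bit "int count" used by MPI_Type_contiguous(count, MPI_BYTE, ...).
 * The byte count is just 2^32 - 2056576882, back-computed from the error log. */
#include <stdio.h>
#include <stddef.h>

int main(void) {
    size_t nbytes = 2238390414u;  /* ~2.1 GiB, larger than INT_MAX (2147483647) */
    int count = (int)nbytes;      /* narrowing conversion; wraps on typical two's-complement systems */
    printf("nbytes = %zu -> int count = %d\n", nbytes, count);
    /* prints: nbytes = 2238390414 -> int count = -2056576882,
       matching "Negative count, value is -2056576882" in the error stack */
    return 0;
}
```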

jkbk2004 commented 1 month ago
2304: Abort(805961730) on node 2304 (rank 2304 in comm 0): Fatal error in PMPI_Type_contiguous: Invalid count, error stack:
2304: PMPI_Type_contiguous(271): MPI_Type_contiguous(count=-2056576882, MPI_BYTE, new_type_p=0x7ffd536cb594) failed
2304: PMPI_Type_contiguous(238): Negative count, value is -2056576882

The MPI_Type_contiguous count cannot be negative, and there is no direct call to MPI_Type_contiguous in the model code base. @spanNOAA can you run exactly the same canned case on another machine, such as Orion or Hercules, so we can see whether this is a root cause in the code or an MPI package installation issue?

@junwang-noaa @DusanJovic-NOAA @spanNOAA I don't know if compiling with -traceback might be a good option to trace this case.

JessicaMeixner-NOAA commented 1 month ago

Just wanted to post here that I also got this error as did @ChristianBoyer-NOAA from the physics team trying to run a C768 test case from the g-w (develop branch as of today).

DusanJovic-NOAA commented 1 month ago

Just wanted to post here that I also got this error as did Christian Boyer from the physics team trying to run a C768 test case from the g-w (develop branch as of today).

Do they also see this error on Hera? Could it be related to an update of the OS? Has anyone made a successful C768 run on Hera recently?

DusanJovic-NOAA commented 1 month ago
2304: Abort(805961730) on node 2304 (rank 2304 in comm 0): Fatal error in PMPI_Type_contiguous: Invalid count, error stack:
2304: PMPI_Type_contiguous(271): MPI_Type_contiguous(count=-2056576882, MPI_BYTE, new_type_p=0x7ffd536cb594) failed
2304: PMPI_Type_contiguous(238): Negative count, value is -2056576882

The MPI_Type_contiguous count cannot be negative, and there is no direct call to MPI_Type_contiguous in the model code base. @spanNOAA can you run exactly the same canned case on another machine, such as Orion or Hercules, so we can see whether this is a root cause in the code or an MPI package installation issue?

@junwang-noaa @DusanJovic-NOAA @spanNOAA I don't know if compiling with -traceback might be a good option to trace this case.

We do compile the code with the -traceback flag by default.

JessicaMeixner-NOAA commented 1 month ago

@DusanJovic-NOAA - @ChristianBoyer-NOAA has not been able to run C768 successfully since the Rocky 8 transition. I just ran a case, got the same error, and then saw this issue reporting the same problem. I have asked a few people, and I don't know of anyone who has successfully run C768 on Hera since the Rocky 8 transition.

SamuelTrahanNOAA commented 1 month ago

I can run the global static nest configuration with both my modified global-workflow and the HAFS workflow. I haven't tried a globe without a nest.

EDIT: Those are both atmosphere-only forecast-only cases.

zhanglikate commented 1 month ago

Just wanted to post here that I got the same issue when I ran C768 on Rocky 8 Hera. I cloned the latest version of the global workflow for Rocky 8 (April 2 version, commit c54fe98c4fe8d811907366d4ba6ff16347bf174c) and tried the C768 run with ATM only; however, it always crashes on Hera Rocky 8 with the information shown below. I did not see this issue with C384 or C96. This is the log file: /scratch1/BMC/gsd-fv3-dev/NCEPDEV/global/Kate.Zhang/fv3gfs/comrot/TC768/logs/2020070100/gfsfcst.log

This is the job submit directory: /scratch2/BMC/gsd-fv3-dev/NCEPDEV/global/Kate.Zhang/fv3gfs/expdir/TC768

@JessicaMeixner-NOAA @DusanJovic-NOAA @spanNOAA @junwang-noaa

SamuelTrahanNOAA commented 1 month ago

Here are the relevant lines of @zhanglikate's log file.

EDIT: Here is just the error message:

4608: Abort(470417410) on node 4608 (rank 4608 in comm 0): Fatal error in PMPI_Type_contiguous: Invalid count, error stack:
4608: PMPI_Type_contiguous(271): MPI_Type_contiguous(count=-2057309534, MPI_BYTE, new_type_p=0x7ffc225e8994) failed
4608: PMPI_Type_contiguous(238): Negative count, value is -2057309534
   0: slurmstepd: error: *** STEP 58066771.0 ON h3c39 CANCELLED AT 2024-04-08T06:27:30 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: h15c49: tasks 5080-5119: Killed
Expand to see modules, versions, prologue, epilogue, etc.
```
Begin fcst.sh at Mon Apr 8 06:20:19 UTC 2024
... many lines of stuff ...
Running "module reset". Resetting modules to system default.
The following $MODULEPATH directories have been removed:
  /scratch1/BMC/gmtb/software/modulefiles/generic
  /scratch2/NCEPDEV/nwprod/NCEPLIBS/modulefiles

Currently Loaded Modules:
  1) contrib                          42) sp/2.5.0
  2) intel/2022.1.2                   43) ip/4.3.0
  3) stack-intel/2021.5.0             44) grib-util/1.3.0
  4) impi/2022.1.2                    45) g2tmpl/1.10.2
  5) stack-intel-oneapi-mpi/2021.5.1  46) gsi-ncdiag/1.1.2
  6) gettext/0.19.8.1                 47) crtm-fix/2.4.0.1_emc
  7) libxcrypt/4.4.35                 48) git-lfs/2.10.0
  8) zlib/1.2.13                      49) crtm/2.4.0.1
  9) sqlite/3.43.2                    50) openblas/0.3.24
 10) util-linux-uuid/2.38.1           51) py-setuptools/63.4.3
 11) python/3.11.6                    52) py-numpy/1.23.4
 12) hpss/hpss                        53) bufr/11.7.0
 13) gempak/7.4.2                     54) gmake/3.82
 14) ncl/6.6.2                        55) wgrib2/2.0.8
 15) libjpeg/2.1.0                    56) py-cftime/1.0.3.4
 16) jasper/2.0.32                    57) py-netcdf4/1.5.8
 17) libpng/1.6.37                    58) libyaml/0.2.5
 18) openjpeg/2.3.1                   59) py-pyyaml/6.0
 19) eccodes/2.32.0                   60) py-markupsafe/2.1.3
 20) fftw/3.3.10                      61) py-jinja2/3.1.2
 21) nghttp2/1.57.0                   62) py-bottleneck/1.3.7
 22) curl/8.4.0                       63) py-numexpr/2.8.4
 23) proj/8.1.0                       64) py-et-xmlfile/1.0.1
 24) udunits/2.2.28                   65) py-openpyxl/3.1.2
 25) cdo/2.2.0                        66) py-pytz/2023.3
 26) R/3.5.0                          67) py-pyxlsb/1.0.10
 27) perl/5.38.0                      68) py-xlrd/2.0.1
 28) pkg-config/0.27.1                69) py-xlsxwriter/3.1.7
 29) hdf5/1.14.0                      70) py-xlwt/1.3.0
 30) snappy/1.1.10                    71) py-pandas/1.5.3
 31) zstd/1.5.2                       72) py-six/1.16.0
 32) c-blosc/1.21.5                   73) py-python-dateutil/2.8.2
 33) netcdf-c/4.9.2                   74) g2c/1.6.4
 34) netcdf-fortran/4.6.1             75) netcdf-cxx4/4.3.1
 35) antlr/2.7.7                      76) met/9.1.3
 36) gsl/2.7.1                        77) metplus/3.1.1
 37) nco/5.0.6                        78) py-packaging/23.1
 38) bacio/2.4.1                      79) py-xarray/2023.7.0
 39) w3emc/2.10.0                     80) prepobs/1.0.1
 40) prod_util/2.1.1                  81) fit2obs/1.0.0
 41) g2/3.4.5                         82) module_base.hera

... many lines of stuff ...
+ exglobal_forecast.sh[153]: unset OMP_NUM_THREADS
+ exglobal_forecast.sh[158]: /bin/cp -p /scratch1/BMC/gsd-fv3-dev/lzhang/Rocky8/global-workflow/exec/ufs_model.x /scratch1/NCEPDEV/stmp2/Kate.Zhang/RUNDIRS/TC768/fcst.424943/
+ exglobal_forecast.sh[159]: srun -l --export=ALL -n 6560 /scratch1/NCEPDEV/stmp2/Kate.Zhang/RUNDIRS/TC768/fcst.424943/ufs_model.x
0:
0:
0: * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
0: PROGRAM ufs-weather-model HAS BEGUN. COMPILED 0.00 ORG: np23
0: STARTING DATE-TIME APR 08,2024 06:21:38.287 99 MON 2460409
0:
0:
0: MPI Library = Intel(R) MPI Library 2021.5 for Linux* OS
0:
0: MPI Version = 3.1
... many lines of stuff ...
0: PASS: fcstRUN phase 1, n_atmsteps = 113 time is 1.956091
0: PASS: fcstRUN phase 2, n_atmsteps = 113 time is 0.027836
4: ncells= 5
4: nlives= 12
4: nthresh= 18.0000000000000
4608: Abort(470417410) on node 4608 (rank 4608 in comm 0): Fatal error in PMPI_Type_contiguous: Invalid count, error stack:
4608: PMPI_Type_contiguous(271): MPI_Type_contiguous(count=-2057309534, MPI_BYTE, new_type_p=0x7ffc225e8994) failed
4608: PMPI_Type_contiguous(238): Negative count, value is -2057309534
0: slurmstepd: error: *** STEP 58066771.0 ON h3c39 CANCELLED AT 2024-04-08T06:27:30 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: h15c49: tasks 5080-5119: Killed
srun: Terminating StepId=58066771.0
srun: error: h15c48: tasks 5040-5079: Killed
srun: error: h22c04: tasks 5320-5359: Killed
... many lines of stuff ...
+ exglobal_forecast.sh[1]: postamble exglobal_forecast.sh 1712557241 137
+ preamble.sh[70]: set +x
End exglobal_forecast.sh at 06:27:31 with error code 137 (time elapsed: 00:06:50)
+ JGLOBAL_FORECAST[1]: postamble JGLOBAL_FORECAST 1712557222 137
+ preamble.sh[70]: set +x
End JGLOBAL_FORECAST at 06:27:31 with error code 137 (time elapsed: 00:07:09)
+ fcst.sh[1]: postamble fcst.sh 1712557219 137
+ preamble.sh[70]: set +x
End fcst.sh at 06:27:31 with error code 137 (time elapsed: 00:07:12)
_______________________________________________________________
Start Epilog on node h3c39 for job 58066771 :: Mon Apr 8 06:27:31 UTC 2024
Job 58066771 finished for user Kate.Zhang in partition hera with exit code 137:0
_______________________________________________________________
End Epilogue Mon Apr 8 06:27:31 UTC 2024
```
junwang-noaa commented 1 month ago

@XiaqiongZhou-NOAA Please see the issue here. My understanding is that you got the same error on WCOSS2 and Orion. Would you please try the 3/11 model version (5b62e1aa2e67ea58680d58ab16264d69b4085ea8) on WCOSS2 to see if you still get this error? Thanks

From Kate: I got the model crash on both WCOSS2 and Orion with the same error information. The UFS model is the March 22 version. I also got the same error on Hercules with the UFS Feb.21 version.

Abort(1007294466) on node 2304 (rank 2304 in comm 0): Fatal error in PMPI_Type_contiguous: Invalid count, error stack:
2304: PMPI_Type_contiguous(275): MPI_Type_contiguous(count=-2056678757, MPI_BYTE, new_type_p=0x7ffe7b8d3b54) failed
2304: PMPI_Type_contiguous(243): Negative count, value is -2056678757

The log files are here:
  /work2/noaa/stmp/xzhou/c768/logs/2020010200 (Hercules)
  /lfs/h2/emc/ptmp/xiaqiong.zhou/c768_ctl/logs/2020010200 (WCOSS2)
  /work/noaa/stmp/xzhou/c768/logs/2020010200 (Orion)
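
One thing worth noting across these reports: every negative count seen so far decodes to nearly the same byte count when reinterpreted as an unsigned 32-bit value, which points at the same oversized buffer on each machine. A small sketch of that decoding (the values are copied from the error messages in this thread):

```c
/* Sketch: reinterpret the negative MPI_Type_contiguous counts reported in this
 * thread as unsigned 32-bit byte counts. Values are copied from the error logs. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int32_t counts[] = { -2056576882, -2057309534, -2056678757 };
    for (int i = 0; i < 3; i++) {
        uint32_t bytes = (uint32_t)counts[i];   /* two's-complement reinterpretation */
        printf("count %d -> %u bytes (~%.2f GiB)\n", counts[i], bytes,
               bytes / 1073741824.0);
    }
    /* all three come out at roughly 2.08 GiB, just over the 2^31-1 int limit */
    return 0;
}
```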

SamuelTrahanNOAA commented 1 month ago

My successful runs use an older version of the scripts, but they do use the latest code.

junwang-noaa commented 1 month ago

@SamuelTrahanNOAA are you running the C768 global in your global static nest configuration case?

zhanglikate commented 1 month ago

Judy had a GSL version working before, based on the EMC Jan 2024 version: https://github.com/NOAA-GSL/global-workflow/tree/gsl_ufs_rt. However, it cannot run after the OS transition to Rocky 8.

My successful runs use an older version of the scripts, but they do use the latest code.

spanNOAA commented 1 month ago
2304: Abort(805961730) on node 2304 (rank 2304 in comm 0): Fatal error in PMPI_Type_contiguous: Invalid count, error stack:
2304: PMPI_Type_contiguous(271): MPI_Type_contiguous(count=-2056576882, MPI_BYTE, new_type_p=0x7ffd536cb594) failed
2304: PMPI_Type_contiguous(238): Negative count, value is -2056576882

The MPI_Type_contiguous count cannot be negative, and there is no direct call to MPI_Type_contiguous in the model code base. @spanNOAA can you run exactly the same canned case on another machine, such as Orion or Hercules, so we can see whether this is a root cause in the code or an MPI package installation issue?

I've attempted the canned case on Orion, and unfortunately the same issue persists; it still occurs on rank 2304. However, I have no problem running C384.

SamuelTrahanNOAA commented 1 month ago

@SamuelTrahanNOAA are you running the C768 global in your global static nest configuration case?

I've run the C96, C192, and C384 with the latest version of my workflow. In an hour or two, I'll test the C768 with the latest version. (I have to copy the new fix files and ICs I generated and regenerate the expdir.)

I have not merged the latest develop scripts. I'm still using older scripts, but I am using newer ufs-weather-model code. My code has two bug fixes, but they are unlikely to be related to this problem (#2201)

JessicaMeixner-NOAA commented 1 month ago

Has anyone opened a Hera help desk ticket on this issue by any chance?

kayeekayee commented 1 month ago

GSL real-time experiments ran the C768 case until 4/3, when the OS was completely updated to Rocky 8: /scratch1/BMC/gsd-fv3/rtruns/UFS-CAMsuite. Here is the version that works for C768 in our real-time runs:

12Jan24 global-workflow
UFS: 29Jan24, 625ac02
FV3: 29Jan24, bd38c56 (GSL: 28Feb24, a439cc7)
UPP: 07Nov23, 78f369b
UFS_UTILS: 22Dec23, ce385ce

You can see the gfsfcst log here: /scratch1/BMC/gsd-fv3/rtruns/UFS-CAMsuite/FV3GFSrun/rt_v17p8_ugwpv1_mynn/logs/2024040200/gfsfcst.log

junwang-noaa commented 1 month ago

@kayeekayee Thanks for the information. So the model version from Jan 29, 2024 works fine.

I am wondering whether anyone has run the C768 model with a more recent version. Since the same error showed up on WCOSS2 and Orion, I suspect the code updates may be the cause.

SamuelTrahanNOAA commented 1 month ago

I'm able to run with this version of the code:

My test is a C768 resolution globe rotated and stretched, with a nest added inside one global tile. (The script calls it CASE=W768.) It won't run without the fixes in that PR due to some bugs in the nesting framework which break GFS physics.

EDIT: I can give people instructions on how to run the nested global configuration if you want to try my working test case. It uses the global-workflow, but an older version, and forecast-only.

junwang-noaa commented 1 month ago

Thanks, @SamuelTrahanNOAA. How many tasks are you using for the C768 global domain?

@spanNOAA @JessicaMeixner-NOAA @zhanglikate @XiaqiongZhou-NOAA Would you like to try Sam's version to build the executable and see if you can run the C768 test case?

SamuelTrahanNOAA commented 1 month ago

I'm using 2 threads. This is the task geometry:

I don't know why the write groups need 27 compute nodes each, but they run out of memory if I give them fewer, even without the post.

The reason for this vast 210-node task geometry is that it finishes a five-day forecast in under eight hours.

JessicaMeixner-NOAA commented 1 month ago

@ChristianBoyer-NOAA would you have time to try this? I will not have time to try this until next week, but will try it then.

zhanglikate commented 1 month ago

Sure, I can give it a try. Please let me know how to test it in the global workflow environment. Thanks.

Kate


SamuelTrahanNOAA commented 1 month ago

I doubt my PR will fix the problem, but you can try it if you wish. It should be a drop-in replacement for the sorc/ufs_model.fd directory in the global-workflow.

zhanglikate commented 1 month ago

@SamuelTrahanNOAA Can you send your code path to me? Thanks.

lisa-bengtsson commented 1 month ago

I wonder if it is related to the physics suite. Sam is running the global_nest_v1 suite; I'm not sure which physics suite GSL is running in their C768 experiments referenced above, but it would be interesting to know whether the issue is specific to the GFS physics suite.

SamuelTrahanNOAA commented 1 month ago

I wonder if it is related to the physics suite. Sam is running the global_nest_v1 suite; I'm not sure which physics suite GSL is running in their C768 experiments referenced above, but it would be interesting to know whether the issue is specific to the GFS physics suite.

No.

Also: The crash is coming from the write component, not the compute ranks.

lisa-bengtsson commented 1 month ago

I wonder if it is related to the physics suite. Sam is running the global_nest_v1 suite; I'm not sure which physics suite GSL is running in their C768 experiments referenced above, but it would be interesting to know whether the issue is specific to the GFS physics suite.

No.

  • My successful global-workflow runs used the GFS suite.
  • My successful HAFS AR workflow runs used the global_nest_v1 suite.

Also: The crash is coming from the write component, not the compute ranks.

Ok, thanks for clarifying!

SamuelTrahanNOAA commented 1 month ago

Can you send your code path to me? Thanks

It is better for you to compile it yourself. This might work:

cd global-workflow/sorc/ufs_model.fd
git stash
git remote add sam https://github.com/SamuelTrahanNOAA/ufs-weather-model
git fetch sam
git checkout -b nesting-fixes sam/nesting-fixes
git submodule sync
git submodule update --init --recursive --force
cd ..
./build_ufs.sh
zhanglikate commented 1 month ago

Sam,

 I got it.

commit 811c90d48758984f1510772a12909a3d0aa09c53 (HEAD -> nesting-fixes, origin/nesting-fixes)
Merge: 1712e506 87c27b92
Author: samuel.trahan @.***>
Date:   Mon Apr 1 17:39:50 2024 +0000

    merge upstream develop


SamuelTrahanNOAA commented 1 month ago

Ah! Bad news. My run failed when I ran it in the global-workflow, even though it succeeded outside the global-workflow:

6240: Abort(805961730) on node 6240 (rank 6240 in comm 0): Fatal error in PMPI_Type_contiguous: Invalid count, error stack:
6240: PMPI_Type_contiguous(271): MPI_Type_contiguous(count=-2057274945, MPI_BYTE, new_type_p=0x7ffcb2273514) failed
6240: PMPI_Type_contiguous(238): Negative count, value is -2057274945
   0: slurmstepd: error: *** STEP 58100582.0 ON h1c26 CANCELLED AT 2024-04-08T23:47:11 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

EDIT: It takes 7.8 hours for that job to finish, so I didn't know for quite a long time.

SamuelTrahanNOAA commented 1 month ago

Actually, that's mixed news. It means I might be able to narrow down what the global-workflow is doing differently to break things.

EDIT: I see four obvious differences between my successful runs in HAFS-AR and staged test cases vs this latest failed run in the global-workflow:

  1. global-workflow sets different environment variables before running.
  2. global-workflow uses different modulefiles.
  3. global-workflow uses ESMF threading while HAFS-AR and my staged test cases don't.
  4. global-workflow has numerous symbolic links instead of regular files.

I've submitted a test job to test items 1 and 2 together. My hunch is that item 3 is causing the problem.

ChristianBoyer-NOAA commented 1 month ago

@JessicaMeixner-NOAA I'll be able to try it today.

SamuelTrahanNOAA commented 1 month ago

@ChristianBoyer-NOAA - Please try with and without ESMF threading, if you can. The global-workflow can only use ESMF threading; they dropped support for straight OpenMP threading a while ago.

ChristianBoyer-NOAA commented 1 month ago

@SamuelTrahanNOAA - Sounds good. I will give it a try with and without ESMF threading.

junwang-noaa commented 1 month ago

@SamuelTrahanNOAA Thanks for doing the testing. I think @aerorahul has the code changes in G-W to use both ESMF managed threading and traditional threading.

jkbk2004 commented 1 month ago

@SamuelTrahanNOAA Thanks for doing the testing. I think @aerorahul has the code changes in G-W to use both ESMF managed threading and traditional threading.

G-W conventional and ESMF threading: https://github.com/ufs-community/ufs-weather-model/pull/2179

DusanJovic-NOAA commented 1 month ago

I ran the control_c768 test from rt_weekly.conf with the current develop branch (45c8b2a) and it finished successfully on Hera, although it only runs for 3 hours. I'll resubmit the job with nhours_fcst set to 24.

SamuelTrahanNOAA commented 1 month ago

A difference between control_c768 and some of the runs in the workflow is size. A 210-node job on Hera is very big and will expose problems that a few dozen nodes wouldn't. GNU OpenMP can't even function at that size; it fails with 100% reliability. The regular global-workflow runs may be in the 80-140 node range, but that's still a huge job.

SamuelTrahanNOAA commented 1 month ago

I ran my test case without ESMF threading and with the ufs-weather-model modulefiles. It still failed. I'm going to investigate further what the differences are between this test case and my past C768s that have succeeded.

This probably rules out ESMF threading as the cause.

DusanJovic-NOAA commented 1 month ago

I took this run directory, /scratch1/NCEPDEV/stmp2/Kate.Zhang/RUNDIRS/TC768/fcst.424943/, which is a directory used for a run that failed on Hera (see this comment: https://github.com/ufs-community/ufs-weather-model/issues/2227#issuecomment-2042958929). I ran the executable compiled from the current develop and it failed with the same or a very similar error.

However, when I turn off compression (set both ideflate: and zstandard_level: to 0 in model_configure), the model runs for 24h without failing.

SamuelTrahanNOAA commented 1 month ago

However, when I turn off compression (set both ideflate: and zstandard_level: to 0 in model_configure), the model runs for 24h without failing.

My test case with the global-nest-v1 physics uses zstandard_level=4 and still runs to completion. There are other differences too. I'll try to narrow them down and find a commonality.

zhanglikate commented 1 month ago

@junwang-noaa @SamuelTrahanNOAA @DusanJovic-NOAA
I am using Sam's version with ideflate set to 0, as in Dusan's test, and the run did not crash. I am now testing the original UFS version in the April 2 global workflow. I will keep you updated. Thanks.

zhanglikate commented 1 month ago

@junwang-noaa @SamuelTrahanNOAA @DusanJovic-NOAA I am using Sam's version with ideflate set to 0, as in Dusan's test, and the run did not crash. I am now testing the original UFS version in the April 2 global workflow. I will keep you updated. Thanks.

ZSTANDARD_LEVEL is already 0 by default in the global workflow (local ZSTANDARD_LEVEL=0).

zhanglikate commented 1 month ago

@DusanJovic-NOAA @SamuelTrahanNOAA @junwang-noaa @lisa-bengtsson @ChristianBoyer-NOAA @spanNOAA @jkbk2004 @kayeekayee @JessicaMeixner-NOAA After modifying "local IDEFLATE=1" to "local IDEFLATE=0" in ush/parsing_model_configure_FV3.sh of the global workflow, the C768 run worked well on Rocky 8 Hera (April 2 version, commit c54fe98c4fe8d811907366d4ba6ff16347bf174c).

junwang-noaa commented 1 month ago

@DusanJovic-NOAA can you set ideflate to 1 and zstandard_level to 0 (lossless compression only) to see if that works?

I took this run directory, /scratch1/NCEPDEV/stmp2/Kate.Zhang/RUNDIRS/TC768/fcst.424943/, which is a directory used for a run that failed on Hera (see this comment: #2227 (comment)). I ran the executable compiled from the current develop and it failed with the same or a very similar error.

However, when I turn off compression (set both ideflate: and zstandard_level: to 0 in model_configure), the model runs for 24h without failing.

DusanJovic-NOAA commented 1 month ago

@DusanJovic-NOAA can you set ideflate to 1 and zstandard_level to 0 (lossless compression only) to see if that works?

I took this run directory, /scratch1/NCEPDEV/stmp2/Kate.Zhang/RUNDIRS/TC768/fcst.424943/, which is a directory used for a run that failed on Hera (see this comment: #2227 (comment)). I ran the executable compiled from the current develop and it failed with the same or a very similar error. However, when I turn off compression (set both ideflate: and zstandard_level: to 0 in model_configure), the model runs for 24h without failing.

ideflate was initially set to 1 and it didn't work.

SamuelTrahanNOAA commented 1 month ago

Changing ideflate to 0 didn't fix my case. The model segfaulted. I also had to change these lines:

ichunk2d:                3072
jchunk2d:                1536
ichunk3d:                3072
jchunk3d:                1536
kchunk3d:                1

To this:

ichunk2d:                0
jchunk2d:                0
ichunk3d:                0
jchunk3d:                0
kchunk3d:                0

The forecast has gotten past the previous failure point. How far it will go, I do not know.
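
For what it's worth, those 3072 and 1536 values look like the full C768 Gaussian output-grid dimensions, and a whole 3-D field of that size is already bigger than a 32-bit MPI count can describe, the same just-over-2^31-byte ballpark the negative counts above decode to. A rough sketch of that arithmetic, where the level count and 4-byte reals are assumptions for illustration, not values taken from this thread:

```c
/* Back-of-the-envelope sketch: size of one full 3-D output field at the
 * 3072 x 1536 horizontal size quoted above. The level count (128) and 4-byte
 * reals are assumed for illustration only. */
#include <stdio.h>

int main(void) {
    long long ni = 3072, nj = 1536;          /* horizontal sizes quoted above */
    long long nlev = 128;                    /* assumed number of levels      */
    long long bytes = ni * nj * nlev * 4LL;  /* ~2.4e9 bytes                  */
    printf("one 3-D field ~ %lld bytes; INT_MAX = 2147483647\n", bytes);
    /* anything this size handed to a 32-bit MPI count goes negative,
       consistent with the ~2.2e9-byte counts decoded from the error logs */
    return 0;
}
```

Whether that is exactly the buffer behind the failing MPI_Type_contiguous call isn't established here; it just lines up with the sizes involved.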

SamuelTrahanNOAA commented 1 month ago

In my succeeding HAFS-AR nested C768 run, I have these settings:

ichunk2d:                -1
jchunk2d:                -1
ichunk3d:                -1
jchunk3d:                -1
kchunk3d:                -1
ideflate:                0
zstandard_level:         4
nbits:                   0
zhanglikate commented 1 month ago

@junwang-noaa @SamuelTrahanNOAA My test run has gone more than 48 hours (~54 hours) so far. Thanks.