
Much slower timings in model init with the latest ufs-weather-model #801

Closed: EricRogers-NOAA closed this issue 2 years ago

EricRogers-NOAA commented 3 years ago

Today Ben Blake checked out the latest develop branch and ran a short test on WCOSS Dell Phase 3.5 with the 3 km RRFS LAM domain over North America, cold-started from the GFS analysis. He saw an increase in the model initialization time of almost 8 minutes compared to the current parallel LAM run:

LAM parallel: in fcst,init total time: 74.2139439582825 (#ddcd809, checked out 7/30/21)
My test: in fcst,init total time: 524.866119146347 (#e198256, checked out today)

Ben's test run did not run to completion, so no termination times are available. Bin Liu noted similar behavior in HAFS.

climbfuji commented 3 years ago

Hi Eric, a lot of development happened between the two hashes you posted. One way to narrow this down is to use the good old bisect mode. Do you think you have time to do that? I went to https://github.com/ufs-community/ufs-weather-model/commits/develop and searched for ddcd809; everything that was merged since then is listed above it.
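
For reference, a minimal sketch of what such a bisect could look like (the short-test script is a placeholder, and, as discussed further down this thread, submodules have to be re-synced after every checkout):

    git clone https://github.com/ufs-community/ufs-weather-model.git bisect-test
    cd bisect-test
    git bisect start
    git bisect bad  e198256        # slow init
    git bisect good ddcd809        # fast init
    # git now checks out a candidate commit; at each step:
    git submodule update --init --recursive
    ./build.sh && ./run_short_init_test.sh   # placeholder: any short test that times the init
    git bisect good                # or 'git bisect bad', based on the measured init time
    # repeat until git reports the first bad commit, then:
    git bisect reset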

EricRogers-NOAA commented 3 years ago

I take it you mean check out a version, run a short test and see when the slowdown starts? I don't have a lot of time lately because I'm working on the WCOSS2 conversion effort, but I'll try to clear some time for this.

EricRogers-NOAA commented 3 years ago

Timing tests: IC=00z 9/16/21 3 km CONUS LAM domain

1) Control run, #ddcd809 (7/30/21): in fv3_cap, init time= 37.0698819160461
2) #4a2a127 (9/7/2021): in fv3_cap, init time= 410.151222944260
3) #01d70f4 (8/30/2021): in fv3_cap, init time= 422.296640157700
4) #3f3c253 (8/25/2021): in fv3_cap, init time= 420.641896009445
5) #b26a896 (8/23/2021): compile aborted:

    /gpfs/dell6/emc/modeling/noscrub/Eric.Rogers/ufs-weather-model_aug23/FV3/atmos_cubed_sphere/tools/fv_eta.F90(49) : error #6580: Name in only-list does not exist or is not accessible.   [ASCII_READ]
    use fms2_io_mod, only: ascii_read
    ---------------------------^

6) #f7cfebf (8/18/2021): compile aborted, same as above
7) #2258171 (8/13/2021): compile aborted, same as above

Why am I getting these compile aborts? I'm doing this:

    git clone --recursive https://github.com/ufs-community/ufs-weather-model.git ufs-weather-model_mydir
    cd ufs-weather-model_mydir
    git checkout (hash)

then to compile:

    set -x
    . /usrx/local/prod/lmod/lmod/init/sh
    module purge
    module use modulefiles
    module load ufs_wcoss_dell_p3
    export CMAKE_PLATFORM=wcoss_dell_p3
    export CMAKE_FLAGS="-DAPP=ATM -D32BIT=ON -DDEBUG=OFF -DCCPP_SUITES=FV3_GFS_v15_thompson_mynn_lam3km"
    export BUILD_VERBOSE=1
    ./build.sh

DusanJovic-NOAA commented 3 years ago

@EricRogers-NOAA After checking out the exact commit (hash) you want to build, you must update the submodules. Otherwise you'll be using the submodules from the initial clone, which are the submodules used by the current develop branch, not those from the hash you want. So clone like this (note: no --recursive in the git clone):

    git clone https://github.com/ufs-community/ufs-weather-model.git ufs-weather-model_mydir
    cd ufs-weather-model_mydir
    git checkout (hash)
    git submodule update --init --recursive

and then build.

EricRogers-NOAA commented 3 years ago

@DusanJovic-NOAA thank you very much. I always forget that submodule step. I was able to check out the Aug 23 commit now. I'll send out an updated list of init timings later.

EricRogers-NOAA commented 3 years ago

New timing tests, with the correct checkout of earlier commits:

Timing tests: IC=00z 9/16/21 3 km CONUS LAM domain, all warm starts from LAMDA IC

1) Control run, #ddcd809 (7/30/21): in fv3_cap, init time= 37.0698819160461
2) #b26a896 (8/23/2021): in fv3_cap, init time= 39.86402297019963
3) #3f3c253 (8/25/2021): in fv3_cap, init time= 231.806740045547

The 8/25/2021 commit (PR #762) is the cause of the slower init time.
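
For reference, one way to map that commit range to the merged PRs is to list the merge commits between the last good and first bad hashes (a sketch only, using the hashes from the timings above):

    git log --oneline --merges b26a896..3f3c253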

junwang-noaa commented 3 years ago

Thanks, Dusan. @ericaligo-NOAA Thanks for identifying the PR that causes the slowness of the model initialization step.

@bensonr @mlee03 PR #762 is the FMS lib update to 2021.03. Would you please take a look at which code updates in FMS might cause the slowness? Thanks.

bensonr commented 3 years ago

@junwang-noaa - a similar issue has been brought directly to my attention by the HAFS regional team. I know the reason and am trying to verify the resolution will alleviate the situation.

EricRogers-NOAA commented 2 years ago

The latest UFS code was put into the LAM parallels (#805421d) on 11/30/2021. The init time for the RRFS domain went from ~140 sec to almost 30 minutes:

err: in fcst,init total time: 1970.07599091530

JacobCarley-NOAA commented 2 years ago

@junwang-noaa @arunchawla-NOAA Do we have any updates on this? When testing the model on Cray TO4 (Luna/Surge) the model takes over 3000s to initialize.

bensonr commented 2 years ago

@junwang-noaa @arunchawla-NOAA @JacobCarley-NOAA - please try your tests with this version of fv3atm (the emc_io_fixes branch). Make sure to use io_layout=1,1 to test the initialization performance.

This branch also contains a fix for the restart checksum issue you've been wanting removed. The option is controlled separately for the dycore and the physics. For the dycore, one needs to add fv_core_nml::ignore_rst_cksum=.true., and for the physics, atmos_model_nml::ignore_rst_cksum=.true., for use within FV3GFS_io.F90. If you don't want the option to be in atmos_model.F90 but to exist in FV3GFS_io.F90 itself, feel free to reimplement as you see fit.

Once you are satisfied with the results of your testing, please merge the changes into your own branches and add the appropriate PRs.
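
For reference, a minimal input.nml sketch of the two switches described above, together with the io_layout setting mentioned for the initialization test (values are illustrative, and the other variables in these namelists are omitted):

    ! illustrative fragment only; io_layout values are placeholders and
    ! all other variables in these namelists are omitted
    &fv_core_nml
      io_layout = 1,1
      ignore_rst_cksum = .true.
    /
    &atmos_model_nml
      ignore_rst_cksum = .true.
    /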

junwang-noaa commented 2 years ago

@bensonr Thank you very much for making the code changes. @BinLiu-NOAA @EricRogers-NOAA FYI.

EricRogers-NOAA commented 2 years ago

How would I check this out and compile with respect to the full model (https://github.com/ufs-community/ufs-weather-model)? I've always just cloned https://github.com/ufs-community/ufs-weather-model (and maybe checked out a feature branch) and have no experience dealing with a different version of fv3atm or some other submodule. Thanks for your assistance.

junwang-noaa commented 2 years ago

@EricRogers-NOAA I am creating a ufs-weather-model branch from the latest develop branch using Rusty's fv3atm; we can use it for testing. I will let you know when I am done.

bensonr commented 2 years ago

Or simply

junwang-noaa commented 2 years ago

Thanks, Rusty, that will work too. Anyway, @EricRogers-NOAA @BinLiu-NOAA Here is the branch for testing:

https://github.com/junwang-noaa/ufs-weather-model/tree/checksum_io

BinLiu-NOAA commented 2 years ago

Thanks a lot, @bensonr @junwang-noaa! We will test from the HAFS side and report back on how this new branch performs in the model forecast init phase when using an io_layout of (1,1) for both cold-start and warm-start scenarios. We will also test the capability of skipping the checksum step. Thanks!

EricRogers-NOAA commented 2 years ago

@junwang-noaa my compile failed on WCOSS Dell:

    CMake Error at FV3/CMakeLists.txt:21 (message):
      An error occured while running ccpp_prebuild.py, check
      /gpfs/dell6/emc/modeling/noscrub/Eric.Rogers/emc_io_fixes/build/FV3/ccpp_prebuild.{out,err}

I had been using this to compile the code; I take it there have been changes:

    #!/bin/bash
    set -x
    . /usrx/local/prod/lmod/lmod/init/sh
    module purge
    module use modulefiles
    module load ufs_wcoss_dell_p3
    export CMAKE_PLATFORM=wcoss_dell_p3
    export CMAKE_FLAGS="-DAPP=ATM -D32BIT=ON -DDEBUG=ON -DCCPP_SUITES=FV3_GFS_v15_thompson_mynn_lam3km"
    export CMAKE_FLAGS="-DAPP=ATM -D32BIT=ON -DDEBUG=OFF -DCCPP_SUITES=FV3_GFS_v15_thompson_mynn_lam3km"
    export BUILD_VERBOSE=1
    ./build.sh

junwang-noaa commented 2 years ago

I ran the RT tests on Orion; they work. Let me check Dell.

junwang-noaa commented 2 years ago

@EricRogers-NOAA The code compiled on dell. Here is what I did:

    [Jun.Wang@v71a1 ufs-weather-model]$ pwd
    /gpfs/dell1/ptmp/Jun.Wang/ufs-weather-model
    module purge
    module use -a /gpfs/dell1/ptmp/Jun.Wang/ufs-weather-model/modulefiles
    module load ufs_wcoss_dell_p3
    module list
    export CMAKE_FLAGS="-DAPP=ATM -DCCPP_SUITES=FV3_GFS_v15_thompson_mynn_lam3km -D32BIT=ON"
    nohup ./build.sh >xxxcmpl 2>&1 &

BenjaminBlake-NOAA commented 2 years ago

@junwang-noaa I tried compiling your branch using your commands listed above, but I got the same error as Eric. I looked in my build/FV3/ccpp/ccpp_prebuild.err file and I saw the following message at the end:

KeyError: 'rrtmg_sw_pre'

The FV3_GFS_v15_thompson_mynn_lam3km suite file we were using did contain rrtmg_sw_pre, but I see it was replaced by rad_sw_pre in the repository. The XML file we are using is slightly different because it uses the unified GWD scheme. After making that change, the code compiled for me. @EricRogers-NOAA give that a try and see if it works for you (I used your original compile.sh).
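
A one-line sketch of that substitution, assuming a local copy of the suite file (the path and file name here are guesses):

    # hypothetical suite-file path; adjust to wherever your copy of the XML lives
    sed -i 's/rrtmg_sw_pre/rad_sw_pre/g' FV3/ccpp/suites/suite_FV3_GFS_v15_thompson_mynn_lam3km.xml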

EricRogers-NOAA commented 2 years ago

I've got the new code running on WCOSS Dell P3; I saw this print:

    Computing rain collecting graupel table took 226.539 seconds.
    creating rain collecting snow table
    Computing rain collecting snow table took 31.857 seconds.

Are there new tables we need to read in that will eliminate the above computations and reduce run time?

The run subsequently aborted a few minutes after the above print.

EricRogers-NOAA commented 2 years ago

One of the ESMF debug prints had this in the failed run of the new code:

    20220505 184831.351 ERROR PET1875 src/addon/NUOPC/src/NUOPC_Base.F90:2101 Invalid argument - inst_tracer_diag_aod is not a StandardName in the NUOPC_FieldDictionary!
    20220505 184831.351 ERROR PET1875 src/addon/NUOPC/src/NUOPC_Base.F90:480 Invalid argument - Passing error in return code
    20220505 184831.351 ERROR PET1875 module_fcst_grid_comp.F90:410 Invalid argument - Passing error in return code
    20220505 184831.354 ERROR PET1875 module_fcst_grid_comp.F90:1079 Invalid argument - Passing error in return code
    20220505 184831.354 ERROR PET1875 fv3_cap.F90:888 Invalid argument - Passing error in return code
    20220505 184831.354 ERROR PET1875 ATM:src/addon/NUOPC/src/NUOPC_ModelBase.F90:700 Invalid argument - Passing error in return code
    20220505 184831.354 ERROR PET1875 EARTH Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:2577 Invalid argument - Phase 'IPDvXp01' Initialize for modelComp 1: ATM did not return ESMF_SUCCESS
    20220505 184831.354 ERROR PET1875 EARTH Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:1286 Invalid argument - Passing error in return code
    20220505 184831.354 ERROR PET1875 EARTH Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:457 Invalid argument - Passing error in return code
    20220505 184831.354 ERROR PET1875 UFS.F90:386 Invalid argument - Aborting UFS

junwang-noaa commented 2 years ago

@EricRogers-NOAA Please update the fd_nems.yaml file in the run directory from the latest develop branch:

https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/parm/fd_nems.yaml
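
A minimal sketch of refreshing that file in a run directory (the raw-file URL and $RUNDIR are assumptions):

    # $RUNDIR is a placeholder for the forecast run directory
    cd $RUNDIR
    wget -O fd_nems.yaml https://raw.githubusercontent.com/ufs-community/ufs-weather-model/develop/tests/parm/fd_nems.yaml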

ericaligo-NOAA commented 2 years ago

I'm not using the latest code, so I don't know if new tables have been added.  Ruiyu, do you know if the latest ufs-weather-model code requires new tables to be read in by the Thompson scheme?


RuiyuSun commented 2 years ago

@ericaligo-NOAA I am not aware of any new tables required. If the tables are being created at the initial time, that probably means the existing table files are not copied to the run directory.

SMoorthi-emc commented 2 years ago

Ruiyu, Where are these precomputed tables? I also saw them being created while running.


RuiyuSun commented 2 years ago

You can use the tables created in your current/previous experiment for your future experiments. I didn't find them in the current ufs_weather_model; they need to be added (@yangfanglin): qr_acr_qgV2.dat, qr_acr_qsV2.dat, freezeH2O.dat.

SMoorthi-emc commented 2 years ago

Ruiyu, do they depend on resolution?


RuiyuSun commented 2 years ago

I don't think so


JiliDong-NOAA commented 2 years ago

You can use the tables created in your current/previous experiment for your future experiments. I didn't find them in the current ufs_weather_model. They need to be added @yangfanglin. qr_acr_qgV2.dat, qr_acr_qsV2.dat, freezeH2O.dat

I believe they are in the fix file directory:

Hera: /scratch1/NCEPDEV/global/glopara/fix/fix_am/

Jet: /lfs4/HFIP/hfv3gfs/glopara/git/fv3gfs/fix/fix_am/

Orion: /work/noaa/global/glopara/fix/fix_am/

DeniseWorthen commented 2 years ago

For UWM regression testing, the files freezeH2O.dat, qr_acr_qgV2.dat, and qr_acr_qsV2.dat are located in the input-data directory under FV3_fix:

/scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/input-data-20220414/FV3_fix
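
For a canned case, a minimal shell sketch of staging these tables into the run directory so the model does not rebuild them at init ($RUNDIR is a placeholder; the fix path is the Hera example above):

    # link the precomputed Thompson tables into the run directory
    FIX_DIR=/scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/input-data-20220414/FV3_fix
    cd $RUNDIR
    ln -sf $FIX_DIR/freezeH2O.dat .
    ln -sf $FIX_DIR/qr_acr_qgV2.dat .
    ln -sf $FIX_DIR/qr_acr_qsV2.dat .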

EricRogers-NOAA commented 2 years ago

Thanks all for your comments concerning the tables. This was my mistake: in the real-time LAM parallels we do use the pre-computed qr_acr_qgV2.dat and qr_acr_qsV2.dat files, but for my testing of the new code with Rusty's I/O changes I was running a canned case, and I forgot to add the "V2" files as input for this canned case. Sorry for the inconvenience.

EricRogers-NOAA commented 2 years ago

@junwang-noaa Ben and I got a successful 12-h forecast of the 3 km RRFS NA domain with your code with Rusty's I/O changes. Here are the fcst_initialize times with io_layout=1,1:

Cold start (on Dell P3.5): fcst_initialize total time: 73.7074041366577
Warm start (on Dell P3): fcst_initialize total time: 365.096565008163

With the current code run in the LAM parallels, fcst_initialize time for warm starts with io_layout=1,1 was about one hour.

A rerun of the warm start case with io_layout=1,1 on Dell P3 this afternoon gave a much lower fcst_initialize time: fcst_initialize total time: 87.6470549106598

One warm start test with io_layout=1,15: fcst_initialize total time: 187.195601940155. This is roughly the same as seen in the 3 km RRFS DA parallel warm start forecasts.

yangfanglin commented 2 years ago

They are available in the shared/standard/common "FIX" directory. They do not depend on resolution.

HERA: /scratch1/NCEPDEV/global/glopara/fix_NEW/fix_am
WCOSS: /gpfs/dell2/emc/modeling/noscrub/emc.glopara/git/fv3gfs/fix_NEW/fix_am
ORION: /work/noaa/global/glopara/fix_NEW/fix_am/

For the global model, links are included in the forecast script exglobal_forecast.sh:

    $NLN $FIX_AM/CCN_ACTIVATE.BIN $DATA/.
    $NLN $FIX_AM/freezeH2O.dat $DATA/.
    $NLN $FIX_AM/qr_acr_qg.dat $DATA/.
    $NLN $FIX_AM/qr_acr_qs.dat $DATA/.

BinLiu-NOAA commented 2 years ago

@junwang-noaa and @bensonr, I got a chance to test a HAFS single 3-km domain warm-start run on Orion. With io_layout=1,1, I got fcst_initialize total times of 254.639430604875 and 297.894324436784 for two consecutive forecast cycles, which is much faster than previously. With this speed-up, it is now even faster than using io_layout=1,10 with the 20220412 develop version of ufs-weather-model (which took a fcst_initialize total time of 637.429518222809).

So, thanks again for fixing this slow-down issue! With that, we most likely will not need the mppnccombine workaround step anymore.

Next, I will test the namelist options to skip the checksum step for the warm-restart files (after GSI/DA updates) and report back to this thread.

BinLiu-NOAA commented 2 years ago

@bensonr and @junwang-noaa, just a quick follow-up: while testing the fv_core_nml::ignore_rst_cksum=.true. and atmos_model_nml::ignore_rst_cksum=.true. options to skip the checksum for the restart files, I got a forecast failure with this error:

    FATAL from PE 974: check_nml_error in fms_mod: Unknown namelist, or mistyped namelist variable in namelist atmos_model_nml, (IOSTAT = 19 )

However, if I only keep fv_core_nml::ignore_rst_cksum=.true., the forecast goes through. Is it possible I misconfigured something? I'd appreciate it if you could take a look when you get a chance. Thanks!

Bin

bensonr commented 2 years ago

@BinLiu-NOAA - please make sure you are using the emc_io_fixes branch for fv3atm. I see the namelist variable ignore_rst_cksum in atmos_model.F90 and it doesn't look like there are any typos.

BinLiu-NOAA commented 2 years ago

@bensonr and @junwang-noaa, I made sure to use your emc_io_fixes branch for fv3atm. Another question: does this ignore_rst_cksum=.true. option need a newer version than fms/2022.01 to work properly? My current test uses fms/2022.01.

Meanwhile, wondering if anyone else can test this ignore_rst_cksum option from his/her side.

Thanks!

Bin

bensonr commented 2 years ago

@BinLiu-NOAA - Your test for dycore restarts was successful with the current v2022.01 library, so the library and logic in the dycore are not an issue. The error you are encountering is related to the atmos_model_nml namelist in the input.nml or the specific namelist definition in atmos_model.F90. I urge you to check both the source file and your input.nml atmos_model_nml entry (or whatever builds the atmos_model_nml in the UFS) to ensure all spellings of ignore_rst_cksum are correct or at least consistent.

BinLiu-NOAA commented 2 years ago

@bensonr and @junwang-noaa, I think I might have found the issue now. This is because the ignore_rst_cksum option is only added here in FV3/atmos_model.F90:

logical :: ignore_rst_cksum = .false.
namelist /atmos_model_nml/ blocksize, chksum_debug, dycore_only, debug, sync, ccpp_suite, avg_max_length, &
                           ignore_rst_cksum

However, it was not updated here in FV3/atmos_cubed_sphere/driver/fvGFS/atmosphere.F90:

    namelist /atmos_model_nml/ blocksize, chksum_debug, dycore_only, debug, sync, fdiag, fhmax, fhmaxhf, fhout, fhouthf, ccpp_suite, avg_max_length

I believe once we add ignore_rst_cksum in FV3/atmos_cubed_sphere/driver/fvGFS/atmosphere.F90, the ignore_rst_cksum option will work properly under the atmos_model_nml section as well.

Thanks!
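
A sketch of what that addition to atmosphere.F90 could look like, mirroring the atmos_model.F90 lines quoted above (this is not the actual commit; the variable list simply follows the existing declaration):

    ! sketch only: give atmosphere.F90's copy of atmos_model_nml the same switch
    logical :: ignore_rst_cksum = .false.
    namelist /atmos_model_nml/ blocksize, chksum_debug, dycore_only, debug, sync, &
                               fdiag, fhmax, fhmaxhf, fhout, fhouthf, ccpp_suite, &
                               avg_max_length, ignore_rst_cksum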

bensonr commented 2 years ago

@BinLiu-NOAA - Feel free to merge my work into your own work and make the fix. But please remind me again why the UFS has two different modules from two different repositories reading the same namelist?

BinLiu-NOAA commented 2 years ago


@bensonr, I saw some notes in FV3/atmos_cubed_sphere/driver/fvGFS/atmosphere.F90, and from those it looks to me like this is a temporary workaround for something. It would be nice if this could be fixed in the future. @climbfuji and @junwang-noaa might know more background/context. Thanks!

junwang-noaa commented 2 years ago

If I remember correctly, it is due to the calling sequence of the CCPP physics subroutines. @climbfuji can correct me if this is not the case.

BinLiu-NOAA commented 2 years ago

Thanks @junwang-noaa for the background!

Also, @bensonr, a quick follow-up: after adding the ignore_rst_cksum item in FV3/atmos_cubed_sphere/driver/fvGFS/atmosphere.F90, the ignore_rst_cksum option worked fine for my HAFS tests. @EricRogers-NOAA and @BenjaminBlake-NOAA also tested from the RRFS side and confirmed it worked as well.

Thanks again for the FMS2IO speed-up as well as for the option to ignore the checksum for restart files! With that, we no longer need the mppnccombine workaround in the HAFS application/workflow, and we can also avoid the ncatted command to delete the checksum attribute from the analysis-updated restart files. Much appreciated! - Bin

junwang-noaa commented 2 years ago

@bensonr I did some testing with a gdas case using the GFSv16 gdas restart files with checksums in the ICs, and I still got the error when reading the restart files with ignore_rst_cksum=.true.:

    0: in atmosphere bf fv_restart, ignore_rst_cksum= T
    0: in fv_restart ncnst= 9
    0: FV_RESTART: 1 T F
    0: Warm starting, calling fv_io_restart
    0: ptop & ks 0.9990000 39
    0:
    0: FATAL from PE 0: The checksum in the file:INPUT/fv_core.res.tile1.nc and variable:u does not match the checksum calculated from the data. file:D0AC0578D35B44A from data:FFF2C0BC98877DB6

Do you have any idea what I might have missed? Thanks.

BinLiu-NOAA commented 2 years ago

@junwang-noaa, you might want to double check that you have ignore_rst_cksum in both the fv_core_nml and atmos_model_nml sections (for the dycore, fv_core_nml::ignore_rst_cksum=.true., and for the physics, atmos_model_nml::ignore_rst_cksum=.true.).

EricRogers-NOAA commented 2 years ago

@BinLiu-NOAA @BenjaminBlake-NOAA @junwang-noaa: I'm testing the code in the RRFS 3 km North American DA run, and I see a significant slowdown in GSI analysis times when the model is run with io_layout=1,1 and the input first guess to the GSI consists of model restart files:

New code:
regional_gsianl_tm03_12.log:The total amount of wall time = 1754.419832
regional_gsianl_tm04_12.log:The total amount of wall time = 2306.498491
regional_gsianl_tm05_12.log:The total amount of wall time = 2135.666786
Old code:
regional_gsianl_tm03_12.log:The total amount of wall time = 1421.840473
regional_gsianl_tm04_12.log:The total amount of wall time = 1472.981763
regional_gsianl_tm05_12.log:The total amount of wall time = 1433.479971

I believe this is a chunk size issue with the model restart files. When we run the current model code in the RRFS DA parallels with io_layout=1,15, we use the mppnccombine utility to combine the 15 restart file pieces back into one file, using the "-64" option (netcdf-3 64-bit offset, for which chunk sizes are not used):

    echo "$EXECfv3/mppnccombine -v -64 fv_core.res.tile1.nc" > cmdfile
    echo "$EXECfv3/mppnccombine -v -64 fv_tracer.res.tile1.nc" >> cmdfile

But with the new code we can run with io_layout=1,1, so the netCDF-4 restart files written directly out of the model are used in the GSI, and they carry the default chunk sizes (the file dimensions are 3950x2701x65):

    float ua(Time, zaxis_1, yaxis_2, xaxis_1) ;
            ua:checksum = " 18F05251E151753" ;
            ua:_Storage = "chunked" ;
            ua:_ChunkSizes = 1, 6, 300, 439 ;
            ua:_Endianness = "little" ;

With these chunk sizes, GSI timings with its parallel I/O are degraded.
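
One possible workaround, not something tried in this thread, would be to rewrite the restart files with larger chunks before handing them to the GSI; a hedged sketch with the standard netCDF utilities (dimension names and sizes taken from the ncdump output above):

    # inspect the current storage and chunking (-s prints the _Storage/_ChunkSizes attributes)
    ncdump -hs fv_core.res.tile1.nc | grep -E '_Storage|_ChunkSizes'
    # rewrite with one full horizontal slab per chunk
    nccopy -c Time/1,zaxis_1/1,yaxis_2/2701,xaxis_1/3950 fv_core.res.tile1.nc fv_core_rechunked.nc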

junwang-noaa commented 2 years ago

@BinLiu-NOAA Thanks for the suggestion; the test ran through. Here are some timings:

All testing used io_layout = 1,1.
1) From the develop branch, the c384gdas test: 0: in fv3_cap, init time= 68.9708864763379
2) From Rusty's fv3atm branch: 0: in fv3_cap, init time= 20.7666180133820
3) From Rusty's fv3atm branch with ignore_rst_cksum=.true.: 0: in fv3_cap, init time= 16.9400908946991

junwang-noaa commented 2 years ago

@EricRogers-NOAA So the only change you made is the io_layout change? Also, I am curious: did you specify the chunk sizes for the restart files? I think you can still use io_layout=1,15 if that helps with the total run time.

My understanding is that using more I/O tasks in io_layout will speed up reading the restart files.