Closed by EricRogers-NOAA 2 years ago
Hi Eric, a lot of development happened between the two hashes you posted. One way to narrow this down is to use the good old bisect mode. Do you think you have time to do that? I went to https://github.com/ufs-community/ufs-weather-model/commits/develop and searched for ddcd809; everything merged since then is listed above it.
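For reference, the bisect workflow suggested here can even be automated with `git bisect run` when the good/bad check is scriptable. The sketch below demonstrates the mechanics on a throwaway toy repo (all names and paths are hypothetical); in the real case, the run script would update submodules, build, run a short forecast, and time the init step instead of grepping a file.

```shell
set -e
# Build a toy repo of 10 commits where commit 7 introduces a "slow" marker,
# then let git bisect find the first bad commit automatically.
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
git config user.email you@example.com
git config user.name you
for i in 1 2 3 4 5 6 7 8 9 10; do
  if [ "$i" -ge 7 ]; then echo "slow $i" > flag; else echo "fast $i" > flag; fi
  git add flag
  git commit -qm "commit $i"
done
# bad = HEAD (slow), good = root commit (fast)
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)"
# The run script's exit code marks each checkout: 0 = good, nonzero = bad.
# In the real case this would build the model and time initialization.
git bisect run sh -c '! grep -q slow flag' >/dev/null 2>&1
# refs/bisect/bad now points at the first bad commit
git show -s --format=%s refs/bisect/bad
git bisect reset >/dev/null
```

With roughly 30 commits between the two hashes, bisection needs only about five build-and-test cycles instead of thirty.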
I take it you mean check out a version, run a short test and see when the slowdown starts? I don't have a lot of time lately because I'm working on the WCOSS2 conversion effort, but I'll try to clear some time for this.
Timing tests: IC=00z 9/16/21 3 km CONUS LAM domain
1) Control run, #ddcd809 (7/30/21): in fv3_cap, init time= 37.0698819160461
2) #4a2a127 (9/7/2021): in fv3_cap, init time= 410.151222944260
3) #01d70f4 (8/30/2021): in fv3_cap, init time= 422.296640157700
4) #3f3c253 (8/25/2021): in fv3_cap, init time= 420.641896009445
5) #b26a896 (8/23/2021): compile aborted:
/gpfs/dell6/emc/modeling/noscrub/Eric.Rogers/ufs-weather-model_aug23/FV3/atmos_cubed_sphere/tools/fv_eta.F90(49) : error #6580: Name in only-list does not exist or is not accessible.   [ASCII_READ]
use fms2_io_mod, only: ascii_read
---------------------------^
6) #f7cfebf (8/18/2021): compile aborted, same as above
7) #2258171 (8/13/2021): compile aborted, same as above
Why am I getting these compile aborts? I'm doing this:
git clone --recursive https://github.com/ufs-community/ufs-weather-model.git ufs-weather-model_mydir
cd ufs-weather-model_mydir
git checkout (hash)
then to compile:
set -x
. /usrx/local/prod/lmod/lmod/init/sh
module purge
module use modulefiles
module load ufs_wcoss_dell_p3
export CMAKE_PLATFORM=wcoss_dell_p3
export CMAKE_FLAGS="-DAPP=ATM -D32BIT=ON -DDEBUG=OFF -DCCPP_SUITES=FV3_GFS_v15_thompson_mynn_lam3km"
export BUILD_VERBOSE=1
./build.sh
@EricRogers-NOAA After checking out the exact commit (hash) you want to build, you must update the submodules. Otherwise you'll be using the submodules from the initial clone, which are the submodules used by the current develop branch, not the submodules from the hash you want. So clone like this (note: no --recursive in the git clone):
git clone https://github.com/ufs-community/ufs-weather-model.git ufs-weather-model_mydir
cd ufs-weather-model_mydir
git checkout (hash)
git submodule update --init --recursive
and then build.
@DusanJovic-NOAA thank you very much. I always forget that submodule step. I was able to check out the Aug 23 commit now. I'll be sending out an updated list of init timings later.
New timing tests, with the correct checkout of earlier commits:
Timing tests: IC=00z 9/16/21 3 km CONUS LAM domain, all warm starts from LAMDA IC
1) Control run, #ddcd809 (7/30/21): in fv3_cap, init time= 37.0698819160461
2) #b26a896 (8/23/2021): in fv3_cap, init time= 39.86402297019963
3) #3f3c253 (8/25/2021): in fv3_cap, init time= 231.806740045547
The 8/25/2021 commit #762 is the cause of the slower init time.
Thanks, Dusan. @ericaligo-NOAA Thanks for identifying the PR that causes the slowness of the model initialization step.
@bensonr @mlee03 PR #762 is the FMS lib update to 2021.03. Would you please take a look at what code updates in FMS might cause the slowness? Thanks.
@junwang-noaa - a similar issue has been brought directly to my attention by the HAFS regional team. I know the reason and am trying to verify the resolution will alleviate the situation.
Latest UFS code put into LAM parallels (#805421d) on 11/30/2021. Init time for the RRFS domain went from ~140 sec to almost 30 minutes:
err: in fcst,init total time: 1970.07599091530
@junwang-noaa @arunchawla-NOAA Do we have any updates on this? When testing the model on Cray TO4 (Luna/Surge) the model takes over 3000s to initialize.
@junwang-noaa @arunchawla-NOAA @JacobCarley-NOAA - please try your tests with this version of fv3atm. Make sure to use io_layout=1,1 to test the initialization performance.
This branch also contains a fix for the restart checksum issue you've been wanting removed. The option is controlled separately for the dycore and the physics. In the dycore, one needs to add fv_core_nml::ignore_rst_cksum=.true., and for the physics, atmos_model_nml::ignore_rst_cksum=.true. for use within FV3GFS_io.F90. If you don't want the option to be in atmos_model.F90 but exist in FV3GFS_io.F90 itself, feel free to reimplement as you see fit.
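For reference, the two settings described above would appear in input.nml roughly as follows (a sketch; only the relevant variables are shown, and io_layout is included since it is the focus of the performance testing):

```
&fv_core_nml
  io_layout = 1, 1            ! single restart file per tile (vs. e.g. 1,15 split pieces)
  ignore_rst_cksum = .true.   ! dycore: skip restart checksum verification
/

&atmos_model_nml
  ignore_rst_cksum = .true.   ! physics (FV3GFS_io.F90): skip restart checksum verification
/
```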
Once you are satisfied with the results of your testing, please merge the changes into your own branches and add the appropriate PRs.
@bensonr Thank you very much for making the code changes. @BinLiu-NOAA @EricRogers-NOAA FYI.
How would I check this out and compile with respect to the full model (https://github.com/ufs-community/ufs-weather-model)? I've always just cloned https://github.com/ufs-community/ufs-weather-model (and maybe checked out a feature branch), and I have no experience dealing with a different version of fv3atm or some other submodule. Thanks for your assistance.
@EricRogers-NOAA I am creating a ufs-weather-model branch from the latest develop branch using Rusty's fv3atm, we can use it for testing. I will let you know when I am done.
Or simply
Thanks, Rusty, that will work too. Anyway, @EricRogers-NOAA @BinLiu-NOAA Here is the branch for testing:
https://github.com/junwang-noaa/ufs-weather-model/tree/checksum_io
Thanks a lot, @bensonr @junwang-noaa! We will test from the HAFS side and report back on how this new branch performs for the model forecast init phase when using an io_layout of (1x1) for both cold-start and warm-start scenarios. We will also test the capability of skipping the checksum step. Thanks!
@junwang-noaa my compile failed on WCOSS Dell:
CMake Error at FV3/CMakeLists.txt:21 (message):
  An error occured while running ccpp_prebuild.py, check
  /gpfs/dell6/emc/modeling/noscrub/Eric.Rogers/emc_io_fixes/build/FV3/ccpp_prebuild.{out,err}
I had been using this to compile the code, I take it there have been changes:
set -x
. /usrx/local/prod/lmod/lmod/init/sh
module purge
module use modulefiles
module load ufs_wcoss_dell_p3
export CMAKE_PLATFORM=wcoss_dell_p3
export CMAKE_FLAGS="-DAPP=ATM -D32BIT=ON -DDEBUG=OFF -DCCPP_SUITES=FV3_GFS_v15_thompson_mynn_lam3km"
export BUILD_VERBOSE=1
./build.sh
I ran the RT tests on Orion and they work. Let me check Dell.
@EricRogers-NOAA The code compiled on Dell. Here is what I did:
[Jun.Wang@v71a1 ufs-weather-model]$ pwd
/gpfs/dell1/ptmp/Jun.Wang/ufs-weather-model
module purge
module use -a /gpfs/dell1/ptmp/Jun.Wang/ufs-weather-model/modulefiles
module load ufs_wcoss_dell_p3
module list
export CMAKE_FLAGS="-DAPP=ATM -DCCPP_SUITES=FV3_GFS_v15_thompson_mynn_lam3km -D32BIT=ON"
nohup ./build.sh >xxxcmpl 2>&1 &
@junwang-noaa I tried compiling your branch using your commands listed above, but I got the same error as Eric. I looked in my build/FV3/ccpp/ccpp_prebuild.err file and I saw the following message at the end:
KeyError: 'rrtmg_sw_pre'
The FV3_GFS_v15_thompson_mynn_lam3km suite file we were using did contain rrtmg_sw_pre, but I see it was replaced by rad_sw_pre in the repository. The xml file we are using is slightly different because it uses the unified GWD scheme. After making that change, the code compiled for me. @EricRogers-NOAA give that a try and see if it works for you (I used your original compile.sh)
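For reference, the edit described above amounts to renaming one scheme entry in the suite definition file (a sketch; the surrounding XML of the FV3_GFS_v15_thompson_mynn_lam3km suite file is omitted):

```
<!-- before -->
<scheme>rrtmg_sw_pre</scheme>

<!-- after -->
<scheme>rad_sw_pre</scheme>
```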
I've got the new code running on WCOSS Dell P3; I saw this print:
Computing rain collecting graupel table took 226.539 seconds.
creating rain collecting snow table
Computing rain collecting snow table took 31.857 seconds.
Are there new tables we need to read in that will eliminate the above computations and reduce run time?
The run subsequently aborted a few minutes after the above print.
One of the ESMF debug prints had this in the failed run of the new code:
20220505 184831.351 ERROR PET1875 src/addon/NUOPC/src/NUOPC_Base.F90:2101 Invalid argument - inst_tracer_diag_aod is not a StandardName in the NUOPC_FieldDictionary!
20220505 184831.351 ERROR PET1875 src/addon/NUOPC/src/NUOPC_Base.F90:480 Invalid argument - Passing error in return code
20220505 184831.351 ERROR PET1875 module_fcst_grid_comp.F90:410 Invalid argument - Passing error in return code
20220505 184831.354 ERROR PET1875 module_fcst_grid_comp.F90:1079 Invalid argument - Passing error in return code
20220505 184831.354 ERROR PET1875 fv3_cap.F90:888 Invalid argument - Passing error in return code
20220505 184831.354 ERROR PET1875 ATM:src/addon/NUOPC/src/NUOPC_ModelBase.F90:700 Invalid argument - Passing error in return code
20220505 184831.354 ERROR PET1875 EARTH Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:2577 Invalid argument - Phase 'IPDvXp01' Initialize for modelComp 1: ATM did not return ESMF_SUCCESS
20220505 184831.354 ERROR PET1875 EARTH Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:1286 Invalid argument - Passing error in return code
20220505 184831.354 ERROR PET1875 EARTH Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:457 Invalid argument - Passing error in return code
20220505 184831.354 ERROR PET1875 UFS.F90:386 Invalid argument - Aborting UFS
@EricRogers-NOAA Please update the fd_nems.yaml file in the run directory from the latest develop branch:
https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/parm/fd_nems.yaml
I'm not using the latest code, so I don't know if new tables have been added. Ruiyu, do you know if the latest ufs-weather-model code requires new tables to be read in by the Thompson scheme?
@ericaligo-NOAA I am not aware of any new tables required. If the tables are being created at the initial time, that probably means the existing table files are not copied to the run directory.
Ruiyu, where are these precomputed tables? I also saw them being created while running.
You can use the tables created in your current/previous experiment for your future experiments. I didn't find them in the current ufs_weather_model. They need to be added, @yangfanglin: qr_acr_qgV2.dat, qr_acr_qsV2.dat, freezeH2O.dat
Ruiyu, do they depend on resolution?
I don't think so.
I believe they are in the fix file directory:
Hera: /scratch1/NCEPDEV/global/glopara/fix/fix_am/
Jet: /lfs4/HFIP/hfv3gfs/glopara/git/fv3gfs/fix/fix_am/
Orion: /work/noaa/global/glopara/fix/fix_am/
For UWM regression testing, the files freezeH2O.dat, qr_acr_qgV2.dat and qr_acr_qsV2.dat are located in the input-data directory in FV3_fix:
/scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/input-data-20220414/FV3_fix
Thanks all for your comments concerning the tables. This was my mistake; in the real-time LAM parallels we do use the pre-computed qr_acr_qgV2.dat and qr_acr_qsV2.dat files. But for my testing of the new code w/Rusty's I/O changes I was running a canned case, and I forgot to add the "V2" files as input for this canned case. Sorry for the inconvenience.
@junwang-noaa Ben and I got a successful 12-h forecast of the 3km RRFS NA domain with your code with Rusty's I/O changes. Here are the fcst_initialize times with io_layout=1,1:
Cold start (on Dell P3.5): fcst_initialize total time: 73.7074041366577
Warm start (on Dell P3): fcst_initialize total time: 365.096565008163
With the current code run in the LAM parallels, fcst_initialize time for warm starts with io_layout=1,1 was about one hour.
A rerun of the warm start case with io_layout=1,1 on Dell P3 this afternoon gave a much lower fcst_initialize time: fcst_initialize total time: 87.6470549106598
One warm start test, with io_layout=1,15 : fcst_initialize total time: 187.195601940155. This is roughly the same as seen in the 3 km RRFS DA parallel warm start forecasts.
They are available in the shared/standard/common "FIX" directory. They do not depend on resolution.
HERA: /scratch1/NCEPDEV/global/glopara/fix_NEW/fix_am
WCOSS: /gpfs/dell2/emc/modeling/noscrub/emc.glopara/git/fv3gfs/fix_NEW/fix_am
ORION: /work/noaa/global/glopara/fix_NEW/fix_am/
For the global model, there are links included in the forecast script exglobal_forecast.sh:
$NLN $FIX_AM/CCN_ACTIVATE.BIN $DATA/.
$NLN $FIX_AM/freezeH2O.dat $DATA/.
$NLN $FIX_AM/qr_acr_qg.dat $DATA/.
$NLN $FIX_AM/qr_acr_qs.dat $DATA/.
@junwang-noaa and @bensonr, I got a chance to test a HAFS single 3-km domain warm-start run on Orion. With io_layout=1,1, I got:
fcst_initialize total time: 254.639430604875
fcst_initialize total time: 297.894324436784
for two consecutive forecast cycles, which is much faster than previously. With this speed-up, it is now even faster than using io_layout=1,10 with the 20220412 develop version of ufs-weather-model (which took fcst_initialize total time: 637.429518222809).
So, thanks again for fixing this slow-down issue! With that, we most likely will not need the mppnccombine workaround step anymore.
Next, I will test the namelist options to skip the checksum step for the warm-restart files (after GSI/DA updates) and report back to this thread.
@bensonr and @junwang-noaa, just a quick follow-up: while testing the fv_core_nml::ignore_rst_cksum=.true. and atmos_model_nml::ignore_rst_cksum=.true. options to skip the checksum for the restart files, I got a forecast failure with this error:
FATAL from PE 974: check_nml_error in fms_mod: Unknown namelist, or mistyped namelist variable in namelist atmos_model_nml, (IOSTAT = 19 )
However, if I only keep fv_core_nml::ignore_rst_cksum=.true., the forecast can go through. By any chance, did I somehow misconfigure something? I'd appreciate it if you can help take a look when you get a chance. Thanks!
Bin
@BinLiu-NOAA - please make sure you are using the emc_io_fixes branch for fv3atm. I see the namelist variable ignore_rst_cksum in atmos_model.F90 and it doesn't look like there are any typos.
@bensonr and @junwang-noaa, I made sure to use your emc_io_fixes branch for fv3atm. Another question: does this ignore_rst_cksum=.true. option need a newer version than fms/2022.01 to work properly? My current test uses fms/2022.01.
Meanwhile, I'm wondering if anyone else can test this ignore_rst_cksum option on their side.
Thanks!
Bin
@BinLiu-NOAA - Your test for dycore restarts was successful with the current v2022.01 library, so the library and logic in the dycore are not an issue. The error you are encountering is related to the atmos_model_nml namelist in the input.nml or the specific namelist definition in atmos_model.F90. I urge you to check both the source file and your input.nml atmos_model_nml entry (or whatever builds the atmos_model_nml in the UFS) to ensure all spellings of ignore_rst_cksum are correct or at least consistent.
@bensonr and @junwang-noaa, I think I might have found the issue now. This is because the ignore_rst_cksum option is only added here in FV3/atmos_model.F90:
logical :: ignore_rst_cksum = .false.
namelist /atmos_model_nml/ blocksize, chksum_debug, dycore_only, debug, sync, ccpp_suite, avg_max_length, &
ignore_rst_cksum
However, it was not updated here in FV3/atmos_cubed_sphere/driver/fvGFS/atmosphere.F90:
namelist /atmos_model_nml/ blocksize, chksum_debug, dycore_only, debug, sync, fdiag, fhmax, fhmaxhf, fhout, fhouthf, ccpp_suite, avg_max_length
I believe once we add ignore_rst_cksum in FV3/atmos_cubed_sphere/driver/fvGFS/atmosphere.F90, the ignore_rst_cksum option will work properly under the atmos_model_nml section as well.
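Concretely, the proposed fix would extend the namelist declaration quoted above with the new variable (a sketch; the line continuations are illustrative):

```
namelist /atmos_model_nml/ blocksize, chksum_debug, dycore_only, debug, sync,  &
                           fdiag, fhmax, fhmaxhf, fhout, fhouthf, ccpp_suite,  &
                           avg_max_length, ignore_rst_cksum
```

With the variable declared (and defaulted to .false.) in atmosphere.F90 as well, both readers of atmos_model_nml would accept the entry, and check_nml_error would no longer flag it as unknown.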
Thanks!
@BinLiu-NOAA - Feel free to merge my work into your own work and make the fix. But please remind me again why the UFS has two different modules from two different repositories reading the same namelist?
@bensonr, I saw some notes in FV3/atmos_cubed_sphere/driver/fvGFS/atmosphere.F90, from which it looks to me like this is a temporary workaround for something, and it would be nice if it could be fixed in the future. @climbfuji and @junwang-noaa might know more background/context for this. Thanks!
If I remember correctly, it is due to the calling sequence of the CCPP physics subroutines. @climbfuji can correct me if this is not the case.
Thanks @junwang-noaa for the background!
Also, @bensonr, a quick follow-up: after adding the ignore_rst_cksum item in FV3/atmos_cubed_sphere/driver/fvGFS/atmosphere.F90, the ignore_rst_cksum option worked fine for my HAFS tests. @EricRogers-NOAA and @BenjaminBlake-NOAA also tested from the RRFS side and confirmed it worked as well.
Thanks again for speeding up FMS2 I/O as well as for the option to ignore the checksum for restart files! With that, we no longer need the mppnccombine workaround in the HAFS application/workflow, and we can also avoid the ncatted command to delete the checksum attribute from the analysis-updated restart files. Much appreciated! - Bin
@bensonr I did some testing with a gdas case using GFSv16 gdas restart files with checksums in the ICs, and I still got the error when reading the restart files with ignore_rst_cksum=.true.:
0: in atmosphere bf fv_restart, ignore_rst_cksum= T
0: in fv_restart ncnst= 9
0: FV_RESTART: 1 T F
0: Warm starting, calling fv_io_restart
0: ptop & ks 0.9990000 39
0:
0: FATAL from PE 0: The checksum in the file:INPUT/fv_core.res.tile1.nc and variable:u does not match the checksum calculated from the data. file:D0AC0578D35B44A from data:FFF2C0BC98877DB6
Do you have any idea what I might have missed? Thanks
@junwang-noaa, you might want to double check if you have ignore_rst_cksum in both fv_core_nml and atmos_model_nml sections. (for dycore fv_core_nml::ignore_rst_cksum=.true. and for the physics, atmos_model_nml::ignore_rst_cksum=.true.)
@BinLiu-NOAA @BenjaminBlake-NOAA @junwang-noaa : I'm testing the code in the RRFS 3 km North American DA run, and I see a significant slowdown in GSI analysis times when the model is run with io_layout=1,1 and the input first guess to the GSI is the model restart files:
New code:
regional_gsianl_tm03_12.log:The total amount of wall time = 1754.419832
regional_gsianl_tm04_12.log:The total amount of wall time = 2306.498491
regional_gsianl_tm05_12.log:The total amount of wall time = 2135.666786
Old code:
regional_gsianl_tm03_12.log:The total amount of wall time = 1421.840473
regional_gsianl_tm04_12.log:The total amount of wall time = 1472.981763
regional_gsianl_tm05_12.log:The total amount of wall time = 1433.479971
I believe this is a chunk-size issue with the model restart files. When we run the current model code in the RRFS DA parallels with io_layout=1,15, we use the mppnccombine utility to combine the 15 restart file pieces back into one file, using the "-64" option (for netCDF-3 64-bit offset, for which chunksizes are not used):
echo "$EXECfv3/mppnccombine -v -64 fv_core.res.tile1.nc" > cmdfile
echo "$EXECfv3/mppnccombine -v -64 fv_tracer.res.tile1.nc" >> cmdfile
But with the new code we can run with io_layout=1,1, so the netCDF-4 restart files written directly out of the model are used in the GSI, and because of the default chunksizes (the file's dimensions are 3950x2701x65):
float ua(Time, zaxis_1, yaxis_2, xaxis_1) ;
ua:checksum = " 18F05251E151753" ;
ua:_Storage = "chunked" ;
ua:_ChunkSizes = 1, 6, 300, 439 ;
ua:_Endianness = "little" ;
GSI timings with its parallel I/O are degraded.
@BinLiu-NOAA Thanks for the suggestion; the test ran through. Here are some timings:
All testing uses io_layout (1,1).
1) From the develop branch, the c384gdas test:
0: in fv3_cap, init time= 68.9708864763379
2) From Rusty's fv3atm branch:
0: in fv3_cap, init time= 20.7666180133820
3) From Rusty's fv3atm branch with ignore_rst_cksum=.true.
0: in fv3_cap, init time= 16.9400908946991
@EricRogers-NOAA So the only change you made is the io_layout change? Also, I am curious: did you specify the chunksizes for the restart files? I think you can still use io_layout=1,15 if that helps with the total run time.
My understanding is that using more I/O tasks in io_layout will speed up reading the restart files.
Today Ben Blake checked out the latest develop branch and ran a short test on WCOSS Dell Phase 3.5 with the 3 km RRFS LAM domain over North America, cold-started from the GFS analysis. He saw an increase in the model initialization time of almost 8 minutes compared to the current parallel LAM run:
LAM parallel: in fcst,init total time: 74.2139439582825 (#ddcd809, checked out 7/30/21)
My test: in fcst,init total time: 524.866119146347 (#e198256, checked out today)
Ben's test run did not run to completion, so no termination times are available. Bin Liu noted similar behavior in HAFS.