Attention: Patch coverage is 24.13651%, with 1845 lines in your changes missing coverage. Please review. Project coverage is 37.45%. Comparing base (dbda49c) to head (6d7c00a). Report is 15 commits behind head on main.
@marshallward a new variable "p_surf_EOS" appeared in the restart file in one of our runs (super coarse resolution and a simple setting for our script testing). All the others (1x1, 05x05 and quarter degree) don't contain this variable. Which parameter is associated with this?
I notice that in src/tracer/MOM_CFC_cap.F90, the argument conversion=GV%Rho0*US%R_to_kg_m3 was added to the register_diag_field call for the diagnostics cfc11 and cfc12. The field being passed to post_data for these diagnostics is the product of the tracer concentrations and the value being passed to this conversion argument. Should this multiplication have been removed in the post_data call?
It looks like this was done properly in https://github.com/NOAA-GFDL/MOM6/commit/d16f343e42615d09761dc50aaca268b60a7ca504, but got messed up with the merge of main to dev/gfdl at https://github.com/NOAA-GFDL/MOM6/commit/e5b64f91cc6a86d88c5ba34974e39b30697ec660.
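For illustration, here is a minimal sketch of the intended pattern (the field name, handle, and argument lists are placeholders, not the actual MOM_CFC_cap.F90 code): when a conversion factor is supplied at registration, the array handed to post_data should stay in the model's internal units, and the diagnostics framework applies the factor on output.

  ! Sketch only: 'tr_example', id_tr and tr are placeholders, and the argument
  ! lists are abbreviated relative to the real MOM6 interfaces.
  id_tr = register_diag_field('ocean_model', 'tr_example', diag%axesTL, Time, &
      'An example tracer', 'kg m-3', conversion=GV%Rho0*US%R_to_kg_m3)
  if (id_tr > 0) call post_data(id_tr, tr, diag)  ! post tr itself, with no extra factor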
@klindsay28 Yes, I bungled this part of the merge. Good thing that you caught it.
I will put in a fix to remove those units from the post_data call.
@marshallward the wcoss2 non-reproducible issue happened again. I will do some debug tests and provide some clues for you.
@marshallward a new variable "p_surf_EOS" appeared in the restart file in one of our runs (super coarse resolution and a simple setting for our script testing). All the others (1x1, 05x05 and quarter degree) don't contain this variable. Which parameter is associated with this?
This is a new restart variable; it only appears if there is a surface pressure (tv%psurf). It's required to work out the non-Boussinesq densities, but @Hallberg-NOAA can say more. (Commit: 11c3f56435)
Re: WCOSS2 regression, I'll wait to hear back from you. We might need a deep dive to figure out why this keeps happening.
@marshallward this solved my puzzle, as we use its default (True) in the super coarse setting while it's False in the others. A related byproduct is found here: https://github.com/NOAA-GFDL/MOM6-examples/blob/dev/gfdl/ice_ocean_SIS2/OM4_025/MOM_input#L57 and https://github.com/NOAA-GFDL/MOM6-examples/blob/dev/gfdl/ice_ocean_SIS2/OM4_05/MOM_input#L57; the default should be True.
@marshallward I put the run log files on Gaea: /lustre/f2/dev/ncep/Jiande.Wang/For-Marshall. Note that on wcoss2 we have 2 jobs that are compiled in DEBUG mode, and they retain the results.
Looks like the first diff is in vertvisc_coef:
4540c4574
< v-point: c= 182747883 v vertvisc_coef a_[uv]
---
> v-point: c= 182747884 v vertvisc_coef a_[uv]
4634a4669,4672
So only a single bit has changed (and presumably propagated into the subsequent differences).
There are many changes in find_coupling_coef, but they mostly involve removal of dimensional coefficients. Their removal should not normally affect the computation. Perhaps it could affect the manner of optimization, but I would not think so.
In one case, the terms of this sum were reversed:
@@ -2042,9 +2096,9 @@ subroutine find_coupling_coef(a_cpl, hvel, do_i, h_harm, bbl_thick, kv_bbl, z_i,
! These expressions assume that Kv_tot(i,nz+1) = CS%Kv, consistent with
! the suppression of turbulent mixing by the presence of a solid boundary.
if (dhc < bbl_thick(i)) then
- a_cpl(i,nz+1) = kv_bbl(i) / (I_amax*kv_bbl(i) + (dhc+h_neglect)*GV%H_to_Z)
+ a_cpl(i,nz+1) = kv_bbl(i) / ((dhc+h_neglect) + I_amax*kv_bbl(i))
else
- a_cpl(i,nz+1) = kv_bbl(i) / (I_amax*kv_bbl(i) + (bbl_thick(i)+h_neglect)*GV%H_to_Z)
+ a_cpl(i,nz+1) = kv_bbl(i) / ((bbl_thick(i)+h_neglect) + I_amax*kv_bbl(i))
endif
endif ; enddo
do K=nz,2,-1 ; do i=is,ie ; if (do_i(i)) then
I can't imagine this changing answers (assuming FMA is turned off), but maybe a compiler responds differently?
There is also a change from forces%ustar to an explicit Ustar_2d argument. This somewhat resembles the issue from the NCAR PR. Maybe there is some memory issue with forces.
I'm sure I've asked all of this before, but could you remind me about WCOSS2? Is it using the REPRO flags from a mkmf template? The -O level would be most important here. Also ifort --version and lscpu | grep "Model name" might help.
@marshallward we are using -O2
jiande.wang@clogin05:~> ifort --version
ifort (IFORT) 19.1.3.304 20200925
Copyright (C) 1985-2020 Intel Corporation. All rights reserved.
jiande.wang@clogin05:~> lscpu | grep "Model name"
Model name: AMD EPYC 7702 64-Core Processor
-g -traceback -i4 -r8 -fno-alias -auto -safe-cray-ptr -ftz -assume byterecl -sox -O2 -debug minimal -fp-model
-fp-model source?
-g -traceback -i4 -r8 -fno-alias -auto -safe-cray-ptr -ftz -assume byterecl -sox -O2 -debug minimal -fp-model source -module mod -c /lfs/h2/emc/couple/noscrub/Jiande.Wang/MOM6-update/20231113/ufs-weather-model-BL/MOM6-interface/MOM6/config_src/infra/FMS2/MOM_couplertype_infra.F90 -o CMakeFiles/mom6_obj.dir/MOM6/config_src/infra/FMS2/MOM_couplertype_infra.F90.o
I put the compile log at /lustre/f2/dev/ncep/Jiande.Wang/For-Marshall; see compile.txt. It's hard to read, but you can jump to line 900.
The non-Boussinesq changes caused me an issue, addressed with https://github.com/NOAA-GFDL/MOM6/pull/520.
@kshedstrom ⬆️ URL does not exist!
Ack, fixed it.
@jiandewang Is there any chance we (you) could do a git bisect to find the specific commit that causes the divergence? Our PR has a linear history so it should be definitive. Without knowing the commit (and without access to the machine), we are essentially speculating about the problem.
Yes, that's what I have in mind. Let me have a short chat with Marshall.
The new p_surf_EOS restart variable is required so that the potential energy reported at the start of a run-segment in non-Boussinesq mode matches the energy written out at the end of the previous segment. It is being saved whenever USE_PSURF_IN_EOS = True so that any expressions that use the equation of state during the ocean model initialization, before the coupler has had a chance to provide the surface pressure for the next timestep, will be correct. In the case of a non-Boussinesq model, the specific volume is needed in the calculation of APE, but it is easy to envision other (Boussinesq) quantities for which density needs to be derived from the temperature, salinity and layer thickness fields during initialization, and this extra surface pressure field is the one piece that is missing. At no point would adding this new restart field break reproducibility across restarts for any case that worked previously.
However, if people do have a problem with this speculatively added restart variable, we could easily add to the logical test before the call register_restart_field(CS%tv%p_surf, "p_surf_EOS", ...) statement so that it is only written in the non-Boussinesq case where it is essential.
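For example, a guard along these lines would do it (a sketch only: the flag names and the non-Boussinesq test are assumptions, and the remaining register_restart_field arguments are elided as in the statement quoted above):

  ! Sketch only: use_psurf_in_eos and GV%Boussinesq stand in for whatever flags
  ! the real code uses; the trailing arguments are elided.
  if (use_psurf_in_eos .and. (.not. GV%Boussinesq)) then
    call register_restart_field(CS%tv%p_surf, "p_surf_EOS", ...)
  endif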
@Hallberg-NOAA this is fine for us, as what I needed to know was the reason for this new variable appearing in one of our tests (and you and Marshall have explained it to me clearly).
Regarding the git bisect: that's what I did in the previous NCAR PR. This time things are a bit tricky, as several new files are added at a certain stage (we use cmake, which requires listing every file) and there is a parameter change (DEFAULT_2018_ANSWERS) at a certain stage.
If you are explicitly setting DEFAULT_2018_ANSWERS in your tests, you should be able to get the same effect for every commit in this PR chain by setting DEFAULT_ANSWER_DATE to the right value (20181231 or 20190101) instead.
Just to confirm, DEFAULT_2018_ANSWERS = False and DEFAULT_ANSWER_DATE = 20190101 are equivalent, right?
A better explicit value might be to set DEFAULT_ANSWER_DATE = 99991231, which is the same as always using the latest code options unless they are explicitly set for individual parameters. However, if you know a specific date for the code version you want to reproduce, you could use that date for the default answer date (e.g., if you have code from February 1st, 2023, you might want to set DEFAULT_ANSWER_DATE = 20230201). I think that we discussed this scheme in detail when we started the process of obsoleting the ..._2018_ANSWERS parameters and replacing them with ANSWER_DATE parameters at a MOM6 dev call in the summer of 2022.
Regardless, we have been logging the ANSWER_DATE parameters in the MOM_parameter_doc files for over a year now, and these are what you need to reproduce to recreate a specific set of answers.
@gustavo-marques, we are getting b4b results except for the salt_flux_added field in the h.native stream: previously, we were getting all 0s on ocean points, but now we are getting all NaNs everywhere. All other streams and fields are identical. Is this expected or should we investigate further?
@alperaltuntas, thanks for running the tests and reporting this. I never look at salt_flux_added, only salt_flux and salt_flux_in. I do not think this should hold up this PR, but we definitely should investigate it further. Perhaps salt_flux_added is not working properly within the NUOPC cap.
It looks like https://github.com/NOAA-GFDL/MOM6/pull/401 added a partial implementation of a new diagnostic salt_left_behind, and overwrote the value of handles%id_saltFluxAdded, causing the diagnostic salt_flux_added to never get values posted to it.
2 possible approaches to correcting this are:
1. remove the register_diag_field call for salt_left_behind that overwrites handles%id_saltFluxAdded, i.e., remove the partial implementation of the diagnostic salt_left_behind, or
2. complete the implementation of the diagnostic salt_left_behind.
I suggest that @kshedstrom, as the author of https://github.com/NOAA-GFDL/MOM6/pull/401, should chime in.
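Schematically, the clash looks like the following (argument lists elided; handles%id_saltLeftBehind is a hypothetical name for a separate handle):

  ! As described above, the second registration reuses the same handle, so
  ! salt_flux_added never has data posted to it.
  handles%id_saltFluxAdded = register_diag_field(..., 'salt_flux_added', ...)
  handles%id_saltFluxAdded = register_diag_field(..., 'salt_left_behind', ...)
  ! Completing the implementation would instead give the new diagnostic its own handle:
  handles%id_saltLeftBehind = register_diag_field(..., 'salt_left_behind', ...)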
I need to look into this. I meant to add a new variable to pass from SIS2 to MOM6 via the coupler, not to break anything else.
Please check out NOAA-GFDL/MOM6#526
Thanks, @kshedstrom. This fixes our issue (though it appears to have failed the CI tests).
Yeah, I know it failed, but I have no idea why. Sigh.
Not sure where that phantom FPE came from in MacOS, but it's passing now.
@klindsay28 @alperaltuntas Do you want this merged into the PR to main? (I will need to do a final regression test)
I am getting the following error when running in DEBUG mode with Intel:
It seems related to the introduction of the pointer int_tide_CSp in https://github.com/NOAA-GFDL/MOM6/commit/1d35fa1c2f984b94041029f6857e6dfaa410198f
We have run into this sort of problem before (though I'm yet to test this PR!).
@gustavo-marques what are your compiler flags in debug mode?
@gustavo-marques it passed in UFS with debug mode; our flags are -g -traceback -i4 -r8 -fno-alias -auto -safe-cray-ptr -ftz -assume byterecl -sox -O0 -check -check noarg_temp_created -check nopointer -fpe0 -ftrapuv -init=snan,arrays -module mod -c
Here are our flags:
-qno-opt-dynamic-align -convert big_endian -assume byterecl -ftz -traceback -assume realloc_lhs -fp-model source -O2 -debug minimal -no-fma -qopt-report -march=core-avx2
CS%int_tides_CSp is allocated when CS%adiabatic is false (although there is at least one extra function hop), and it does take some time to get there. The else-block here can only be entered when diabatic. Assuming nothing behaves funny, CS%int_tide_CSp should be associated. There are far too many pointers for my taste, but at least in principle it ought to be associated, and the error is confusing to me.
Since the error is in the function call, I have to at least ask if the .mod files are all up to date. Assuming that, what does the debugger say? If it's still null(), then I think we should get feedback from @raphaeldussin.
Thank you all for the suggestions!
@marshallward, the .mod files were not up to date and that's why it was failing. I did a clean build and the error disappeared. I'm sorry for raising a false issue.
Regarding Kate's PR, let us merge it with this candidate and we can re-run our tests again.
Thanks!
NOAA-GFDL/MOM6#526 has now been merged into this candidate branch.
I am also looking into this FPE error on the MacOS tests; it seems to have only recently started happening. Kate first noticed it in her PR.
For now, re-running the test seems to work, which we'll have to accept. For those getting email updates, I apologize for the unnecessary spam.
I encountered the same error message @gustavo-marques posted.
CS%int_tides_CSp is allocated when CS%adiabatic is false (although there is at least one extra function hop).
It appears that for CS%int_tides_CSp to be allocated, INTERNAL_TIDES must be set to True as well: https://github.com/NOAA-GFDL/MOM6/blob/f4c95ec4b81e2958c151be101d2b15b0de28f910/src/parameterizations/vertical/MOM_diabatic_driver.F90#L3589C1-L3589C1
While I'm not proposing this as a solution, I tried out removing the requirement for INTERNAL_TIDES to be true for the allocation of CS%int_tides_CSp. This, however, led to another error:
FATAL from PE 2: Unable to find a match for dimension angle for variable IW_energy_mode1 in file ./MOM_IC.nc
@jiandewang I have finally managed to get access to WCOSS and have completed some run comparisons. I believe the cause is fused multiply-adds (FMAs) being enabled on WCOSS.
I discovered that reversing the order of the terms in an expression changed answers. This is the diff:
--- a/src/parameterizations/vertical/MOM_vert_friction.F90
+++ b/src/parameterizations/vertical/MOM_vert_friction.F90
@@ -2042,9 +2042,11 @@ subroutine find_coupling_coef(a_cpl, hvel, do_i, h_harm, bbl_thick, kv_bbl, z_i,
! These expressions assume that Kv_tot(i,nz+1) = CS%Kv, consistent with
! the suppression of turbulent mixing by the presence of a solid boundary.
if (dhc < bbl_thick(i)) then
- a_cpl(i,nz+1) = kv_bbl(i) / (I_amax*kv_bbl(i) + (dhc+h_neglect)*GV%H_to_Z)
+ a_cpl(i,nz+1) = kv_bbl(i) / ((dhc+h_neglect)*GV%H_to_Z + I_amax*kv_bbl(i))
else
- a_cpl(i,nz+1) = kv_bbl(i) / (I_amax*kv_bbl(i) + (bbl_thick(i)+h_neglect)*GV%H_to_Z)
+ a_cpl(i,nz+1) = kv_bbl(i) / ((bbl_thick(i)+h_neglect)*GV%H_to_Z + I_amax*kv_bbl(i))
endif
endif ; enddo
do K=nz,2,-1 ; do i=is,ie ; if (do_i(i)) then
When I looked at the assembly, I noticed that it was using FMAs to evaluate the expression, which would explain the answer difference.
113070 .loc 1 2045 is_stmt 1 | 113070 .loc 1 2045 is_stmt 1
113071 vaddsd %xmm2, %xmm5, %xmm1 #2045| 113071 vmulsd %xmm3, %xmm4, %xmm1 #204
113073 vmulsd %xmm9, %xmm1, %xmm2 #2045| 113073 vaddsd %xmm2, %xmm5, %xmm2 #204
113075 vfmadd231sd %xmm4, %xmm3, %xmm2 #2045| 113075 vfmadd213sd %xmm1, %xmm9, %xmm2 #204
113077 vdivsd %xmm2, %xmm3, %xmm3 #2045| 113077 vdivsd %xmm2, %xmm3, %xmm3 #204
113079 vmovsd %xmm3, -8(%rdi,%r11,8) #2045| 113079 vmovsd %xmm3, -8(%rdi,%r11,8) #204
113081 jmp ..B8.146 # Prob 100% #2045| 113081 jmp ..B8.146 # Prob 100% #204
Although your template does not turn them on, the Cray wrapper does this implicitly:
$ ftn -craype-verbose
ifort -march=core-avx2 -mtune=core-avx2
The -march flags will allow FMAs if the hardware supports them. Since fma(a,b,c) is more accurate than a*b + c, it will change answers.
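As a self-contained illustration (a hypothetical test program, not MOM6 code), the program below compares a separately rounded multiply-add against the same expression written so the compiler may contract it into an FMA. With FMAs enabled (e.g. -march=core-avx2) the two results can differ, while with -no-fma they agree.

  program fma_demo
    ! Hypothetical demo, not MOM6 code.  The exact product a*b carries a tiny
    ! tail (2**-60) that is lost when the multiply is rounded separately but
    ! kept by a single fused multiply-add.
    implicit none
    double precision, volatile :: a, b, c  ! volatile keeps the compiler from folding the arithmetic at compile time
    double precision :: p, two_step, maybe_fused
    a = 1.0d0 + 2.0d0**(-30)
    b = 1.0d0 + 2.0d0**(-30)
    c = -(1.0d0 + 2.0d0**(-29))
    p = a*b                 ! separately rounded product: the 2**-60 tail is dropped
    two_step = p + c        ! evaluates to exactly 0
    maybe_fused = a*b + c   ! may be contracted to fma(a,b,c) = 2**-60 when FMAs are enabled
    print *, 'two-step   :', two_step
    print *, 'contracted?:', maybe_fused
  end program fma_demo

The same single-rounding behaviour is presumably why reordering the terms in the find_coupling_coef expression lets the compiler contract the sum differently and change the last bit of the result.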
We introduce a lot of changes using dimensional constants, which could easily push the compiler to use or remove existing FMAs. Although most of the time these should not have any effect, there are plenty of situations where it can make a difference.
I repeated the runs with the -no-fma flag, and did not detect any answer changes, so this is almost certainly the problem. But it should also be tested in the UFS suite.
As for how to solve this, I can only see two options:
1. We could hunt for every single scenario where an FMA could have been affected, and try to replicate the old operation with parentheses.
2. EMC could modify its regression tests to disable FMAs. Unfortunately, this will also probably slow down the model.
GFDL does not use FMAs because of this problem, and also suffers from this performance loss.
Neither of these options is good. But I don't think we have a choice. We might need to discuss this at a future MOM meeting.
As for long term solutions, we may need to do more aggressive FMA/no-FMA testing before submitting PRs for review.
None of the intel compiler templates at mom-ocean/mkmf/templates include -no-fma. They do typically include -fp-model source and -ftz.
For HYCOM, we always use -fp-model precise -no-fma -ftz, because otherwise we don't get bit-for-bit reproducibility on different numbers of MPI tasks (due to the loop prologue or epilogue after vectorization being different for different tile sizes).
Whether or not the ftn wrapper includes -march=core-avx2 will depend on the system. For the systems which have it, we set -march=core-avx-i, which overrides -march=core-avx2 and disables FMAs. I looked through our output and did not see any FMA instructions.
WCOSS was also using -fp-model source, so it would seem that it does not inhibit FMAs.
Yes, in UFS we have -fp-model source, but for wcoss2 we also have -march=core-avx2: https://github.com/ufs-community/ufs-weather-model/blob/develop/cmake/Intel.cmake#L22-L28
I just took a look at GEOS and we are...confused, I guess. 😄
For GNU, we generally don't specify so I guess GNU can do what it wants. Well, other than some not-well-tested Aggressive build options where FMA was turned off per a suggestion from the mailing list.
For Intel, we do have it on, though it would be interesting to see if it matters. Our Intel flags are a melange of historical things. Essentially we run with:
-O3 -g -march=core-avx2 -fma -align array32byte -traceback -assume realloc_lhs
-qopt-report0 -align all -fno-alias -fpe3
-fp-model fast -fp-model source -fp-model consistent -ftz
-assume noold_maxminloc -diag-disable 10121 -align dcommons -fPIC -qopenmp
so who knows what is actually on and off by the end. But we've found through trial-and-error this gives us okay performance and the reproducibility between chips we want.
I'm not aware of the exact criteria for "reproducibility between chips" of MOM6. Does it include:
@mathomp4 with those flags, you are almost certainly using FMAs. I see nothing that would explicitly disable them. I can't speak to the degree of reproducibility, but that is a decision for GEOS.
@sanAkel I don't think we have good information on this topic. We saw answer changes from Intel to AMD, but this turned out to be related to different math libraries, not the chips.
I think we could say a lot more on the topic of reproducibility across hardware and compiler settings, but it might be too much a digression for a PR which has already languished for a very long time. We could start up a new Discussions topic?
👍
in UFS we just added a regional coupled run (fv3+MOM6+WAVE) half a week ago. Testing from the HAFS group (@binli2337) shows result changes. John Steffen (@JohnSteffen-NOAA) also reported answer changes in his MOM6 standalone tests. Coupled runs are done on HERA; Bin Li is testing on wcoss2 now. The standalone test is done on ORION (which is similar to HERA). I have put the run log files on GAEA at /gpfs/f5/nggps_emc/scratch/Jiande.Wang/For-Marshall/HAFS. A diff of the run logs (with DEBUG on) shows the difference starts from MOM_initialize_state. Both coupled and standalone runs show similar diff results.
This PR contains many new features, bugfixes, and overall changes to the MOM6 codebase. In particular, the model now includes a full non-Boussinesq solver alongside the existing Boussinesq and "semi-non-Boussinesq" modes. Other major features include updates to the ice shelf model, improved particle tracer support, improvements to the computational performance of the Zanna-Bolton parameterization, fixes to the internal tide energetics, brine plume mixing, and self-attraction and loading.
It also includes the usual bundle of bug fixes and code refactors.
Non-Boussinesq
Ice Shelf
Particle Tracers
ZB2020
Internal Tide
Brine Plume Mixing
Self-attraction and Loading
Additional Features
Bugfix
Refactor
Testing
Build
Misc
Contributors