mom-ocean / MOM6

Modular Ocean Model
Other
185 stars 232 forks source link

NCAR to main (2023-09-14) #1610

Closed gustavo-marques closed 1 year ago

gustavo-marques commented 1 year ago

This PR includes the following:

Features


New parameters


Bug fixes & documentation improvements


Code improvements and cleanup

Contributors

codecov[bot] commented 1 year ago

Codecov Report

Merging #1610 (87cf00b) into main (afb3a44) will decrease coverage by 0.27%. The diff coverage is 16.66%.

:exclamation: Current head 87cf00b differs from pull request most recent head 3d07e5b. Consider uploading reports for the commit 3d07e5b to get more accurate results

@@            Coverage Diff             @@
##             main    #1610      +/-   ##
==========================================
- Coverage   38.17%   37.91%   -0.27%     
==========================================
  Files         269      269              
  Lines       76503    77155     +652     
  Branches    14064    14164     +100     
==========================================
+ Hits        29205    29252      +47     
- Misses      42051    42625     +574     
- Partials     5247     5278      +31     
Files Coverage Δ
...ig_src/drivers/solo_driver/MOM_surface_forcing.F90 24.04% <100.00%> (ø)
src/core/MOM.F90 51.36% <100.00%> (ø)
src/core/MOM_forcing_type.F90 42.24% <ø> (+0.27%) :arrow_up:
src/core/MOM_unit_tests.F90 15.00% <ø> (ø)
src/core/MOM_variables.F90 48.41% <ø> (+1.44%) :arrow_up:
src/diagnostics/MOM_obsolete_params.F90 73.78% <100.00%> (+0.25%) :arrow_up:
...c/parameterizations/vertical/MOM_set_viscosity.F90 60.98% <ø> (ø)
src/tracer/MOM_tracer_types.F90 0.00% <ø> (ø)
...nfig_src/infra/FMS1/MOM_ensemble_manager_infra.F90 30.00% <75.00%> (+2.22%) :arrow_up:
src/framework/MOM_get_input.F90 62.16% <0.00%> (ø)
... and 13 more

... and 1 file with indirect coverage changes

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

kshedstrom commented 1 year ago

What version of FMS is everyone using? I'm trying 2023.02 for Theresa's CIOD and it is forcing me to use the intel compiler instead of gnu. Assuming I did the git merge correctly, this PR is giving me:

use fms_io_mod, only  : fms_io_set_filename_appendix=>set_filename_appendix
------------------------^
compilation aborted for //import/c1/AKWATERS/kate/ESMG/ESMG-configs/src/MOM6/config_src/infra/FMS2/MOM_ensemble_manager_infra.F90 (code 1)
marshallward commented 1 year ago

What version of FMS is everyone using? I'm trying 2023.02 for Theresa's CIOD and it is forcing me to use the intel compiler instead of gnu. Assuming I did the git merge correctly, this PR is giving me:

use fms_io_mod, only  : fms_io_set_filename_appendix=>set_filename_appendix
------------------------^
compilation aborted for //import/c1/AKWATERS/kate/ESMG/ESMG-configs/src/MOM6/config_src/infra/FMS2/MOM_ensemble_manager_infra.F90 (code 1)

There should no longer be any references to fms_io in the source code.

@gustavo-marques I believe that this may need to be patched.

https://github.com/mom-ocean/MOM6/blob/12eae12c7fe8432ef2c7775c92cc747686123f30/config_src/infra/FMS2/MOM_ensemble_manager_infra.F90#L29-L34

Could this be an error?

gustavo-marques commented 1 year ago

Thanks for catching this. These lines were added in this PR: https://github.com/NCAR/MOM6/pull/241 I will check with @alperaltuntas and report back.

gustavo-marques commented 1 year ago

Please let us know if Alper's patch fixed this issue.

kshedstrom commented 1 year ago

I still got the same error. You took out the line using the function, but you still need to get rid of this line:

use fms_io_mod, only  : fms_io_set_filename_appendix=>set_filename_appendix

Without this line, it compiles. Some tests are successful, some not, but they do run when compiled for debugging which is annoying. A typical error from say MOM6-examples/ocean_only/single_column/EPBL:

FATAL: write_energy : Excessive energy per unit mass forced model termination.

Image              PC                Routine            Line        Source             
MOM6               00000000018514E7  Unknown               Unknown  Unknown
MOM6               00000000012A20F6  mpp_mod_mp_mpp_er          72  mpp_util_mpi.inc
MOM6               0000000000C88B5F  mom_error_handler         154  MOM_error_handler.F90
MOM6               000000000041CBAC  mom_sum_output_mp         901  MOM_sum_output.F90
MOM6               00000000010CE52E  mom_mp_step_mom_         1016  MOM.F90
MOM6               0000000000699337  MAIN__                    484  MOM_driver.F90
MOM6               00000000004132E2  Unknown               Unknown  Unknown
libc-2.28.so       00007FDAC7519D85  __libc_start_main     Unknown  Unknown
MOM6               00000000004131EE  Unknown               Unknown  Unknown
Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

Compiler: intel/2022a

alperaltuntas commented 1 year ago

I still got the same error. You took out the line using the function, but you still need to get rid of this line:

Oops, I'll correct that.

As for the other error (Excessive energy per unit mass forced model termination), I was unable to reproduce it with intel/2022.1 and FMS 2023.02 on cheyenne, and with the latest MOM6-examples configuration. Which version of single_column/EPBL run are you running? Have you tried another compiler?

kshedstrom commented 1 year ago

Since it runs in debug mode, I'm willing to chalk it up as a compiler issue. All the ESMG tests run.

I don't have other compilers to try on "new chinook" with my current software versions. I'll approve assuming that line will get taken out.

jiandewang commented 1 year ago

It worked fine on R&D machine (b4b results) but I am seeing different results on EMC operational machine. I am doing debug work and will provide some clue soon.

jiandewang commented 1 year ago

@alperaltuntas and @gustavo-marques I have a pair of run log, restart file, et. al at https://www.emc.ncep.noaa.gov/gc_wmb/wd20xw/JD/NCAR-PR.tar-txt you need to remove the "-txt" from the file ename after downloading and then untar it. Our system doesn't allow file name ended as "tar". "*err" are run log. "BL" means current baseline, and you know what "PR" means in the file name what I see is difference starts from 380c381 < h-point: c= 3602810 Before steps forces%ustar

h-point: c= 3602765 Before steps forces%ustar

kshedstrom commented 1 year ago

Switching to ifort 2023.1 fixed this for me (and broke something else).

jiandewang commented 1 year ago

@gustavo-marques and @alperaltuntas some updating from UFS: I tried bisect and made some testing, what I can confirm is the last commit that works for UFS is commit d4aa10857b7679701fdbd44ea67adc6213c0800c Author: Ian Grooms ian.grooms@colorado.edu Date: Tue Aug 29 09:39:55 2023 -0600

Add Leith+E (#251)

* Add Leith+E

and problem starts from: d7415177e47bcfd00a6c46db8184d7f982211f76 Merge: 92756fe84 d4aa10857 Author: Gustavo Marques gmarques@ucar.edu Date: Wed Aug 30 14:45:49 2023 -0600

Merge branch 'dev/ncar' into fpmix_draft_15August2023

our regression system only allows each user to run one regression test at the same time, thus it is a slow process for me to provide this clue to you.

marshallward commented 1 year ago

This PR passed our regression, but I suspect we will need a bit of time to manually review the changes.

I am also a bit concerned about the issues raised by @kshedstrom, although I'm not yet sure which are related to this PR.

gustavo-marques commented 1 year ago

@jiandewang, thanks for doing these tests and narrowing down where the change is coming from. I will look into this tomorrow and get back to you.

jiandewang commented 1 year ago

@jiandewang, thanks for doing these tests and narrowing down where the change is coming from. I will look into this tomorrow and get back to you.

@gustavo-marques one more clue for you: UFS quarter degree runs are fine on wcoss2. The half and 1x1 degree runs all failed to retain baseline results. We follow closely to the MOM_input in MOM_examples for all these 3 resolutions, so the issue must be some parameter related but I can't figure out which one it is.

gustavo-marques commented 1 year ago

@marshallward; regression tests are failing because KhTr_u, KhTr_v, and KhTr_h were changed from 2D (lat/lon) to 3D (lat/lon/depth).

gustavo-marques commented 1 year ago

@jiandewang; we are still unable to find where the issue is coming from. Can you please send us the MOM_parameters_doc.short for the passing (1/4 deg) and failing (1/2 deg)?

jiandewang commented 1 year ago

@jiandewang; we are still unable to find where the issue is coming from. Can you please send us the MOM_parameters_doc.short for the passing (1/4 deg) and failing (1/2 deg)?

see https://www.emc.ncep.noaa.gov/gc_wmb/wd20xw/JD/MOM_parameter_doc.short.tar-txt, remove "txt" before untar it

marshallward commented 1 year ago

I have tried to look at this, and the only thing that seems to make sense is a regression in forces%ustar. Problems don't appear to become serious until the ePBL solver, but the differing term, Kd_ePBL, depends on ustar.

The only idea I have at the moment is that the inclusion of forces%omega_w2x is causing some kind of pointer corruption, or is shuffling things around in memory in a way that has discovered a hidden bug. But that feels like a "blame the compiler" excuse, not a true explanation.

My best suggestion for now is to completely remove all of the content related to omega_w2x and see if it helps to fix the problem. If so, then I recommend we remove all of this work and address it in a separate PR.

gustavo-marques commented 1 year ago

@marshallward: thanks for the feedback. @jiandewang: I commented out all omega_w2x entries in my last commit. Can you please run your tests again and let us know if the issue is fixed?

jiandewang commented 1 year ago

@gustavo-marques: with your latest commit, all UFS jobs retain current baseline now.

marshallward commented 1 year ago

@gustavo-marques Did you want to clean up these changes? Or did you want to leave them in so that we can revisit this problem?

This is now the second mysterious computational ~issue~ regression detected by WCOSS, we may need to start doing some work to figure out what could be happening here.

gustavo-marques commented 1 year ago

@marshallward, our computers are down this week. I suggest we take this PR as is and we (NCAR) will revisit this issue in a future PR.

marshallward commented 1 year ago

@gustavo-marques Hope you don't mind, I went ahead and merged this for you.

Thanks to everyone, this one was particularly tricky.

gustavo-marques commented 1 year ago

@marshallward Thanks for merging it