oceanmodeling / schism-esmf

Earth System Modeling Framework cap for SCHISM
0 stars 0 forks source link

RT baseline check is failing for coastal_ike_shinnecock_atm2sch2ww3 #2

Closed uturuncoglu closed 4 months ago

uturuncoglu commented 5 months ago

@janahaddad @pvelissariou1 @saeed-moghimi-noaa I created baselines for supported test on Hercules and run the tests against baseline. The coastal_ike_shinnecock_atm2sch2ww3 test is failing as following,

==> logs/log_hercules/rt_004_coastal_ike_shinnecock_atm2sch2ww3_intel.log <==

baseline dir = /work2/noaa/nems/tufuk/RT/NEMSfv3gfs/develop-20240126/coastal_ike_shinnecock_atm2sch2ww3_intel
working dir  = /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_732961/coastal_ike_shinnecock_atm2sch2ww3_intel
Checking test 004 coastal_ike_shinnecock_atm2sch2ww3_intel results ....
 Comparing outputs/schout_000000_1.nc .........OK
 Comparing outputs/schout_000000_2.nc .........OK
 Comparing outputs/schout_000001_1.nc .........OK
 Comparing outputs/schout_000001_2.nc .........OK
 Comparing outputs/schout_000002_1.nc .........OK
 Comparing outputs/schout_000002_2.nc .........OK
 Comparing 20080824.000000.out_grd.ww3.nc .........OK
 Comparing 20080825.000000.out_grd.ww3.nc .........OK
 Comparing 20080826.000000.out_grd.ww3.nc .........OK
 Comparing 20080827.000000.out_grd.ww3.nc .........OK
 Comparing 20080828.000000.out_grd.ww3.nc .........OK
 Comparing ufs.cpld.cpl.hi.2008-08-24-00000.nc ............ALT CHECK......NOT OK
 Comparing ufs.cpld.cpl.hi.2008-08-25-00000.nc ............ALT CHECK......NOT OK
 Comparing ufs.cpld.cpl.hi.2008-08-26-00000.nc ............ALT CHECK......NOT OK
 Comparing ufs.cpld.cpl.hi.2008-08-27-00000.nc ............ALT CHECK......NOT OK
 Comparing ufs.cpld.cpl.hi.2008-08-28-00000.nc ............ALT CHECK......NOT OK

 0: The total amount of wall time                        = 548.940647
 0: The maximum resident set size (KB)                   = 259812

Test 004 coastal_ike_shinnecock_atm2sch2ww3_intel FAIL

It seems that we have reproducibility or configuration issue in here that needs to be solved. Maybe it is just metadata issue in the mediator history since model output reproduces. Anyway, this needs to be investigated. I don't think this configuration is ever tested agains the baseline before. I am also creating baseline on Orion and test over there too.

uturuncoglu commented 5 months ago

I checked with NCAR's cprnc tool and it seems that only one field (Med_frac_ocn_ofrac) is not matching,

 Med_frac_ocn_ofrac   (Med_frac_ocn_nx,Med_frac_ocn_ny,time)  t_index =      1     1
         16     5780  (    53,     1,     1) (  2698,     1,     1) (    29,     1,     1) (    29,     1,     1)
                5780   2.141958208000000E+09  -1.874137024000000E+09 2.6E+09 -1.234390976000000E+09 1.4E-03 -1.234390976000000E+09
                5780   1.627074624000000E+09  -1.784926144000000E+09          1.332195392000000E+09          1.332195392000000E+09
                5780  (     1,     1,     1) (  2367,     1,     1)
          avg abs field values:    1.485276628719723E+06    rms diff: 6.1E+07   avg rel diff(npos):  1.4E-03
                                   1.730461296712803E+06                        avg decimal digits(ndif):  1.0 worst: -0.3
 RMS Med_frac_ocn_ofrac               6.0673E+07            NORMALIZED  3.7735E+01

This could be issue in the SCHSIM cap but it is also weird that coastal_ike_shinnecock_atm2sch is passing without any issue. Maybe that does not pass ocean fraction?

uturuncoglu commented 5 months ago

I checked the coastal_ike_shinnecock_atm2sch case. That has also Med_frac_ocn_ofrac field and if I check it with ncdump -v it seems fine.

If I check the same field in coastal_ike_shinnecock_atm2sch2ww3_intel test, I am seeing very large and strange values (probably not initialized). I need to check the mediator to understand how this field is calculated (or maybe provided by the SCHSIM as mask - need to double check). There might be an issue of data assignment or mesh itself.

 Med_frac_ocn_ofrac =
  -369381312, 5383, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
    3, 3, 3, 3, 3, 3, 3, -1234390976, 5324, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2141958208, 5446, 3, 3, 3, 3, 3, 3, 3, 
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1220292672, 5393, 3, 3, 3, 3, 3, 
uturuncoglu commented 5 months ago

@pvelissariou1 Are coastal_ike_shinnecock_atm2sch2ww3 and coastal_ike_shinnecock_atm2sch using same mesh? It is strange that both case has different value in the Med_frac_ocn_ofrac field. The coastal_ike_shinnecock_atm2sch case has something like following which seems correct to me.

 Med_frac_ocn_ofrac =
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

We are setting ocean_mask using idry_e. So, maybe there is an issue in the configuration.

https://github.com/oceanmodeling/schism-esmf/blob/9fc62b21afc1edcb691bef5ffa5b309375d5b07c/src/schism/schism_nuopc_cap.F90#L841

@josephzhang8 How idry_e is initialized? Does the model reads it from file? Maybe mesh and the idry_e is not consistent for coastal_ike_shinnecock_atm2sch2ww3 configuration. Not sure.

josephzhang8 commented 5 months ago

@uturuncoglu idry_e is init'ed before time stepping to reflect the wet/dry condition initially if ihot=0. If ihot/=0, the flags are read in from hotstart.nc.

Maybe some UFS calls precede the SCHISM init?

uturuncoglu commented 5 months ago

@josephzhang8 Let me check. I have no issue with coastal_ike_shinnecock_atm2sch case. It could be related with this specific configuration.

uturuncoglu commented 5 months ago

This is also failing on Orion agains the baseline.

==> logs/log_orion/rt_004_coastal_ike_shinnecock_atm2sch2ww3_intel.log <==

baseline dir = /work/noaa/nems/tufuk/RT/NEMSfv3gfs/develop-20240126/coastal_ike_shinnecock_atm2sch2ww3_intel
working dir  = /work/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_100793/coastal_ike_shinnecock_atm2sch2ww3_intel
Checking test 004 coastal_ike_shinnecock_atm2sch2ww3_intel results ....
 Comparing outputs/schout_000000_1.nc .........OK
 Comparing outputs/schout_000000_2.nc .........OK
 Comparing outputs/schout_000001_1.nc .........OK
 Comparing outputs/schout_000001_2.nc .........OK
 Comparing outputs/schout_000002_1.nc .........OK
 Comparing outputs/schout_000002_2.nc .........OK
 Comparing 20080824.000000.out_grd.ww3.nc .........OK
 Comparing 20080825.000000.out_grd.ww3.nc .........OK
 Comparing 20080826.000000.out_grd.ww3.nc .........OK
 Comparing 20080827.000000.out_grd.ww3.nc .........OK
 Comparing 20080828.000000.out_grd.ww3.nc .........OK
 Comparing ufs.cpld.cpl.hi.2008-08-24-00000.nc ............ALT CHECK......NOT OK
 Comparing ufs.cpld.cpl.hi.2008-08-25-00000.nc ............ALT CHECK......NOT OK
 Comparing ufs.cpld.cpl.hi.2008-08-26-00000.nc ............ALT CHECK......NOT OK
 Comparing ufs.cpld.cpl.hi.2008-08-27-00000.nc ............ALT CHECK......NOT OK
 Comparing ufs.cpld.cpl.hi.2008-08-28-00000.nc ............ALT CHECK......NOT OK

 0: The total amount of wall time                        = 655.100498
 0: The maximum resident set size (KB)                   = 239148

Test 004 coastal_ike_shinnecock_atm2sch2ww3_intel FAIL
uturuncoglu commented 5 months ago

@josephzhang8 @platipodium I check this little bit more. It seems that ocean mask in the SCHISM export state is fine. I still need to check it in the mediator side to be sure that it is received correctly. I'll look at it later and update you if I find something new.

uturuncoglu commented 5 months ago

@josephzhang8 @platipodium @pvelissariou1 @janahaddad Okay. I think I found the source of the issue. In SCHSIM side, we are defining ocean_mask export field as integer but mediator expects it in double. So, this was creating problem when mediator defining the ocean fraction. I quickly hacked the SCHSIM to provide double rather than integer and that fixed the issue in the mediator side and I could able to see ocean mask and fraction correctly. I still did not implement the fix but maybe I could create temporary variable to create double version of idry mask variable and pass that one to create mask. I am not sure how this is working with atm2sch case. Maybe we are just lucky (or not). This also shows the importance of the robust and multi level testing. It is a small issue but I could not catch it until creating baseline and running configuration against it. Anyway, I will fix it soon and create a PR. Thanks all for your help.

platipodium commented 5 months ago

In SCHSIM side, we are defining ocean_mask export field as integer but mediator expects it in double.

Thanks for pointing this out. We have other ocean models that use mostly integer, but also logical masks. And then there's the issue of convention. Is a non-zero value unmasked (bash-style, 1=FALSE) or is it masked (python style, 1 = TRUE)? For generic coupling

  1. the mediator will have to say what type it accepts and what semantics
  2. the cap will have to provide the correct metadata (original type and semantics)
  3. either one would provide conversion routines.

The different mask types and semantics have been taken care of in the FABM project with preprocessor definitions, but I'd prefer proper metadata and conversion routines

josephzhang8 commented 5 months ago

Thx @uturuncoglu for finding out the source. I agree with @platipodium that this needs to be ironed out carefully. @danishyo had similar issue with FV3<->SCHISM - in that case, he had to convert idry to iwet=1-idry b/c FV3 expects that.

Short of any change in mediator, we can easily create a temp var as Ufuk did to convert idry to double, if I understand correctly.

uturuncoglu commented 5 months ago

@josephzhang8 @platipodium @pvelissariou1 @janahaddad Okay. I implemented a fix and it is in https://github.com/oceanmodeling/schism-esmf/tree/hotfix/ocean_mask branch. You could also see the difference from the following link - https://github.com/oceanmodeling/schism-esmf/pull/new/hotfix/ocean_mask. I created the baseline and run coastal_ike_shinnecock_atm2sch2ww3 against it. Now all files are passing. I think fix is working as expected. Next, I'll look at issue https://github.com/oceanmodeling/schism/issues/4 in the component side that creates issue. Once we solve it too, I and plaining to create two PR one in model and other one is in nuopc interface side. In the mean time if you could test/review hotfix/ocean_mask branch in your side that would be great.

platipodium commented 5 months ago

Thanks for putting together this hot fix @uturuncoglu. I noticed that we have now a (likely good, but not yet tested) solution for source node meshes line 575: if (isPresent .and. meshloc == ESMF_MESHLOC_NODE). We would not be able to deal with source element meshes, but I guess that is not a realistic case? If possible, a simple error for this case "not implemented" would be useful.

uturuncoglu commented 5 months ago

@platipodium It seems that if it is element then we are removing ghosts and then there is no need to call halo update since there is nothing to update. I think this is only valid if we have node based mesh that has ghost points. Anyway, maybe this is not the answer but if you could clarify little bit more that would be helpful.

platipodium commented 5 months ago

@platipodium It seems that if it is element then we are removing ghosts and then there is no need to call halo update

That seems a reasonable analysis. I agree we don't need to deal with the ESMF_MESHLOC_ELEMENT case.

uturuncoglu commented 5 months ago

@platipodium @josephzhang8 JFYI, we might still issue. I am investigating now. Please hold on.

uturuncoglu commented 5 months ago

@platipodium @josephzhang8 Okay. I think I found it. We need to specialize model_label_DataInitialize in SCHSIM cap and set So_mask in there since the mask is used in mediator DataInitialize step. At this point, we are providing it in the Advance step and mediator could not reach that information since requires mask before advance step. I also checked the MOM6 cap and the mask is set in its DataInitialize phase. Anyway, we could discuss it more when we have a call but utimate fix is on its way. I'll update you when it is ready.

uturuncoglu commented 4 months ago

The RT baseline check issue is fixed by the restructuring the SCHSIM NUOPC cap. The PR (https://github.com/schism-dev/schism-esmf/pull/30) is merged. I am closing this issue.

platipodium commented 4 months ago

thanks @uturuncoglu !