ufs-community / ufs-weather-model

UFS Weather Model

Aerosol-related PET log error message #1888

Open DeniseWorthen opened 1 year ago

DeniseWorthen commented 1 year ago

Description

I compiled and ran both cpld_control_nowave_noaero_p8 and cpld_bmark_p8 in debug mode using @ulmononian's feature/add_c5 branch. Both tests run, but the bmark test (which has aerosols) produces a PET log error message on PEs 0000 through 0766. There is no error message in the 0767 log. The aerosol component runs on PEs 0:767. I don't know why the last aerosol PE does not produce the error.

The error message is:

20230907 124949.789 INFO             PET0766 UFS Aerosols: Advancing from 2013-04-01T00:00:00 to 2013-04-01T00:05:00
20230907 125011.127 ERROR            PET0766 ESMCI_Info.C:668 Info::erase() Not found  - [json.exception.out_of_range.403] key 'GridCornerLons:' not found
20230907 125011.127 ERROR            PET0766 ESMCI_Info.C:688 Info::erase() Not found  - Internal subroutine call returned Error
20230907 125011.127 ERROR            PET0766 ESMC_InfoCDef.C:243 ESMC_InfoErase() Not found  - Internal subroutine call returned Error
20230907 125011.127 ERROR            PET0766 ESMF_Info.F90:2656 ESMF_InfoRemove() Not found  - Internal subroutine call returned Error
20230907 125011.127 ERROR            PET0766 src/Superstructure/AttributeAPI/interface/ESMF_Attribute.F90:46022 ESMF_AttributeRemoveAttPackGrid( Not found  - Internal subroutine call returned Error
20230907 125011.127 ERROR            PET0766 ESMCI_Info.C:668 Info::erase() Not found  - [json.exception.out_of_range.403] key 'GridCornerLats:' not found
20230907 125011.127 ERROR            PET0766 ESMCI_Info.C:688 Info::erase() Not found  - Internal subroutine call returned Error
20230907 125011.127 ERROR            PET0766 ESMC_InfoCDef.C:243 ESMC_InfoErase() Not found  - Internal subroutine call returned Error
20230907 125011.127 ERROR            PET0766 ESMF_Info.F90:2656 ESMF_InfoRemove() Not found  - Internal subroutine call returned Error
20230907 125011.127 ERROR            PET0766 src/Superstructure/AttributeAPI/interface/ESMF_Attribute.F90:46022 ESMF_AttributeRemoveAttPackGrid( Not found  - Internal subroutine call returned Error
20230907 125011.127 ERROR            PET0766 ESMCI_Info.C:668 Info::erase() Not found  - [json.exception.out_of_range.403] key 'GridCornerLons:' not found
20230907 125011.127 ERROR            PET0766 ESMCI_Info.C:688 Info::erase() Not found  - Internal subroutine call returned Error
20230907 125011.127 ERROR            PET0766 ESMC_InfoCDef.C:243 ESMC_InfoErase() Not found  - Internal subroutine call returned Error
20230907 125011.127 ERROR            PET0766 ESMF_Info.F90:2656 ESMF_InfoRemove() Not found  - Internal subroutine call returned Error
20230907 125011.127 ERROR            PET0766 src/Superstructure/AttributeAPI/interface/ESMF_Attribute.F90:46022 ESMF_AttributeRemoveAttPackGrid( Not found  - Internal subroutine call returned Error
20230907 125011.127 ERROR            PET0766 ESMCI_Info.C:668 Info::erase() Not found  - [json.exception.out_of_range.403] key 'GridCornerLats:' not found
20230907 125011.127 ERROR            PET0766 ESMCI_Info.C:688 Info::erase() Not found  - Internal subroutine call returned Error
20230907 125011.127 ERROR            PET0766 ESMC_InfoCDef.C:243 ESMC_InfoErase() Not found  - Internal subroutine call returned Error
20230907 125011.127 ERROR            PET0766 ESMF_Info.F90:2656 ESMF_InfoRemove() Not found  - Internal subroutine call returned Error
20230907 125011.127 ERROR            PET0766 src/Superstructure/AttributeAPI/interface/ESMF_Attribute.F90:46022 ESMF_AttributeRemoveAttPackGrid( Not found  - Internal subroutine call returned Error
20230907 125021.303 INFO             PET0766 Model Advance: before wrtcomp run
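
For anyone triaging a log like the one above, here is a minimal Python sketch that pulls out which keys are reported as not found and on which PET. The field layout (date, time, level, PET, message) is inferred from the lines shown in this issue only; the function names are hypothetical and this is not an ESMF tool.

```python
import re
from collections import Counter

# Layout inferred from the excerpt above: date, time, level, PET id, message.
LINE_RE = re.compile(
    r"^(?P<date>\d{8}) (?P<time>\d{6}\.\d{3}) (?P<level>[A-Z]+)\s+"
    r"(?P<pet>PET\d+) (?P<msg>.*)$"
)
KEY_RE = re.compile(r"key '([^']+)' not found")


def parse_line(line):
    """Return the fields of one PET log line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def count_missing_keys(lines):
    """Count ERROR lines that name a missing key, grouped by (PET, key)."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is None or rec["level"] != "ERROR":
            continue
        key = KEY_RE.search(rec["msg"])
        if key:
            counts[(rec["pet"], key.group(1))] += 1
    return counts


# Lines copied from the log excerpt in this issue.
SAMPLE = [
    "20230907 124949.789 INFO             PET0766 UFS Aerosols: Advancing from 2013-04-01T00:00:00 to 2013-04-01T00:05:00",
    "20230907 125011.127 ERROR            PET0766 ESMCI_Info.C:668 Info::erase() Not found  - [json.exception.out_of_range.403] key 'GridCornerLons:' not found",
    "20230907 125011.127 ERROR            PET0766 ESMCI_Info.C:688 Info::erase() Not found  - Internal subroutine call returned Error",
    "20230907 125011.127 ERROR            PET0766 ESMCI_Info.C:668 Info::erase() Not found  - [json.exception.out_of_range.403] key 'GridCornerLats:' not found",
]

if __name__ == "__main__":
    for (pet, key), n in sorted(count_missing_keys(SAMPLE).items()):
        print(pet, key, n)
```

Only the lines that name a key are counted; the "Internal subroutine call returned Error" lines are the same failure propagating up the call stack.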

To Reproduce:

Compile the feature/add_c5 branch in DEBUG mode and run the cpld_bmark_p8 test.

ulmononian commented 1 year ago

i had 100% of tests pass about a week ago or so on c5. so the debug test runs to completion here but you are seeing these errors anyway?

DeniseWorthen commented 1 year ago

Yes, the test runs, so it is not fatal. I don't know what it means actually. It must be in MAPL?

natalie-perlin commented 10 months ago

@DeniseWorthen @ulmononian - does spack-stack need an esmf-debug module to run cpld_control_nowave_noaero_p8 and cpld_bmark_p8 in debug mode?

DeniseWorthen commented 10 months ago

No, we no longer use esmf built w/ debug.

zach1221 commented 10 months ago

Hi @DeniseWorthen. Below are my experiment directories for cpld_control_nowave_noaero_p8_intel and cpld_bmark_p8_intel on Gaea C5, if you want to take a look at the PET logs. If you don't have access to view them, let me know.

/lustre/f2/scratch/Zachary.Shrader/FV3_RT/rt_76751/cpld_control_nowave_noaero_p8_intel
/lustre/f2/scratch/Zachary.Shrader/FV3_RT/rt_76751/cpld_bmark_p8_intel

DeniseWorthen commented 10 months ago

@zach1221 The same behaviour appears to be present. No PET error messages in the cpld_control_nowave_noaero_p8_intel test, which does not include aerosols. In the cpld_bmark_p8_intel test, the only PET log which does not contain the error message is PET0767 (the last atm PET).
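
A check like this can be scripted. The following is a minimal Python sketch, assuming the PET logs in a run directory are named like PET0766.ESMF_LogFile and that the log level is the third whitespace-separated field on each line, as in the excerpt in the issue description; the function names are hypothetical.

```python
from pathlib import Path


def has_error(path):
    """True if any line in the log file has ERROR as its third field."""
    with open(path, errors="replace") as fh:
        return any(line.split()[2:3] == ["ERROR"] for line in fh)


def pets_without_errors(run_dir, pattern="PET*.ESMF_LogFile"):
    """List the PET ids (e.g. 'PET0767') whose log files contain no ERROR line."""
    clean = []
    for path in sorted(Path(run_dir).glob(pattern)):
        if not has_error(path):
            clean.append(path.name.split(".")[0])
    return clean
```

Calling pets_without_errors on a run directory returns the PETs whose logs are clean, which makes it easy to see whether the last ATM PET is the only exception.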

natalie-perlin commented 9 months ago

@DeniseWorthen @zach1221

DeniseWorthen commented 9 months ago

@zach1221 @jkbk2004 The behavior for these two cases is the same on Hercules. That is, there is no ERROR in the test without aerosols, i.e., cpld_control_noaero_p8_intel. There is a PET log error on all but the last ATM PE for the cpld_bmark_p8 test. I also checked the low-resolution case, cpld_control_p8_intel, and it shows the same thing: ERROR on all but the last ATM PE. Note that aerosols run on the same PEs as the ATM. I believe this is most likely a MAPL issue and not an ESMF one.

natalie-perlin commented 9 months ago

@DeniseWorthen - A working directory on Gaea C5 with the cpld_bmark_p8 test is /lustre/f2/scratch/ncep/Natalie.Perlin/FV3_RT/rt_221558/cpld_bmark_p8_intel, in case you want to check whether it shows the same behavior.

DeniseWorthen commented 9 months ago

@natalie-perlin I had checked @zach1221's run directories earlier and confirmed that the same error message is present as in the original issue. I later confirmed that the same message is also present in a Hercules run, so it isn't specific to C5. I believe it is most likely a MAPL issue.

jkbk2004 commented 9 months ago

I think GSFC people might be able to access Hercules. It would be a great starting point if we could set up a MAPL debugging installation and experiment. @mathomp4 Is Jiang still available?

mathomp4 commented 9 months ago

@jkbk2004 Yes. @weiyuan-jiang should be able to help. That said, I'm not sure he (or any of us) have access to Hercules. We do have access to Orion.

weiyuan-jiang commented 9 months ago

I think I have access to Hercules. Judging from the logged error, it looks more like an ESMF problem (perhaps due to how it was built?). But I will take a look at it.

DeniseWorthen commented 9 months ago

@weiyuan-jiang Thanks for checking. The reason I suspect MAPL is that we do not see the error in cases w/o the aerosol component.

weiyuan-jiang commented 9 months ago

What version of gocart and MAPL are you using? @DeniseWorthen

DeniseWorthen commented 9 months ago

I'm not sure how to tell which version of gocart we are using, but for MAPL we are using 2.35.2-esmf-8.4.2

jkbk2004 commented 9 months ago

GOCART hash: https://github.com/GEOS-ESM/GOCART/tree/041422934cae1570f2f0e67239d5d89f11c6e1b7

zach1221 commented 9 months ago

Looks like 2.1.1 for gocart

weiyuan-jiang commented 9 months ago

I have talked to Ben. He thought there might be a small chance that an attribute is not there in the MAPL_GetHorzIJIndex call. (It is confusing, though, because it should not produce an error.) Anyway, I have replaced that call in the branch. Would you please try this new MAPL?
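
To illustrate the general failure pattern only, here is a minimal Python sketch in which a plain dict stands in for the attribute store. It is not MAPL or ESMF code and not the change made in the branch; the return codes and function names are hypothetical. Erasing a key that is absent logs an error and returns a non-zero code, and if the caller ignores that code the run continues, which would be consistent with the non-fatal messages in the PET logs. Checking for the key first avoids the log entry.

```python
RC_SUCCESS = 0
RC_NOT_FOUND = 1  # hypothetical return codes, for illustration only


def erase(info, key, log):
    """Erase a key; log an error and return RC_NOT_FOUND if it is absent."""
    if key not in info:
        log.append(f"ERROR Info::erase() Not found - key '{key}' not found")
        return RC_NOT_FOUND
    del info[key]
    return RC_SUCCESS


def erase_if_present(info, key, log):
    """Erase a key only when it exists, so a missing key is silently skipped."""
    if key not in info:
        return RC_SUCCESS
    return erase(info, key, log)


if __name__ == "__main__":
    log = []
    grid_info = {"GridCornerLons:": [0.0, 1.0]}
    erase(grid_info, "GridCornerLats:", log)             # absent: logs an error
    erase_if_present(grid_info, "GridCornerLats:", log)  # absent: no new log entry
    print(len(log))
```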

DeniseWorthen commented 9 months ago

Since this error is not specific to C5 (it also shows up on Hercules), I've edited the issue title.