arporter closed this 2 months ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 99.86%. Comparing base (0820679) to head (7a4c91b).
Ready for review from @hiker or @sergisiso or @TeranIvy. I'll fire off the integration tests now.
Strangely, the ECMWF/OMP run of NEMO in the integration tests returned an error code of 1, despite everything appearing to be fine. I suggest we re-check once the reviewing is done.
@arporter I re-ran the job that failed the other day and it failed again today, despite seemingly completing the action, and the failure does not appear to be related to the code here. I will explore a bit more before deciding what to do. I'll launch a master integration test now to double-check.
@arporter Unfortunately there is still something strange here. I launched integration tests for master and they succeeded; then I launched them again for the last commit of this PR and they failed again. We need to explore what the issue is.
Ok. I'll take a look. Thanks for double checking with master.
The failure is because the values of the 'answers' have changed:
it : 10 |ssh|_max: 0.1997143892083802D+01 |U|_max: 0.1484097336815946D+01 S_min: 0.1085307757483717D+01 S_max: 0.4040450495987173D+02
vs
it : 10 |ssh|_max: 0.1992739578293067D+01 |U|_max: 0.1480481366230695D+01 S_min: 0.1091403831970652D+01 S_max: 0.4040450495350832D+02
The check we are using is just, e.g.:
tail -n 1 output.txt | grep -q "|ssh|_max: 0.199714"
which is why the job is simply flagged as failing without any explanatory text! We should do a bit better than this.
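One way to make the check more informative would be to print both the expected fragment and the actual final line on mismatch, instead of failing silently. A minimal sketch (the `check_answer` helper and file names are hypothetical; the expected fragment is the one from the CI check above):

```shell
# check_answer FILE EXPECTED: succeed quietly if the last line of FILE
# contains EXPECTED; otherwise print expected vs. actual to stderr and fail.
check_answer() {
    file="$1"
    expected="$2"
    actual="$(tail -n 1 "$file")"
    if printf '%s\n' "$actual" | grep -q "$expected"; then
        return 0
    fi
    echo "FAIL: expected last line of $file to contain '$expected'" >&2
    echo "      got: $actual" >&2
    return 1
}

# Usage in the CI script, e.g.:
# check_answer run.stat "|ssh|_max: 0.199714"
```

With this, a wrong-answer failure in the CI log would show the actual values rather than just a non-zero exit code.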
I can't see how what this PR has done could have changed the answers. However, it could be that the results are not reproducible run-to-run because of OMP threading?
I've logged in and repeatedly run the NEMO binary built by the last run of the CI. The first three or four runs all gave the right answer, and then I got another one with:
it : 10 |ssh|_max: 0.1992739578293067D+01 |U|_max: 0.1480481366230695D+01 S_min: 0.1091403831970652D+01 S_max: 0.4040450495350832D+02
It therefore does look as though we do not have run-to-run reproducibility when threading is enabled.
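The manual repetition could be scripted: run the binary N times and compare each run's stats file against the first. A sketch (the `reproducibility_check` helper is hypothetical; `./nemo` and `run.stat` are the binary and stats file from the sessions below):

```shell
# reproducibility_check BINARY STATFILE [RUNS]: run BINARY RUNS times
# (default 5) and check that STATFILE is byte-identical after every run.
reproducibility_check() {
    binary="$1"
    statfile="$2"
    runs="${3:-5}"
    # First run establishes the reference output.
    "$binary" >/dev/null && cp "$statfile" reference.stat || return 2
    i=2
    while [ "$i" -le "$runs" ]; do
        "$binary" >/dev/null || return 2
        if ! cmp -s reference.stat "$statfile"; then
            echo "Run $i differs from run 1:" >&2
            diff reference.stat "$statfile" >&2
            return 1
        fi
        i=$((i + 1))
    done
    echo "All $runs runs identical"
}

# Usage, e.g.: reproducibility_check ./nemo run.stat 10
```

This would let the CI flag non-reproducibility explicitly rather than only failing when an unlucky run lands on the wrong answer.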
This is running without MPI, as confirmed by examining the ocean.output. Since we don't support OpenMP reductions yet (@sergisiso can correct me), I think this must mean we are generating incorrect code somewhere, such that the answer can vary from run to run:
[gh_runner@glados EXP00]$ ./nemo
0
[EXP00]$ tail run.stat
it : 1 |ssh|_max: 0.1981591870348679D+01 |U|_max: 0.1450047508366849D+01 S_min: 0.1085786592269819D+01 S_max: 0.4040453941664692D+02
it : 2 |ssh|_max: 0.1982752270897651D+01 |U|_max: 0.1452597101488951D+01 S_min: 0.1086240692446414D+01 S_max: 0.4040453356389120D+02
it : 3 |ssh|_max: 0.1983728378860238D+01 |U|_max: 0.1456033972148181D+01 S_min: 0.1086709090430587D+01 S_max: 0.4040453148092196D+02
it : 4 |ssh|_max: 0.1985032463391237D+01 |U|_max: 0.1458752676798451D+01 S_min: 0.1087192969684326D+01 S_max: 0.4040452638526664D+02
it : 5 |ssh|_max: 0.1987016011252258D+01 |U|_max: 0.1461991884818293D+01 S_min: 0.1087927288751745D+01 S_max: 0.4040452369825938D+02
it : 6 |ssh|_max: 0.1988687809217617D+01 |U|_max: 0.1464525354307282D+01 S_min: 0.1089032236567440D+01 S_max: 0.4040451908376322D+02
it : 7 |ssh|_max: 0.1990265207831180D+01 |U|_max: 0.1466978371608125D+01 S_min: 0.1090261073753903D+01 S_max: 0.4040451600903567D+02
it : 8 |ssh|_max: 0.1991024605059977D+01 |U|_max: 0.1469837868004808D+01 S_min: 0.1091143424902952D+01 S_max: 0.4040451251993338D+02
it : 9 |ssh|_max: 0.1991839479539447D+01 |U|_max: 0.1476103197378707D+01 S_min: 0.1091501632524920D+01 S_max: 0.4040450854133136D+02
it : 10 |ssh|_max: 0.1992739578293067D+01 |U|_max: 0.1480481366230695D+01 S_min: 0.1091403831970652D+01 S_max: 0.4040450495350832D+02
[gh_runner@glados EXP00]$ ./nemo
0
[EXP00]$ tail run.stat
it : 1 |ssh|_max: 0.1982273303674564D+01 |U|_max: 0.1450121491443233D+01 S_min: 0.1085442700023809D+01 S_max: 0.4040453941666587D+02
it : 2 |ssh|_max: 0.1984024621757109D+01 |U|_max: 0.1452825071445360D+01 S_min: 0.1085448017971859D+01 S_max: 0.4040453356397503D+02
it : 3 |ssh|_max: 0.1986609691379131D+01 |U|_max: 0.1456493639861315D+01 S_min: 0.1085468392706552D+01 S_max: 0.4040453148195538D+02
it : 4 |ssh|_max: 0.1989265369848474D+01 |U|_max: 0.1459493046583662D+01 S_min: 0.1085536080926007D+01 S_max: 0.4040452638693321D+02
it : 5 |ssh|_max: 0.1991058793825424D+01 |U|_max: 0.1462901396603791D+01 S_min: 0.1085593707188744D+01 S_max: 0.4040452370138181D+02
it : 6 |ssh|_max: 0.1992391646218621D+01 |U|_max: 0.1465286602430546D+01 S_min: 0.1085569560501641D+01 S_max: 0.4040451908764371D+02
it : 7 |ssh|_max: 0.1993486308863192D+01 |U|_max: 0.1468547453847026D+01 S_min: 0.1085470698091813D+01 S_max: 0.4040451601402300D+02
it : 8 |ssh|_max: 0.1994310962659264D+01 |U|_max: 0.1473147728648601D+01 S_min: 0.1085428657280329D+01 S_max: 0.4040451252509232D+02
it : 9 |ssh|_max: 0.1995525151105640D+01 |U|_max: 0.1479540291553064D+01 S_min: 0.1085345073646775D+01 S_max: 0.4040450854739036D+02
it : 10 |ssh|_max: 0.1997143892083802D+01 |U|_max: 0.1484097336815946D+01 S_min: 0.1085307757483717D+01 S_max: 0.4040450495987173D+02
Could this be due to (or exacerbated by) the new collapsing functionality/update in #2660?
HoistTrans has also been updated in #2662. In theory that should have improved things, not made them worse, but you never know...
Anyway, I'll open a new Issue for this problem as it's not the fault of this PR. Back to @sergisiso.
I've removed the utility method entirely because it's only called from one location and that has meant I can remove all of the associated test code too.