arporter closed this 2 months ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 99.86%. Comparing base (0820679) to head (7a4c91b).
Ready for review from @hiker or @sergisiso or @TeranIvy. I'll fire off the integration tests now.
Strangely, the ECMWF/OMP run of NEMO in the integration tests returned an error code of 1, despite everything appearing to be fine. I suggest we re-check once the reviewing is done.
@arporter I re-ran the job that failed the other day and it failed again today, despite seemingly completing the action, and the failure does not appear to be related to the code here. I will explore a bit more before deciding what to do. I'll launch a master integration test now to double-check.
@arporter Unfortunately there is still something strange here. I launched integration tests for master and they succeeded; then I launched them again for the last commit of this PR and they failed again. We need to explore what the issue is.
Ok. I'll take a look. Thanks for double checking with master.
The failure is because the values of the 'answers' have changed:
it : 10 |ssh|_max: 0.1997143892083802D+01 |U|_max: 0.1484097336815946D+01 S_min: 0.1085307757483717D+01 S_max: 0.4040450495987173D+02
vs
it : 10 |ssh|_max: 0.1992739578293067D+01 |U|_max: 0.1480481366230695D+01 S_min: 0.1091403831970652D+01 S_max: 0.4040450495350832D+02
The check we are using is just, e.g.:
tail -n 1 output.txt | grep -q "|ssh|_max: 0.199714"
which is why the job is simply flagged as failing without any explanatory text! We should do a bit better than this.
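One way to make the check more informative would be to print both the expected fragment and the actual final line on mismatch, instead of failing silently. A minimal sketch (the `check_answer` helper and file names are hypothetical; the expected fragment is the one from the CI check above):

```shell
# check_answer FILE EXPECTED: succeed quietly if the last line of FILE
# contains EXPECTED; otherwise print expected vs. actual to stderr and fail.
check_answer() {
    file="$1"
    expected="$2"
    actual="$(tail -n 1 "$file")"
    if printf '%s\n' "$actual" | grep -q "$expected"; then
        return 0
    fi
    echo "FAIL: expected last line of $file to contain '$expected'" >&2
    echo "      got: $actual" >&2
    return 1
}

# Usage in the CI script, e.g.:
# check_answer run.stat "|ssh|_max: 0.199714"
```

With this, a wrong-answer failure in the CI log would show the actual values rather than just a non-zero exit code.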
I can't see how what this PR has done could have changed the answers. However, it could be that the results are not reproducible run-to-run because of OMP threading?
I've logged in and repeatedly run the NEMO binary built by the last run of the CI. The first three or four runs all gave the right answer, and then I got another one with:
it : 10 |ssh|_max: 0.1992739578293067D+01 |U|_max: 0.1480481366230695D+01 S_min: 0.1091403831970652D+01 S_max: 0.4040450495350832D+02
It therefore does look as though we do not have run-to-run reproducibility when threading is enabled.
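The manual repetition could be scripted: run the binary N times and compare each run's stats file against the first. A sketch (the `reproducibility_check` helper is hypothetical; `./nemo` and `run.stat` are the binary and stats file from the sessions below):

```shell
# reproducibility_check BINARY STATFILE [RUNS]: run BINARY RUNS times
# (default 5) and check that STATFILE is byte-identical after every run.
reproducibility_check() {
    binary="$1"
    statfile="$2"
    runs="${3:-5}"
    # First run establishes the reference output.
    "$binary" >/dev/null && cp "$statfile" reference.stat || return 2
    i=2
    while [ "$i" -le "$runs" ]; do
        "$binary" >/dev/null || return 2
        if ! cmp -s reference.stat "$statfile"; then
            echo "Run $i differs from run 1:" >&2
            diff reference.stat "$statfile" >&2
            return 1
        fi
        i=$((i + 1))
    done
    echo "All $runs runs identical"
}

# Usage, e.g.: reproducibility_check ./nemo run.stat 10
```

This would let the CI flag non-reproducibility explicitly rather than only failing when an unlucky run lands on the wrong answer.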
This is running without MPI, as confirmed by examining the ocean.output. Since we don't support OpenMP reductions yet (@sergisiso can correct me), I think this must mean we are generating incorrect code somewhere, such that the answer can vary from run to run:
[gh_runner@glados EXP00]$ ./nemo
0
[EXP00]$ tail run.stat
it : 1 |ssh|_max: 0.1981591870348679D+01 |U|_max: 0.1450047508366849D+01 S_min: 0.1085786592269819D+01 S_max: 0.4040453941664692D+02
it : 2 |ssh|_max: 0.1982752270897651D+01 |U|_max: 0.1452597101488951D+01 S_min: 0.1086240692446414D+01 S_max: 0.4040453356389120D+02
it : 3 |ssh|_max: 0.1983728378860238D+01 |U|_max: 0.1456033972148181D+01 S_min: 0.1086709090430587D+01 S_max: 0.4040453148092196D+02
it : 4 |ssh|_max: 0.1985032463391237D+01 |U|_max: 0.1458752676798451D+01 S_min: 0.1087192969684326D+01 S_max: 0.4040452638526664D+02
it : 5 |ssh|_max: 0.1987016011252258D+01 |U|_max: 0.1461991884818293D+01 S_min: 0.1087927288751745D+01 S_max: 0.4040452369825938D+02
it : 6 |ssh|_max: 0.1988687809217617D+01 |U|_max: 0.1464525354307282D+01 S_min: 0.1089032236567440D+01 S_max: 0.4040451908376322D+02
it : 7 |ssh|_max: 0.1990265207831180D+01 |U|_max: 0.1466978371608125D+01 S_min: 0.1090261073753903D+01 S_max: 0.4040451600903567D+02
it : 8 |ssh|_max: 0.1991024605059977D+01 |U|_max: 0.1469837868004808D+01 S_min: 0.1091143424902952D+01 S_max: 0.4040451251993338D+02
it : 9 |ssh|_max: 0.1991839479539447D+01 |U|_max: 0.1476103197378707D+01 S_min: 0.1091501632524920D+01 S_max: 0.4040450854133136D+02
it : 10 |ssh|_max: 0.1992739578293067D+01 |U|_max: 0.1480481366230695D+01 S_min: 0.1091403831970652D+01 S_max: 0.4040450495350832D+02
[gh_runner@glados EXP00]$ ./nemo
0
[EXP00]$ tail run.stat
it : 1 |ssh|_max: 0.1982273303674564D+01 |U|_max: 0.1450121491443233D+01 S_min: 0.1085442700023809D+01 S_max: 0.4040453941666587D+02
it : 2 |ssh|_max: 0.1984024621757109D+01 |U|_max: 0.1452825071445360D+01 S_min: 0.1085448017971859D+01 S_max: 0.4040453356397503D+02
it : 3 |ssh|_max: 0.1986609691379131D+01 |U|_max: 0.1456493639861315D+01 S_min: 0.1085468392706552D+01 S_max: 0.4040453148195538D+02
it : 4 |ssh|_max: 0.1989265369848474D+01 |U|_max: 0.1459493046583662D+01 S_min: 0.1085536080926007D+01 S_max: 0.4040452638693321D+02
it : 5 |ssh|_max: 0.1991058793825424D+01 |U|_max: 0.1462901396603791D+01 S_min: 0.1085593707188744D+01 S_max: 0.4040452370138181D+02
it : 6 |ssh|_max: 0.1992391646218621D+01 |U|_max: 0.1465286602430546D+01 S_min: 0.1085569560501641D+01 S_max: 0.4040451908764371D+02
it : 7 |ssh|_max: 0.1993486308863192D+01 |U|_max: 0.1468547453847026D+01 S_min: 0.1085470698091813D+01 S_max: 0.4040451601402300D+02
it : 8 |ssh|_max: 0.1994310962659264D+01 |U|_max: 0.1473147728648601D+01 S_min: 0.1085428657280329D+01 S_max: 0.4040451252509232D+02
it : 9 |ssh|_max: 0.1995525151105640D+01 |U|_max: 0.1479540291553064D+01 S_min: 0.1085345073646775D+01 S_max: 0.4040450854739036D+02
it : 10 |ssh|_max: 0.1997143892083802D+01 |U|_max: 0.1484097336815946D+01 S_min: 0.1085307757483717D+01 S_max: 0.4040450495987173D+02
Could this be due to (or exacerbated by) the new collapsing functionality/update in #2660?
HoistTrans has also been updated in #2662. In theory that should have improved things, not made them worse, but you never know...
Anyway, I'll open a new Issue for this problem as it's not the fault of this PR. Back to @sergisiso.
I've removed the utility method entirely because it's only called from one location and that has meant I can remove all of the associated test code too.