stfc / PSyclone

Domain-specific compiler and code transformation system for Finite Difference/Volume/Element Earth-system models in Fortran
BSD 3-Clause "New" or "Revised" License

OpenMP version of NEMO is not run-to-run reproducible #2676

Closed arporter closed 1 month ago

arporter commented 1 month ago

As discovered in #2673: https://github.com/stfc/PSyclone/pull/2673#issuecomment-2253014754

Essentially, re-running the same binary on the same number of OMP threads can give different answers. (This is the 'ECMWF' version of NEMO in the integration tests.) We first saw this failure on 26/07/2024 at 10:08:13. Presumably, it is therefore due to (or exacerbated by) something that went onto master just before that.

arporter commented 1 month ago

Just to be clear: since we don't support OpenMP reductions for general PSyIR, there shouldn't be any OpenMP reductions in the code base. (These are not guaranteed run-to-run reproducible by the OpenMP standard.) Therefore, the fact that we are getting variability probably means we've introduced a race condition into the transformed code.
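The reason OpenMP reductions are not guaranteed run-to-run reproducible is that floating-point addition is not associative, so if the runtime combines partial sums in a different order on different runs, the result changes. A minimal Python illustration (not PSyclone code):

```python
# Floating-point addition is not associative: summing the same three
# values in two different orders gives two different results. An OpenMP
# reduction is free to combine partial results in any order, so the
# same binary can produce different answers on different runs.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False: the two orders disagree
print(left_to_right, right_to_left)
```

This is also why genuine reproducibility failures here point at a race condition rather than a reduction: there should be no reductions in the generated code at all.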

sergisiso commented 1 month ago

As this does not happen for the OpenMP offloading region (and that one even uses atomics, which can change the order of operations), I would suspect the ICE regions.

Specifically, the changes made to examples/nemo/scripts/utils.py in #2660, which I thought made the checks stricter (collapsing itself is disabled for the CPU).

sergisiso commented 1 month ago

Integration tests show a performance improvement for #2660, which I was not expecting, so the additional loops parallelised there are probably the cause.

We also need a clearer indicator that the results diff failed (at the moment the log just says the action ended with exit error 1).
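A clearer failure indicator could be produced along these lines (a hedged sketch only, not the actual integration-test harness; the function name and the placeholder result lists are made up):

```python
# Sketch: compare two runs' results and emit an explicit message on
# mismatch, instead of letting the job die with a bare exit status 1.
import sys

def compare_runs(stats_a, stats_b, tol=0.0):
    """Return True if the two result sequences agree to within tol."""
    return len(stats_a) == len(stats_b) and all(
        abs(a - b) <= tol for a, b in zip(stats_a, stats_b))

run1 = [1.0, 2.0, 3.0]   # placeholder results from the first run
run2 = [1.0, 2.0, 3.1]   # placeholder results from the second run

if not compare_runs(run1, run2):
    # An explicit, greppable message makes the CI log self-explanatory.
    print("ERROR: results differ between runs - "
          "binary is not run-to-run reproducible", file=sys.stderr)
```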

sergisiso commented 1 month ago

The reason is that we check that there are no calls under the region directive, but not under the parallel one, and therefore the CPU version parallelised loops with calls inside them. This is mostly fine (as the subroutines exist on the CPU), but some of these calls are impure (they have race conditions inside the call), and the loops get parallelised anyway on the CPU.
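The missing check can be sketched as follows. This uses toy node classes rather than the real PSyIR (the class names and `safe_to_parallelise` helper are illustrative, not PSyclone's API); the point is that the validity check must walk the loop that is actually parallelised for calls, not only the enclosing region:

```python
# Toy tree of nodes standing in for the PSyIR, with a walk() that
# returns all nodes of a given type in a subtree.
class Node:
    def __init__(self, children=None):
        self.children = children or []

    def walk(self, cls):
        """Return self (if it matches) plus all matching descendants."""
        found = [self] if isinstance(self, cls) else []
        for child in self.children:
            found.extend(child.walk(cls))
        return found

class Call(Node):
    pass

class Loop(Node):
    pass

def safe_to_parallelise(loop):
    # Reject any loop whose body contains a call: the callee may be
    # impure and introduce a race condition when iterations of the
    # parallelised loop run concurrently.
    return not loop.walk(Call)

clean = Loop([Node(), Node()])
with_call = Loop([Node([Call()])])
print(safe_to_parallelise(clean))      # True
print(safe_to_parallelise(with_call))  # False
```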

There are a couple of other issues with the new inline comments that I will also fix, but these do not affect correctness.

sergisiso commented 1 month ago

Some strange behaviour in LFRic is due to the fact that we don't clean up the previous version properly, because we have 3 p's in:

```shell
# Clean previous version and compile again
rm -rf appplications/gungho_model/working
```