parthenon-hpc-lab / parthenon

Parthenon AMR infrastructure
https://parthenon-hpc-lab.github.io/parthenon/
Other
112 stars 33 forks source link

Fix reading restarts due to hidden ghost var #1104

Closed pgrete closed 3 months ago

pgrete commented 3 months ago

PR Summary

hasGhost was defined both in restart.hpp and restart_hdf5.hpp We only set the value inside the constructor of the derived class but in the parthenon manager we work with the base class. The latter actually set the data based on bounds defined by hasGhosts so it made a decision based on uninitialized data. On Frontier I could reliably reproduce the issue (both with cray and amdclang/hip compilers), whereas on other machines/compilers hasGhosts was set by default/accident to 0 in the base class (though also rendering reading restarts with ghosts unusable in current develop). Instead of just removing the duplicated declaration I decided to remove public variable entirely and go with a getter function. I also get the data type to int for backwards compatibility and in case we eventually decide to do sth more with this rather than a bool check.

PS: With this (hopefully) last PR, we can finally restart/do some analysis with AthenaPK in sync with Parthenon develop again.

PR Checklist

BenWibking commented 3 months ago

I don't understand how the CI passed with this bug...?

Also: would it make sense to add an action to run the CI with ASAN enabled to catch unitialized value bugs in general?

lroberts36 commented 3 months ago

I don't understand how the CI passed with this bug...?

I think that hasGhost = 0 is the correct value for all of the restart tests, so if it was default initialized to zero then things would work even with the bug.

Also: would it make sense to add an action to run the CI with ASAN enabled to catch unitialized value bugs in general?

I agree that this would be a good idea.