illinois-ceesd / mirgecom

MIRGE-Com is the workhorse simulation application for the Center for Exascale-Enabled Scramjet Design at the University of Illinois.

Need restart #139

Open MTCam opened 3 years ago

MTCam commented 3 years ago

A restart capability will be required in order to run long enough for meaningful flow simulations. We will need these capabilities:

cc: @inducer @anderson2981

inducer commented 3 years ago

> Optionally restarting with a different partitioning (e.g. different number of MPI ranks)

I'd vote that this is out of scope for round one. It would take a while to implement, especially if we're not willing to centralize the mesh and DOF data on a single rank.
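A minimal sketch of what a same-partitioning restart could look like, assuming each rank pickles its own data and records the rank count so that a restart on a different number of ranks is rejected rather than silently misread. The function names, file naming scheme, and dictionary keys are hypothetical and are not mirgecom's API; ranks are plain integers here so the example runs without MPI.

```python
import os
import pickle
import tempfile


def write_restart(directory, rank, num_ranks, step, t, state):
    """Write one restart file per rank (hypothetical naming scheme)."""
    path = os.path.join(directory, f"restart-{step:06d}-{rank:04d}.pkl")
    with open(path, "wb") as f:
        pickle.dump(
            {"num_ranks": num_ranks, "step": step, "t": t, "state": state}, f
        )
    return path


def read_restart(path, num_ranks):
    """Read a rank-local restart file, refusing a changed partitioning."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    if data["num_ranks"] != num_ranks:
        raise ValueError(
            f"restart written on {data['num_ranks']} ranks, "
            f"but this run uses {num_ranks}"
        )
    return data


with tempfile.TemporaryDirectory() as tmp:
    path = write_restart(tmp, rank=0, num_ranks=4, step=100, t=0.25,
                         state=[1.0, 2.0, 3.0])

    # Same partitioning: the data comes back unchanged.
    restored = read_restart(path, num_ranks=4)

    # Different partitioning: out of scope for this sketch, so it is rejected.
    try:
        read_restart(path, num_ranks=2)
        mismatch_rejected = False
    except ValueError:
        mismatch_rejected = True

print(restored["step"], restored["t"], restored["state"], mismatch_rejected)
```

Supporting a different rank count would need some form of redistribution of mesh and DOF data, which this sketch deliberately leaves out.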

> recording [...] some user-defined dependent variables (e.g. temperature)

Why do we need to save these?

> verification the same advanced state is simulated at step N + M regardless of intermediate restarts

For multi-step time integration (i.e. not us, yet), this would entail also saving a good chunk of time stepper state (vs. just re-bootstrapping the time stepper). How important is this "exact restart"?

inducer commented 3 years ago

#140

MTCam commented 3 years ago

> > Optionally restarting with a different partitioning (e.g. different number of MPI ranks)
>
> I'd vote that this is out of scope for round one. It would take a while to implement, especially if we're not willing to centralize the mesh and DOF data on a single rank.

Totally agree. It would be good to leave this on the radar, however. We need to handle changing resource availability for production runs. This applies even to lead-up science runs: consider the situation where a big resource is used to run several flow-throughs, then a much smaller resource is used to run many "shots" or ignition instances.

> > recording [...] some user-defined dependent variables (e.g. temperature)
>
> Why do we need to save these?

That's a great question. We can discuss it with JBF, Anderson, and Esteban. Perhaps we can do this better (or already have), but here is the issue, stated in a meta sort of way:

Currently the temperature (T) is calculated as a function of the state and the previous temperature, i.e. T = temperature(state, Tguess).

For Cantera, the user cannot specify Tguess! Cantera just uses the internal state it kept from its last call! Because we use a single instance of Cantera to calculate many points, the answers we get from Cantera depend on partitioning: partitioning affects the point ordering, and each call of Cantera just starts its iterations from Tguess = Tlastpoint.

Prometheus goes one step further by providing an API to specify Tguess. So for us, Tguess is the temperature the given point had the last time it was evaluated. Because we store T (i.e. at runtime and at I/O time), we have T available to use for Tguess, but if we don't store it, then our Tguess is lost.
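To make the ordering dependence concrete, here is a toy sketch. It does not use Cantera or Prometheus: the energy function, the Newton solver, and the `StatefulSolver` class are all invented stand-ins that only mimic the behavior described above (guess taken from the last call versus guess supplied per point).

```python
def energy(T):
    """Toy internal energy with a temperature-dependent heat capacity."""
    return 700.0 * T + 0.1 * T * T


def denergy_dT(T):
    return 700.0 + 0.2 * T


def solve_temperature(e_target, t_guess, rtol=1e-8, max_iter=50):
    """Newton iteration for energy(T) = e_target with a residual tolerance."""
    T = t_guess
    for _ in range(max_iter):
        resid = e_target - energy(T)
        if abs(resid) <= rtol * abs(e_target):
            break
        T += resid / denergy_dT(T)
    return T


class StatefulSolver:
    """Stand-in for a solver whose guess is whatever the last call left."""

    def __init__(self, t_init=300.0):
        self.t_last = t_init

    def temperature(self, e_target):
        self.t_last = solve_temperature(e_target, self.t_last)
        return self.t_last


def solve_in_order(e_targets, order):
    solver = StatefulSolver()
    return {i: solver.temperature(e_targets[i]) for i in order}


e_targets = [energy(T) for T in (400.0, 900.0, 1500.0, 2200.0)]
forward_order, reverse_order = [0, 1, 2, 3], [3, 2, 1, 0]

# Guess inherited from the previous point: result depends on point ordering.
fwd = solve_in_order(e_targets, forward_order)
rev = solve_in_order(e_targets, reverse_order)
max_diff_stateful = max(abs(fwd[i] - rev[i]) for i in forward_order)

# Guess stored per point (hypothetical values, e.g. T read from a restart
# file): the result for each point no longer depends on the ordering.
stored_T = [410.0, 880.0, 1520.0, 2150.0]
a = {i: solve_temperature(e_targets[i], stored_T[i]) for i in forward_order}
b = {i: solve_temperature(e_targets[i], stored_T[i]) for i in reverse_order}
max_diff_stored = max(abs(a[i] - b[i]) for i in forward_order)

print("stateful guess, max difference between orderings:", max_diff_stateful)
print("stored guess,   max difference between orderings:", max_diff_stored)
```

Both variants converge to within the tolerance; the difference is only in whether the converged values are reproducible when the point ordering changes.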

> > verification the same advanced state is simulated at step N + M regardless of intermediate restarts
>
> For multi-step time integration (i.e. not us, yet), this would entail also saving a good chunk of time stepper state (vs. just re-bootstrapping the time stepper). How important is this "exact restart"?

Experience tells me that deterministic restart is quite important, but I can also imagine some cases in which that would not be a show-stopper. We should bring this up with the physics guys.
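As a rough illustration of why "exact restart" needs the time stepper's history, here is a toy sketch using two-step Adams-Bashforth on a scalar ODE. The ODE, step size, and function names are made up for the example and are not mirgecom code. Saving the previous right-hand side alongside the state reproduces the uninterrupted run bit for bit; saving only the state and re-bootstrapping with a forward Euler step does not.

```python
def rhs(y):
    """Right-hand side of the toy ODE y' = -y^2 (hypothetical test problem)."""
    return -y * y


def ab2_run(y, nsteps, dt, f_prev=None):
    """Advance nsteps with two-step Adams-Bashforth.

    If f_prev is None there is no history yet, so the first step is
    bootstrapped with forward Euler. Returns the new state and the last
    right-hand side, which is the stepper state needed for an exact restart.
    """
    for _ in range(nsteps):
        f = rhs(y)
        if f_prev is None:
            y = y + dt * f
        else:
            y = y + dt * (1.5 * f - 0.5 * f_prev)
        f_prev = f
    return y, f_prev


dt, n, m = 0.01, 50, 50

# Reference: N + M steps without interruption.
y_ref, _ = ab2_run(1.0, n + m, dt)

# Run N steps and "write a restart" holding both state and stepper history.
y_n, f_n = ab2_run(1.0, n, dt)

# Exact restart: resume with the saved history.
y_exact, _ = ab2_run(y_n, m, dt, f_prev=f_n)

# Re-bootstrapped restart: only the state was saved, history is rebuilt.
y_boot, _ = ab2_run(y_n, m, dt)

print("exact restart matches reference:", y_exact == y_ref)
print("re-bootstrapped restart differs by:", abs(y_boot - y_ref))
```

The re-bootstrapped run is still a valid approximation of the solution; it is just not the same sequence of states, which is what the N + M verification would detect.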