MTCam commented 3 years ago

A restart capability will be required in order to run long enough for meaningful flow simulations. We will need these capabilities:

[ ] recording the conserved quantities, time, step number, and some user-defined dependent variables (e.g. temperature) at every point of the discretization for the purpose of restart
[ ] reading the data created in the previous step into simulation data structures
[ ] restarting the simulation with the data read in the previous step
[ ] verifying that the same advanced state is simulated at step N + M regardless of intermediate restarts
[ ] optionally restarting with a different partitioning (e.g. a different number of MPI ranks)
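
To make the checklist concrete, here is a minimal sketch of the write / read / verify cycle. It is not mirgecom's API: `write_restart`, `read_restart`, the file naming, and the toy `advance` step are all hypothetical stand-ins, and the "state" is a plain list of floats on a single rank.

```python
import os
import pickle
import tempfile


def write_restart(path, step, t, state, extra=None):
    """Dump everything needed to resume: step number, time, state, extras."""
    data = {"step": step, "t": t, "state": state, "extra": extra or {}}
    with open(path, "wb") as f:
        pickle.dump(data, f)


def read_restart(path):
    """Load the dictionary written by write_restart."""
    with open(path, "rb") as f:
        return pickle.load(f)


def advance(state, t, dt):
    """Toy forward-Euler step for dy/dt = -y (stand-in for the real RHS)."""
    return [y + dt * (-y) for y in state], t + dt


def run(state, t, step, nsteps, dt):
    """Advance nsteps steps and return the new (state, t, step)."""
    for _ in range(nsteps):
        state, t = advance(state, t, dt)
        step += 1
    return state, t, step


def check_exact_restart(n=5, m=7, dt=0.01):
    """Compare N + M uninterrupted steps against N steps, restart, M steps."""
    state0 = [1.0, 2.0, 3.0]
    ref = run(state0, 0.0, 0, n + m, dt)

    state, t, step = run(state0, 0.0, 0, n, dt)
    with tempfile.TemporaryDirectory() as tmpdir:
        # One file per rank and step; rank 0 only in this sketch.
        path = os.path.join(tmpdir, f"restart-{step:06d}-rank0.pkl")
        write_restart(path, step, t, state)
        data = read_restart(path)

    restarted = run(data["state"], data["t"], data["step"], m, dt)
    return restarted == ref  # bitwise comparison of state, time, and step
```

Calling `check_exact_restart()` performs the comparison described in the fourth item. The comparison is deliberately bitwise rather than tolerance-based, since a single-step integrator restarted from unmodified data should reproduce the uninterrupted run exactly.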

cc: @inducer @anderson2981

inducer commented 3 years ago

> Optionally restarting with a different partitioning (e.g. different number of MPI ranks)

I'd vote that this is out of scope for round one. It would take a while to implement, especially if we're not willing to centralize the mesh and DOF data on a single rank.

> recording [...] some user-defined dependent variables (e.g. temperature)

Why do we need to save these?

> verifying that the same advanced state is simulated at step N + M regardless of intermediate restarts

For multi-step time integration (i.e. not us, yet), this would entail also saving a good chunk of time stepper state (vs. just re-bootstrapping the time stepper). How important is this "exact restart"?
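
As an illustration of why a multi-step method needs more than the state, here is a small sketch using second-order Adams-Bashforth (AB2) on dy/dt = -y. The integrator and the function names are illustrative assumptions, not anything mirgecom currently uses. AB2 needs the previous right-hand-side value; a restart that drops it has to re-bootstrap with a forward-Euler step and no longer reproduces the uninterrupted run.

```python
def rhs(y):
    """Right-hand side of the toy ODE dy/dt = -y."""
    return -y


def ab2_run(y, f_prev, nsteps, dt):
    """Advance with AB2; bootstrap with forward Euler when f_prev is None.

    Returns the new solution and the last RHS value, which is the
    time-stepper state an exact restart has to save.
    """
    for _ in range(nsteps):
        f = rhs(y)
        if f_prev is None:
            y = y + dt * f  # forward-Euler bootstrap step
        else:
            y = y + dt * (1.5 * f - 0.5 * f_prev)  # AB2 step
        f_prev = f
    return y, f_prev


n, m, dt = 3, 4, 0.1

# Reference: N + M steps without interruption.
y_ref, _ = ab2_run(1.0, None, n + m, dt)

# Run N steps, then restart two different ways.
y_n, f_n = ab2_run(1.0, None, n, dt)
y_exact, _ = ab2_run(y_n, f_n, m, dt)    # stepper history saved and restored
y_reboot, _ = ab2_run(y_n, None, m, dt)  # history dropped, re-bootstrapped
```

Here `y_exact` matches `y_ref` bit for bit, while `y_reboot` differs because its first step after the restart is an Euler step instead of an AB2 step.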

inducer commented 3 years ago

140

MTCam commented 3 years ago

> > Optionally restarting with a different partitioning (e.g. different number of MPI ranks)
>
> I'd vote that this is out of scope for round one. It would take a while to implement, especially if we're not willing to centralize the mesh and DOF data on a single rank.

Totally agree. It would be good to leave this on the radar, however. We need to handle changing resource availability for production runs. This applies even to lead-up science runs: consider the situation where a big resource is used to run several flow-throughs, then a much smaller resource is used to run many "shots" or ignition instances.

> > recording [...] some user-defined dependent variables (e.g. temperature)
>
> Why do we need to save these?

That's a great question. We can discuss it with JBF, Anderson, and Esteban - perhaps we can do this better (or we already have) - but here is the issue stated in a meta sort of way:

Currently the temperature (T) is calculated as a function of the state and the last temperature, i.e. T = temperature(state, Tguess).

With Cantera, the user cannot specify Tguess! Cantera just uses the internal state it kept from its last call. Because we use a single instance of Cantera to calculate many points, the answers we get from Cantera depend on partitioning: partitioning affects the point ordering, and each Cantera call just starts its iterations from Tguess = Tlastpoint.

Prometheus does one step better by providing an API to specify Tguess. So for us, Tguess = the temperature the given point had the last time it was evaluated. Because we store T (i.e. at runtime and at I/O time), we have T available to use as Tguess, but if we don't store it, our Tguess is lost.

We could just give up and set Tguess = 300 (or whatever is appropriate in the user-chosen units) and be done; this solution has some pretty hefty performance implications,
or we can accept that we get a slightly different answer when we restart (this verges on unacceptable),
or we can define a function that returns a deterministic value for Tguess (i.e. Tguess = approximate_temp(state)) and use that as Tguess [ my preferred solution ],
or we can write out temperature as a restart quantity and restart it just like state [ current practice ].
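
The trade-off between these options can be sketched with a toy Newton solve. Everything here is an assumption for illustration: the caloric relation e(T) = A*T + B*T^2, the constants `A` and `B`, and this particular `approximate_temp` are made up, and neither Cantera nor Prometheus is called.

```python
# Toy caloric relation e(T) = A*T + B*T**2; A and B are made-up constants.
A = 700.0
B = 0.1


def energy(T):
    """Energy as a function of temperature for the toy model."""
    return A * T + B * T * T


def temperature(e, tguess, tol=1e-8, max_iter=50):
    """Newton solve of energy(T) = e started from tguess.

    Returns (T, number of iterations used).
    """
    T = tguess
    for i in range(max_iter):
        dT = (energy(T) - e) / (A + 2.0 * B * T)
        T -= dT
        if abs(dT) < tol:
            return T, i + 1
    raise RuntimeError("temperature iteration did not converge")


def approximate_temp(e):
    """Hypothetical deterministic guess: ignore the quadratic term."""
    return e / A


e = energy(1500.0)

T_cold, n_cold = temperature(e, 300.0)    # fixed guess, Tguess = 300
T_warm, n_warm = temperature(e, 1499.0)   # stored T from the last evaluation
T_det, n_det = temperature(e, approximate_temp(e))  # guess from state only
```

In this toy, the warm start takes fewer Newton iterations than the fixed guess of 300, which is the performance cost of the first option. The converged values agree to within the solver tolerance but are not guaranteed to be bitwise identical, which is the restart discrepancy of the second option. A guess computed from the state alone gives the same result no matter what was evaluated before, so it needs no stored temperature.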

> > verifying that the same advanced state is simulated at step N + M regardless of intermediate restarts
>
> For multi-step time integration (i.e. not us, yet), this would entail also saving a good chunk of time stepper state (vs. just re-bootstrapping the time stepper). How important is this "exact restart"?

Experience tells me that deterministic restart is quite important, but I can also imagine some cases in which that would not be a show-stopper. We should bring this up with the physics guys.

illinois-ceesd / mirgecom

Need restart #139
