multiscale / muscle3

The third major version of the MUltiScale Coupling Library and Environment
Apache License 2.0

Suspend/resume of submodels #51

Open LourensVeen opened 4 years ago

LourensVeen commented 4 years ago

Stateful micromodels are an issue when it comes to load-balancing a bunch of calls over a smaller number of instances, because you can't just reuse them. If you try, you end up mixing the state of one virtual instance with the inputs for another and get the wrong result. There are other cases where you want to be able to save the state of a submodel and resume it later or somewhere else, such as in the uncertainty control part of the SI-MC algorithm.

You can turn a stateful micromodel into a stateless micromodel by serialising the state and sending it on every O_f, and receiving it on every f_init. You'd have to split off the initialisation code into a separate component, and give the load balancer, the SI-MC component, or a special save-the-state wrapper an extra set of ports to send the state at the start of each iteration and receive it back at the end.
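A minimal sketch of that state-passing approach, in plain Python without libmuscle. The names micromodel_step and run_multiplexed and the running-sum "model" are hypothetical, made up for illustration, and pickle stands in for whatever serialisation would actually be used:

```python
import pickle
from typing import Dict, List, Optional, Tuple


def micromodel_step(state_blob: Optional[bytes],
                    value: float) -> Tuple[float, bytes]:
    """One stateless call: state comes in at f_init and goes out at O_f."""
    # f_init: restore the state, or initialise it on the first call
    if state_blob is not None:
        state = pickle.loads(state_blob)
    else:
        state = {'total': 0.0}

    # S: update the state using the input
    state['total'] += value

    # O_f: return the result together with the serialised state
    return state['total'], pickle.dumps(state)


def run_multiplexed(calls: List[Tuple[str, float]]) -> Dict[str, float]:
    """Serve calls for several virtual instances with one stateless worker."""
    saved: Dict[str, bytes] = {}     # state per virtual instance, kept by the wrapper
    results: Dict[str, float] = {}
    for virtual_id, value in calls:
        result, blob = micromodel_step(saved.get(virtual_id), value)
        saved[virtual_id] = blob
        results[virtual_id] = result
    return results


# Calls for virtual instances 'a' and 'b' are interleaved on one worker
results = run_multiplexed([('a', 1.0), ('b', 10.0), ('a', 2.0), ('b', 20.0)])
```

Because the wrapper hands each call the state belonging to its own virtual instance, interleaving 'a' and 'b' on the same worker does not mix their state. The cost is a full serialise/deserialise round trip on every call.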

This seems rather ad hoc though. It may not be obvious to a casual onlooker what's going on, and if the micromodel has a lot of state and a small result, then you don't want to serialise it all the time. In HMC, if there are not enough cores to keep all micromodels running simultaneously then you'll have to, but if there's a large load imbalance then you can still save by keeping the slower simulations running and multiplexing only the fast ones, so that the (de)serialisation overhead does not affect the long pole in the tent.

So this needs some thought, but it may be a good idea to have a specific way to suspend and resume submodels. (I don't want to do it yet, but there is fault tolerance to consider as well on very large machines, and that has been part of the long-term plan for MUSCLE 3 from the beginning and has received some thought. Any solution here should also help with that.)

Extra operators outside the reuse loop may be possible, but would complicate the SEL with something that's not needed in many cases, and it would make the reuse and wiring logic even more complicated (not sure it would even make sense any more, actually). It would help with implementing accumulators, but we can solve those in a different way as well.

Another option would be to add an implicit muscle_resume port to f_init, and a corresponding muscle_suspend to O_f. You would then declare support for suspend/resume when creating your Instance, and the code would look something like this:

instance = Instance(..., resume_support=True)

while instance.reuse_instance():
    if instance.resuming():
        msg = instance.resume()
        # set state here from msg
    else:
        msg = instance.receive('in_port')
        # init state here from inputs

    for i in range(num_steps):
        # O_i
        # send message(s) to outside
        # S
        msg = instance.receive('port')
        # update state using received message(s)

    # O_f
    # send final state here as usual

if instance.suspending():
    msg = ... # message containing state
    instance.suspend(msg)

It would be an optional extra from the users' perspective, and it's clear to anyone reading this what we're trying to do.

Implementation-wise, the prereceive would have to listen on both the muscle_resume port and the other ports (if they're connected), expecting to be sent a message either on muscle_resume or on everything else. That would require at least internal support for a receive-first operation, similar to the planned receive-on-any-slot. The implementation for the latter should enable the former as well then, or at least be compatible with it.
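To make the receive-first idea concrete, here is a rough sketch using queue.Queue objects as stand-ins for ports. The function receive_first and the polling loop are assumptions for illustration only; they are not how libmuscle works internally, and a real implementation would block on the underlying connections rather than poll:

```python
import queue
import time
from typing import Any, Dict, Tuple


def receive_first(ports: Dict[str, queue.Queue],
                  timeout: float = 1.0) -> Tuple[str, Any]:
    """Return (port name, message) for the first port that has a message."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for name, port in ports.items():
            try:
                return name, port.get_nowait()
            except queue.Empty:
                continue
        time.sleep(0.001)    # nothing yet, poll again shortly
    raise TimeoutError('no message arrived on any port')


ports = {'muscle_resume': queue.Queue(), 'in_port': queue.Queue()}

# A resuming run: the state arrives on muscle_resume, in_port stays silent
ports['muscle_resume'].put({'step': 7})
first_port, first_msg = receive_first(ports)

# A fresh run: only the regular input port delivers a message
ports['in_port'].put('initial conditions')
second_port, second_msg = receive_first(ports)
```

The name of the port on which the first message arrives tells the instance whether it is resuming or starting fresh, which is what resuming() in the snippet above would report.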

Is this overcomplicated? Would it be better if all models exclusively updated state, and state init and extracting QOIs were always done in separate components? From a modular, clean-code computer science perspective, yes; from a performance perspective, maybe not (although things could be optimised). Most importantly, though, this thing has to work in the real world and should not require detangling the pile of Fortran your predecessors left you in order to use it. So the above should be considered seriously; let's mull this over for a bit and see.

LourensVeen commented 2 years ago

Looking at this again, the if instance.suspending(): part at the end should be inside the reuse loop, not after it.
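For reference, a runnable sketch of the corrected layout. StubInstance is a hypothetical stand-in that mimics only the proposed resume_support API from the first comment; it is not libmuscle's Instance, and the integer "state" is just a placeholder:

```python
from typing import Any, List, Optional


class StubInstance:
    """Hypothetical stand-in for the proposed API; not libmuscle's Instance."""

    def __init__(self, runs: int, resume_state: Optional[Any] = None) -> None:
        self._runs_left = runs
        self._resume_state = resume_state
        self.suspended: List[Any] = []    # states handed to suspend()

    def reuse_instance(self) -> bool:
        self._runs_left -= 1
        return self._runs_left >= 0

    def resuming(self) -> bool:
        return self._resume_state is not None

    def resume(self) -> Any:
        state, self._resume_state = self._resume_state, None
        return state

    def receive(self, port: str) -> Any:
        return 0    # dummy input message

    def suspending(self) -> bool:
        return True    # this sketch is asked to suspend after every run

    def suspend(self, state: Any) -> None:
        self.suspended.append(state)


instance = StubInstance(runs=2, resume_state=5)

while instance.reuse_instance():
    if instance.resuming():
        state = instance.resume()              # set state from the saved message
    else:
        state = instance.receive('in_port')    # init state from inputs

    for i in range(3):
        state += 1                             # S: update state

    # O_f: send final results here as usual

    if instance.suspending():                  # now inside the reuse loop
        instance.suspend(state)
```

With the check inside the loop, the state is handed over once per run (here twice), rather than only once after the instance has finished all its runs.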

For checkpointing, what if you want to write a checkpoint halfway through? Maybe it should be allowed to call instance.suspend() anywhere? But then the save-the-state wrapper would have to act like an accumulator? Or maybe there'd be an accumulator in between, which keeps the most recent snapshot, and passes it back to the save-the-state wrapper if the micromodel crashes? Or if it reaches the end?

Maybe suspend/resume of stateful micromodels and checkpointing are actually not the same thing?

maarten-ic commented 1 year ago

This idea is (at least partially) implemented by the checkpointing functionality.

Checkpoints are currently stored to disk. A future extension could send the state over special muscle_ ports as well.