Proposal: Improved Handling of Parallel Processes

U8NWXD commented 2 years ago

Current Approach

Message Passing

When a process has Process.parallel set to true, we launch a parallel OS process that runs vivarium.core.process._run_update(). run_update() contains a loop that repeatedly receives (interval, state) tuples from a pipe, passes those arguments to Process.next_update(), and sends the result back through the pipe. To stop the parallel process, we pass interval=-1.

Tracking in Main OS Process

In the main OS process (which contains Engine), we store ParallelProcess instances in Engine.parallel. Then when we need an update from a process, we call Engine._invoke_process, which does the following:

If the process is parallel, starts the ParallelProcess computing the update and returns it.
If the process is non-parallel, creates and returns an instance of InvokeProcess, which has an interface similar to ParallelProcess but computes the update immediately. Note that there's an extra invoke_process function that computes the process's update, but this extra level of indirection appears unnecessary.

Problems

The way we currently track parallel processes has a number of downsides:

We store ParallelProcess instances in Engine.parallel, but we also store the original Process instances in the store hierarchy (with a reference in self.processes).
1. This is unnecessary and wastes memory, especially for processes with large parameter dictionaries or internal states.
2. Having the original Process instances in the store hierarchy is confusing. A user can read out the internal state of those processes with no problem, but they're getting the state from a process that hasn't changed since the beginning of the simulation, which is not intuitive.
There's no way to retrieve any information from a process besides its next update. This is a problem when we want to read out a process's internal state (e.g. for debugging or when working with inner simulations).
The extra levels of indirection are confusing. Every time I work on this, I have to trace through the code again to remind myself what all the "invoke" things do.

Proposed Solution

Eliminate the extra invoke_process function that doesn't appear to do anything.
Instead of storing ParallelProcess instances in self.parallel, put them directly into the store hierarchy with references in self.processes. Once a parallel process has been put on its own OS process, there should be no copies of it left in the main OS process.
Systematize message-passing between the main and parallel OS processes as commands with arguments. Process will have a run_command() method that accepts a command name (string) and a tuple of arguments. The _run_update() function would handle the following commands:
1. _halt: Ignores arguments and shuts down the parallel process (equivalent of passing interval=-1 currently)
2. _next_update: Takes (interval, state) as arguments and passes them to Process.next_update()
Process authors can override run_command() to handle more commands, e.g. to return internal state variables.

ParallelProcess will also provide run_command(), but instead of running commands itself, it will pass those commands through the pipe to its child process's _run_update() function, which will in turn pass them to Process.run_command().

I've started implementing this in #198

I think this proposal addresses all the problems described above. However, it brings some new downsides:

Processes in the store hierarchy will not all be Process instances anymore. Some will be ParallelProcess instances. We don't want to make ParallelProcess inherit from Process because it doesn't know enough to do things like generate composites, and adding those missing capabilities is overkill.
This would change the behavior of public functions. I think this is really more of a bug fix than a breaking change since the way we currently store Process instances in the store hierarchy is very counter-intuitive. The biggest breaking change is probably removing Engine.parallel, but I doubt anyone relies on that.

prismofeverything commented 2 years ago

Hey @U8NWXD, this is a great proposal. The existing parallel implementation was really just the first step I had in mind and more an expedient for getting large(r) colonies, I'm glad you're revisiting this.

I think the key you mention is making a kind of "process protocol", what you are calling Process Commands. This leads into supporting a general rpc-like api for querying and interacting with processes which could be the basis for a distributed api where each process (or engine) can run on it's own machine, coordinating through this api.

I think the protocol (let's call it a protocol) would involve the functions you mention, and in addition some more members of the Process interface, namely:

calculate_timestep
ports_schema
next_update (currently the only one supported)
end (you called it halt, it is called end in Engine.... either way)
inspect (retrieve any relevant internal state report)

Maybe others? I think supporting __init__ would be great too, but that seems more like an Engine method which would have the ability to "allocate" new process instances.

I've already been plotting a bit along these lines so could be worthwhile discussing IRL. Or we can hash it all out here and in your other PR.

U8NWXD commented 2 years ago

I have a rough implementation in #198 that passes our existing tests. I still need to do some more comprehensive testing though. I decided to support all of the existing Process methods in the process commands protocol, which lets us make ParallelProcess inherit from Process. I think that will make the implementation in Engine cleaner: there will be lots of Process instances in the store hierarchy. Some of them will happen to be ParallelProcess instances, but Engine will, for the most part, be able to treat them as normal Process objects

eagmon commented 2 years ago

I like the proposed "Process Commands" -- agreed with Ryan that this will help beyond simplifying parallel processes in that it declares a general protocol. I also find invoke_process confusing, and have to trace through the code every time.

It looks like you have worked through some of the listed downsides on the draft PR? ParallelProcess is now inheriting from Process? This seems correct.

That would be great to keep ParallelProcesses in the store hierarchy, and in Engine.processes -- a nice simplification that will save a lot of confusion. You can remove Engine.parallel without breaking any of the projects that I have helped with, which is most of them. So this won't be a problem.

So yes, this proposal looks good!

vivarium-collective / vivarium-core