kvark opened this issue 9 years ago
Just a general outline before we start chopping things down:
I can think of two different types of system ordering. One of them is a fixed order given by the user (e.g. `Vec<System>`), where the next system may need to wait if it conflicts with any systems that are already in flight.

@HeroesGrave, does your ECS support simultaneous processing? How do its rules fare against the model I described?
There are other ways, it seems. One that I particularly like is described on reddit:
The obvious benefit is having all systems run in parallel. Each component storage accumulates changes and then applies them (again in parallel with the other storages) after all the systems are done. One global change list also stores all the changes to the entity array itself.

The less obvious benefit is simplification of all the code. There is no need to specify at compile time which components are affected by which systems, and most of the implementation can live outside of macros. This makes it by far my best bet at accomplishing the subject task.
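A minimal sketch of that model as I understand it (all types here are hypothetical, nothing from an existing crate): systems only take `&World` and return the changes they want to make, and those changes are committed once everything has joined.

```rust
use std::thread;

// Hypothetical layout: plain vectors of components plus a change description.
struct World {
    positions: Vec<[f32; 3]>,
    velocities: Vec<[f32; 3]>,
}

enum Change {
    SetPosition(usize, [f32; 3]),
}

// A "system" only reads the world and records the writes it wants to make.
fn physics(world: &World) -> Vec<Change> {
    world
        .positions
        .iter()
        .zip(&world.velocities)
        .enumerate()
        .map(|(i, (p, v))| Change::SetPosition(i, [p[0] + v[0], p[1] + v[1], p[2] + v[2]]))
        .collect()
}

fn main() {
    let mut world = World {
        positions: vec![[0.0; 3]; 4],
        velocities: vec![[1.0, 0.0, 0.0]; 4],
    };

    // Phase 1: every system runs in parallel against `&World`.
    let changes: Vec<Vec<Change>> = thread::scope(|s| {
        let handles = vec![s.spawn(|| physics(&world))]; // more systems spawn here
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Phase 2: the accumulated changes are applied with exclusive access.
    for change in changes.into_iter().flatten() {
        match change {
            Change::SetPosition(i, p) => world.positions[i] = p,
        }
    }
}
```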
I found a similar issue for ecs-rs: https://github.com/HeroesGrave/ecs-rs/issues/10
@mtsr - do you have an example where a system would like to run on more than a single thread?
I'm slowly drifting towards the realization that the parallel story of ECS is just a sub-problem of generic work queues. In the simple case, a system is just a work unit, and its dependencies may be other systems. The ECS should just support change lists natively, so that processing them in parallel is possible.
A system is a work unit, but it might also have its own child work units that it depends on. That's mostly it, I think. That way a system can also split its work across threads; there are definitely use cases for that.
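For example (just a sketch with made-up names, using scoped threads), a physics system could fan its own update out over chunks of a component slice, and a scheduler could treat each chunk as a child work unit:

```rust
use std::thread;

// Hypothetical: integrate one component slice in parallel, chunk by chunk.
fn integrate(positions: &mut [f32], velocities: &[f32], dt: f32) {
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let chunk = (positions.len() / workers).max(1);
    thread::scope(|s| {
        for (ps, vs) in positions.chunks_mut(chunk).zip(velocities.chunks(chunk)) {
            s.spawn(move || {
                for (p, v) in ps.iter_mut().zip(vs) {
                    *p += v * dt; // each child unit owns a disjoint slice
                }
            });
        }
    });
}
```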
There are some other neat advantages to the changeset model. A good example is that changesets can be forked and fed into multiple components of the system. Collision detection and rendering both need to know when an object's position has changed, but each needs to represent it in a different way.
Rendering is actually a good example of where there is quite a bit to be gained from the changeset model. Updating a texture or a model from the changeset allows the renderer to touch only the data that was modified each frame. The alternative is the renderer running through the whole list of meshes and checking dirty flags; a changeset is, in effect, a list of dirty entities.
@csherratt your vision of changelists is quite an eye-opener. It does seem beneficial to have collision and rendering consume their own changelists. So is this how this may work?
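Something like this, perhaps (a toy sketch over plain channels; `PosChange` and the consumer bodies are made up):

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical change record; the consumers below are stand-ins.
#[derive(Clone, Copy)]
struct PosChange {
    entity: u32,
    pos: [f32; 3],
}

fn main() {
    let (to_collision, collision_rx) = mpsc::channel::<PosChange>();
    let (to_render, render_rx) = mpsc::channel::<PosChange>();

    // The producer forks its changelist: the same record goes to both consumers.
    let physics = thread::spawn(move || {
        for e in 0..4u32 {
            let change = PosChange { entity: e, pos: [e as f32, 0.0, 0.0] };
            to_collision.send(change).unwrap();
            to_render.send(change).unwrap();
        }
    });

    // Each consumer rebuilds the change into its own representation.
    let collision = thread::spawn(move || {
        for c in collision_rx {
            let _ = c; // e.g. move the entity's proxy in a broad-phase grid
        }
    });
    let render = thread::spawn(move || {
        for c in render_rx {
            let _ = c; // e.g. patch the entity's slot in an instance buffer
        }
    });

    physics.join().unwrap();
    collision.join().unwrap();
    render.join().unwrap();
}
```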
On the other hand, as @mtsr has mentioned on IRC, the changelist model seems to introduce latency. If we have a dependency path of N nodes (input -> physics -> rendering), then the output is only completed after N updates (with that three-stage chain, an input only reaches the screen three updates later), which is more than I can tolerate.
@kvark There are three ways I can see this architected. The first is that the pipeline is fed once per frame, and changes propagate one stage per frame update. This has the nice advantage of letting the pipeline run in parallel, but it sucks because every stage introduces a full frame of latency. This can be mitigated by increasing the framerate from 60 to 200 or so, since each stage is now running for 5ms instead of 16ms. The renderer could also throw out stale frames if it wanted to. I'm not a big fan of this architecture; it's what I first thought of in snowmew, and I have come to see the trade-offs as unacceptable.
The second idea is what I proposed in the snowstorm gist: basically, a pipeline where changes are propagated as they are created. Systems run when all their writers are done writing, and since a system is itself a writer, this propagates like a wave through the systems. This gets around the latency problem, since the updates all happen sequentially within one frame. The big issue with this type of system is that the worst case, where each system chains off the next, is going to be slower than if it were written as a single-threaded program. The other big trade-off is that this is expensive from a memory point of view: each system has its own copy of everything, plus everything exists in a channel somewhere.
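A toy model of that propagation (not the actual snowstorm code; plain channels stand in for its buffers): each system consumes its upstream channel and writes downstream as soon as it produces data, so an update ripples through within a single frame.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (input_tx, input_rx) = mpsc::channel::<f32>();
    let (physics_tx, physics_rx) = mpsc::channel::<f32>();

    let physics = thread::spawn(move || {
        // Starts consuming as soon as the input system starts writing.
        for sample in input_rx {
            physics_tx.send(sample * 2.0).unwrap();
        }
        // Dropping physics_tx closes the downstream channel: the "writer is
        // done" signal that lets the next system finish its frame.
    });

    let render = thread::spawn(move || {
        for state in physics_rx {
            println!("render {state}");
        }
    });

    for i in 0..3 {
        input_tx.send(i as f32).unwrap();
    }
    drop(input_tx); // end of frame for the input system

    physics.join().unwrap();
    render.join().unwrap();
}
```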
Lastly, the one based on the reddit link you provided: the ECS keeps all state in a central location, and all updates to that blob are committed after all the workers complete. The amendment I had is that updates don't need to have a 1:1 mapping with the component they are updating. The position update is a good example of this, since it is relevant to a few systems. What I propose is that a changeset could have a 1:N mapping, where multiple data structures are updated inside the central blob. Changing a position would then not only update the position table, but also update a BVH used for render culling. I think this architecture could still be pipelined, so that each system can fan out its updates to the central ECS. It's not a free lunch, of course: changeset creation and consumption take time.
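A rough sketch of that 1:N commit (hypothetical types; `Bvh` is a stand-in for whatever culling structure the renderer keeps):

```rust
// Hypothetical central blob; committing one change fans out to every
// structure that mirrors the data.
struct Bvh;
impl Bvh {
    fn refit(&mut self, entity: u32, pos: [f32; 3]) {
        // move the entity's leaf and refit parent bounds (omitted)
        let _ = (entity, pos);
    }
}

struct World {
    positions: Vec<[f32; 3]>,
    culling_bvh: Bvh,
}

enum Change {
    Position { entity: u32, pos: [f32; 3] },
}

impl World {
    fn commit(&mut self, change: Change) {
        match change {
            Change::Position { entity, pos } => {
                self.positions[entity as usize] = pos; // 1: the component table
                self.culling_bvh.refit(entity, pos);   // 2: the render-culling BVH
            }
        }
    }
}
```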
I really like the snowstorm approach. Lots of parallelism, change sets reduce what needs to be processed (like the creation and change events in EntityX enable, for example), and they allow for system-specific data structures. Together with a good scheduler that knows about frame deadlines (or at least prioritises work units belonging to the next frame), it could be as open as anything.
Dependency chains aren't a problem, since we pass the data through the whole chain every frame, although there might still be valid use cases that need an upstream system to process first (which incurs a frame delay).
The key issue is interesting, though, but I think there might be solutions. Since every entity processed by a system either comes from upstream or is created there, we just need uniqueness per system, tagging each entity with the system that created it. We still don't get nicely contiguous entities, or at best they are contiguous per creating system.
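Concretely, the key could just be a pair (a sketch, names made up): each system owns its own counter, so `(creator, local)` can never collide across systems without any global coordination.

```rust
// An entity key that is unique per creating system.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Entity {
    creator: u16, // id of the system that created the entity
    local: u32,   // index in that system's own sequence
}

struct SystemAllocator {
    creator: u16,
    next: u32,
}

impl SystemAllocator {
    fn create(&mut self) -> Entity {
        let e = Entity { creator: self.creator, local: self.next };
        self.next += 1;
        e
    }
}
```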
We might still look at something like Tharsis' design, which rewrites a packed array of everything every frame. Whether it should be grouped by system or by component or something else, I don't know.
@mtsr mentioned parallelism in #2. Current systems do not allow that.
I have no idea how this is supposed to work, and I need to do more research to see clearer. Perhaps someone could enlighten me? I do understand that component arrays can be moved between threads if needed, but I'm not sure how to resolve access conflicts between systems in this case. Also, the COW (copy-on-write) idiom seems to be related to the problem.

When there is a big challenge, there is also a great opportunity. I hope we can come up with a good architecture for parallel systems: they are crucial for ECS scalability and for proving Rust worthy in game-dev.