googleforgames / quilkin

Quilkin is a non-transparent UDP proxy specifically designed for use with large scale multiplayer dedicated game server deployments, to ensure security, access control, telemetry data, metrics and more.
Apache License 2.0

Evaluate parallel design for filter processing #411

Open XAMPPRocky opened 2 years ago

XAMPPRocky commented 2 years ago

We've been discussing over the past couple of months moving Quilkin's internal architecture from our current iterative model (one context is created per packet, and we run the context through the filter chain until we have the result) to a parallel model, where filters run over all available packets at once, with apparent order when components are shared (meaning that if `&mut X` is needed by both the first and second filter in the chain, the system will give access to the first filter, then the second, and so on).
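For readers less familiar with the current model, a minimal sketch of the iterative approach might look like this (all names here are hypothetical, not Quilkin's actual `Filter` API):

```rust
// Hypothetical sketch of the iterative model: one context is created per
// packet and passed through each filter in the chain in turn.
struct Context {
    contents: Vec<u8>,
}

trait Filter {
    // Returning None drops the packet and short-circuits the chain.
    fn read(&self, ctx: Context) -> Option<Context>;
}

struct FilterChain {
    filters: Vec<Box<dyn Filter>>,
}

impl FilterChain {
    fn process(&self, mut ctx: Context) -> Option<Context> {
        for filter in &self.filters {
            ctx = filter.read(ctx)?;
        }
        Some(ctx)
    }
}
```

The proposal below replaces this per-packet loop with systems that run over batches of packets.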

I initially proposed going with an actor model, however I now think that the best approach to making this parallel is to use an "Entity Component System" (ECS) to represent the processing pipeline. There are a number of benefits specific to ECS that I think we should consider over the main proposal.

Benefits

Drawbacks

Implementation Plan

If we go with this architecture, I believe there is an approach to implementing this that will help offset some of the drawbacks mentioned.

  1. Finish #371.
  2. Move each filter over to quilkin test in its own PR.
    • This will reduce the work of migrating some of the tests over to ECS, as the end-to-end tests don't rely on Quilkin internals.
  3. Create a new ecs branch, with the initial changes that work for a single filter.
    • Open a new draft PR with it for review.
    • Once reviewed, we'd migrate over the filter implementations as PRs to the ecs branch.
    • Once all filter implementations have been reviewed and merged in the ecs branch, we'd merge that into main.

Unresolved Questions

markmandel commented 2 years ago

This is super interesting, and I'm also finding the idea of applying ECS here quite a fascinating use case.

One thing I've never been 100% clear on:

To a parallel model, where filters run through all available packets at once, with apparent order

What benefit are we aiming at getting with parallel processing of Filters over a byte packet? Are we expecting higher throughput through this model? Something else?

My original theory was that if you have multiple packets flowing through the proxy simultaneously, there's no net performance benefit to being able to run Filters in parallel, since the system is already going to be using every core to run packets in parallel (as opposed to Filters). So there was no real performance benefit to be had, at least in that aspect -- but I could well be missing something here, or the aim of this design is to optimise something else (it does seem to me like it would reduce mutex locking?).

XAMPPRocky commented 2 years ago

What benefit are we aiming at getting with parallel processing of Filters over a byte packet? Are we expecting higher throughput through this model? Something else?

Well, we get parallelism mostly for free with all ECS frameworks. We'll of course have to measure for performance, but in the abstract, running in parallel should reduce work for larger filter chains by allowing earlier termination of the chain.

Imagine a filter chain of a compress filter followed by a rate limit filter. With parallelism, the rate limit filter can flag the packet for deletion while the compress filter is still running. Obviously you wouldn't want to put those two filters in that order, but it's an example of different filters in the chain having different amounts of work. So while it might provide minimal to no performance benefit for an already optimised configuration, it provides a more intuitive user experience and better performance for naive configurations, because there will be no performance difference between filter chains that order different amounts of work differently, e.g. (compress, rate-limit) versus (rate-limit, compress).
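The early-termination idea can be sketched in plain Rust with std threads (illustrative only -- no ECS crate, and none of these names are Quilkin's): a cheap filter flags packets for deletion while an expensive filter, running in parallel over the same batch, skips anything already flagged.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// Each packet carries a shared "dropped" flag that any filter may set.
struct Packet {
    contents: Vec<u8>,
    dropped: AtomicBool,
}

// Cheap filter: flags every packet past the budget for deletion.
fn rate_limit(packets: &[Packet], budget: usize) {
    for packet in packets.iter().skip(budget) {
        packet.dropped.store(true, Ordering::Relaxed);
    }
}

// Expensive filter: skips packets another filter has already dropped,
// returning how many packets it actually worked on.
fn compress(packets: &[Packet]) -> usize {
    let mut worked_on = 0;
    for packet in packets {
        if packet.dropped.load(Ordering::Relaxed) {
            continue; // the chain already terminated for this packet
        }
        let _ = packet.contents.len(); // stand-in for real compression work
        worked_on += 1;
    }
    worked_on
}

// Run both filters over the same batch in parallel.
fn process(packets: &[Packet], budget: usize) -> usize {
    thread::scope(|scope| {
        scope.spawn(|| rate_limit(packets, budget));
        let compressed = scope.spawn(|| compress(packets));
        compressed.join().unwrap()
    })
}
```

In the best case `compress` never touches packets the rate limiter rejects; in the worst case it does some wasted work, but never more than the sequential (compress, rate-limit) ordering would.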

markmandel commented 2 years ago

but in the abstract running in parallel should reduce work for larger filter chains by allowing earlier termination of the chain.

Aaah yes. That makes a lot of sense 👍🏻 Awesome, that clarifies things for me nicely. Thanks!

We should probably write some of our own benchmarks (though from review, it does sound like Sparse Storage in Bevy ECS seems like a winner -- and having pluggable storage seems like something we could tweak over time if we have specific needs based on our use cases). Bevy's documentation isn't as nice as legion's, but the docs.rs pages aren't too bad, so we can work it out.

One thing I'm trying to work out when digging into Bevy's ECS is how to manually run the series of systems over entities and components?

Also, for my own edification, how would the flow work within Quilkin? A traditional ECS would execute on each frame tick, and then match that tick to systems depending on how often they should run -- but we don't have a frame tick. It feels to me like each system should run as a packet is received (maybe???), but I'm not quite sure how that would work. What are your thoughts there?

XAMPPRocky commented 2 years ago

We should probably write some of our own benchmarks (though from review, it does sound like Sparse Storage in Bevy ECS seems like a winner -- and having pluggable storage seems like something we could tweak over time if we have specific needs based on our use cases). Bevy's documentation isn't as nice as legion's, but the docs.rs pages aren't too bad, so we can work it out.

One thing I'm trying to work out when digging into Bevy's ECS is how to manually run the series of systems over entities and components?

There’s more documentation on the master branch (a bunch was merged in recently). I’ve been talking with some of the people involved in making each of the frameworks, and right now I think we should be looking at shipyard. As it was explained to me, while Bevy’s pluggable storage provides sparse set storage, it doesn’t have the same properties as a fully sparse-set-backed ECS, namely being able to add/remove components and delete entities without blocking the world.

Also, for my own edification, how would the flow work within Quilkin? A traditional ECS would execute on each frame tick, and then match that tick to systems depending on how often they should run -- but we don't have a frame tick. It feels to me like each system should run as a packet is received (maybe???), but I'm not quite sure how that would work. What are your thoughts there?

Using shipyard’s API you would call world.run or world.run_with_data on receiving the packet, but I also want to point out that a tick-based approach would also work here without async; you would just have a much higher tick rate, because you’re not trying to match a display or a game simulation. It’s really no different from what tokio is doing under the hood at the moment, since the async I/O APIs are poll- rather than push-based anyway.
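As a rough illustration of the tick-based alternative (plain Rust, not shipyard's actual API, and all names are made up): packets queue up as they arrive, and each tick runs the systems over whatever has accumulated since the last tick.

```rust
use std::collections::VecDeque;

// Sketch of a tick-based pipeline: instead of running the systems once per
// packet, each tick drains the queue of packets received since the last
// tick and runs the filter systems over the whole batch.
struct World {
    inbox: VecDeque<Vec<u8>>,
    processed: Vec<Vec<u8>>,
}

impl World {
    fn new() -> Self {
        World { inbox: VecDeque::new(), processed: Vec::new() }
    }

    // Called from the network side as packets arrive.
    fn receive(&mut self, packet: Vec<u8>) {
        self.inbox.push_back(packet);
    }

    // One tick: run every system over all packets received so far.
    fn tick(&mut self) {
        while let Some(packet) = self.inbox.pop_front() {
            // Filter systems would run here; this sketch passes packets through.
            self.processed.push(packet);
        }
    }
}
```

With no display or simulation to pace against, the loop driving `tick` could simply spin as fast as packets arrive, which is what makes it comparable to tokio's polling.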

XAMPPRocky commented 2 years ago

Also worth mentioning as some prior art is VPP, which @gilesheron brought up in another thread. Just from reading the documentation, it's essentially what we'd be doing here AFAICT.

https://wiki.fd.io/view/VPP/What_is_VPP%3F

markmandel commented 2 years ago

Since we also have people that aren't familiar with ECS, this is the article series from which I first learned about Entity Component Systems: http://t-machine.org/index.php/2007/09/03/entity-systems-are-the-future-of-mmog-development-part-1/

Googling shows me there is a lot more content out there explaining the concept than back when I first started reading about the pattern 😄

This seems a bit more recent: https://ianjk.com/ecs-in-rust/ , and is likely easier to follow.

markmandel commented 2 years ago

Using shipyard’s API you would call world.run or world.run_with_data on receiving the packet

I saw that 👍🏻 Shipyard's documentation is really good 👍🏻 (I could never find the actual function in bevy-ecs to run the series of systems).

a tick based approach would also work here without async

Sounds like something we should test and compare. But sounds like a definite path through.

The other thing I'm thinking through: how do you see us splitting up Worlds? I don't think we can do one big World instance -- maybe we would do one (or several) worker Worlds for the local port, and then a World per Session? 🤔 This seems like something we might want to diagram up a bit before diving into code.

XAMPPRocky commented 2 years ago

This seems like something we might want to diagram up a bit before diving into code.

FWIW I'm still mostly using this diagram from #339 as my mental model.

(architecture diagram image from #339)

The other thing I'm thinking through - how do you see us splitting up Worlds? I don't think we can do one big World instance - maybe we would do one (or several workers Worlds) for the local port, and then a World per Session?

Well, other than for logical complexity, I do think we could do one world, because with sparse storage a mutable query on T would only block other systems querying T. So I think we should separate them logically and have two worlds (one for read, and one for write). That way we can have components that are shared between the two worlds but carry slightly different meanings based on the context. For example, you could have SocketAddr refer to the downstream client in read and to the upstream server in write (not saying we should do that specifically, just an example of what could be done).

markmandel commented 2 years ago

So I think we should separate them logically and have two worlds (one for read, and one for write)

That makes sense. I was also debating having one world, with a component that indicated whether something was read/write, so that Filters could batch both operations in a single loop, but I think your point re: components having different meanings depending on whether it is read or write makes a lot of sense 👍🏻

This also leans me towards a tick-based model, so that Filter operations can be batched in a loop, rather than an invocation for every packet.

The only other thing I still can't quite work out is that a world can only have one world.run happening at any given point in time. Do we want to be processing packets concurrently? Each World can process Filters in parallel within the ECS system, but that means we are only able to process one packet at a time for read, and the same for write, since each is only a single world.

🤔 Does it make sense to have multiple worlds and distribute packets among them, so each one also runs concurrently? Just thinking: if someone has fewer filters than the number of cores on a machine, it seems like we might be leaving some processing power behind?

WDYT?

XAMPPRocky commented 2 years ago

The only other thing I still can't quite work out is that a world can only have one world.run happening at any given point in time. Do we want to be processing packets concurrently? Each World can process Filters in parallel within the ECS system, but that means we are only able to process one packet at a time for read, and the same for write, since each is only a single world.

Are you sure about that? world.run's signature is &self, and in shipyard's documentation it says it will borrow whatever "View" you provide, which to me says that if you are sharing a world across multiple threads, as long as the views don't require exclusive access to a shared resource, they should run concurrently.

Either way though, I think the important step in having it be concurrent and parallel is separating the "processing" of the packets from packets entering and exiting. For example, we could have a Status component that determines the current state of the packet. When a packet enters the world it has the Fresh status, we then have a system that collects all of the fresh packets and marks them as Processing. The filter chain runs over all packets marked as Processing, either marking them as Drop or Pass, which determines whether the packet continues, and we have a final system that acts as the sink discarding Drop packets, and forwarding Pass packets.

```rust
enum Status {
    Fresh,
    Processing,
    Drop,
    Pass,
}
```
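A minimal sketch of that lifecycle as plain functions over a packet table (illustrative only; a real ECS would register these as systems, and Status is re-declared with derives so the sketch is self-contained):

```rust
#[derive(Clone, Copy, PartialEq)]
enum Status {
    Fresh,
    Processing,
    Drop,
    Pass,
}

struct PacketEntity {
    contents: Vec<u8>,
    status: Status,
}

// System 1: collect fresh packets and mark them as Processing.
fn intake(packets: &mut [PacketEntity]) {
    for packet in packets.iter_mut().filter(|p| p.status == Status::Fresh) {
        packet.status = Status::Processing;
    }
}

// System 2: the filter chain marks each Processing packet Pass or Drop.
// Here an empty payload stands in for a packet a real filter would reject.
fn filter_chain(packets: &mut [PacketEntity]) {
    for packet in packets.iter_mut().filter(|p| p.status == Status::Processing) {
        packet.status = if packet.contents.is_empty() {
            Status::Drop
        } else {
            Status::Pass
        };
    }
}

// System 3: the sink discards Drop packets and forwards Pass packets.
fn sink(packets: Vec<PacketEntity>) -> Vec<Vec<u8>> {
    packets
        .into_iter()
        .filter(|p| p.status == Status::Pass)
        .map(|p| p.contents)
        .collect()
}
```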

markmandel commented 2 years ago

Are you sure about that?

Not 100% 😄 This was an assumption based on how I've treated ECS systems in the past.

world.run's signature is &self, and in shipyard's documentation it says it will borrow whatever "View" you provide, which to me says that if you are sharing a world across multiple threads, as long as the views don't require exclusive access to a shared resource, they should run concurrently.

Ooh, that's a good point, and easy to test 👍🏻 SGTM.

markmandel commented 2 years ago

Oh to be clear, this looks good to me, and the plan for implementation looks good to me too. The only outstanding question is one of performance comparison, which we can test once we have the initial implementation in place.

@iffyio I know you said you weren't as familiar with ECS, did you have any outstanding thoughts/questions/etc?

iffyio commented 2 years ago

SGTM! nothing to add atm