googleforgames / quilkin

Quilkin is a non-transparent UDP proxy specifically designed for use with large scale multiplayer dedicated game server deployments, to ensure security, access control, telemetry data, metrics and more.
Apache License 2.0

Evaluate parallel design for filter processing #411

Open XAMPPRocky opened 2 years ago

XAMPPRocky commented 2 years ago

We've been discussing over the past couple of months moving Quilkin's internal architecture from our current iterative model (one context is created per packet, and we run the context through the filter chain until we have the result) to a parallel model, where filters run over all available packets at once, with apparent order when components are shared (meaning that if `&mut X` is needed by both the first and second filter in the chain, the system will give access to the first filter, then the second, and so on).
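For readers less familiar with the current model, a minimal sketch of the iterative approach might look like this (all names here are hypothetical, not Quilkin's actual `Filter` API):

```rust
// Hypothetical sketch of the iterative model: one context is created per
// packet and passed through each filter in the chain in turn.
struct Context {
    contents: Vec<u8>,
}

trait Filter {
    // Returning None drops the packet and short-circuits the chain.
    fn read(&self, ctx: Context) -> Option<Context>;
}

struct FilterChain {
    filters: Vec<Box<dyn Filter>>,
}

impl FilterChain {
    fn process(&self, mut ctx: Context) -> Option<Context> {
        for filter in &self.filters {
            ctx = filter.read(ctx)?;
        }
        Some(ctx)
    }
}
```

The proposal below replaces this per-packet loop with systems that run over batches of packets.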

I initially proposed going with an actor model, however I now think that the best approach to making this parallel is to use an "Entity Component System" (ECS) to represent the processing pipeline. There are a number of benefits specific to ECS that I think we should consider over the main proposal.

Benefits

Drawbacks

Implementation Plan

If we go with this architecture, I believe there is an approach to implementing this that will help offset some of the drawbacks mentioned.

  1. Finish #371.
  2. Move each filter over to quilkin test in its own PR.
    • This will reduce the work of migrating some of the tests over to ECS, as the end-to-end tests don't rely on Quilkin internals.
  3. Create a new ecs branch, with the initial changes that work for a single filter.
    • Open a new draft PR with it for review.
    • Once reviewed, we'd migrate over the filter implementations as PRs to the ecs branch.
    • Once all filter implementations have been reviewed and merged in the ecs branch, we'd merge that into main.

Unresolved Questions

markmandel commented 2 years ago

This is super interesting, and I'm also finding the idea of applying ECS here quite a fascinating use case.

One thing I've never been 100% clear on:

To a parallel model, where filters run through all available packets at once, with apparent order

What benefit are we aiming at getting with parallel processing of Filters over a byte packet? Are we expecting higher throughput through this model? Something else?

My original theory was that if you have multiple packets flowing through the proxy simultaneously, there's no net performance benefit to being able to run Filters in parallel, since the system is already going to be using every core to run packets in parallel (as opposed to Filters). So there was no real performance benefit to be had, at least in that aspect -- but I could well be missing something here, or the aim of this design is to optimise something else (it does seem to me like it would reduce mutex locking?).

XAMPPRocky commented 2 years ago

What benefit are we aiming at getting with parallel processing of Filters over a byte packet? Are we expecting higher throughput through this model? Something else?

Well, we get parallelism mostly for free with all ECS frameworks. We'll of course have to measure for performance, but in the abstract, running in parallel should reduce work for larger filter chains by allowing earlier termination of the chain.

Imagine a filter chain of a compress filter followed by a rate limit filter. With parallelism, the rate limit filter can flag the packet for deletion while the compress filter is still running. Obviously you wouldn't want to put those two filters in that order, but it's an example of different filters in the chain having different amounts of work. So while it might provide minimal to no performance benefit for an already optimised configuration, it provides a more intuitive user experience and better performance for naive configurations, because there will be no performance difference between filter chains that order different amounts of work differently, e.g. (compress, rate-limit) versus (rate-limit, compress).
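The early-termination idea can be sketched in plain Rust with std threads (illustrative only -- no ECS crate, and none of these names are Quilkin's): a cheap filter flags packets for deletion while an expensive filter, running in parallel over the same batch, skips anything already flagged.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// Each packet carries a shared "dropped" flag that any filter may set.
struct Packet {
    contents: Vec<u8>,
    dropped: AtomicBool,
}

// Cheap filter: flags every packet past the budget for deletion.
fn rate_limit(packets: &[Packet], budget: usize) {
    for packet in packets.iter().skip(budget) {
        packet.dropped.store(true, Ordering::Relaxed);
    }
}

// Expensive filter: skips packets another filter has already dropped,
// returning how many packets it actually worked on.
fn compress(packets: &[Packet]) -> usize {
    let mut worked_on = 0;
    for packet in packets {
        if packet.dropped.load(Ordering::Relaxed) {
            continue; // the chain already terminated for this packet
        }
        let _ = packet.contents.len(); // stand-in for real compression work
        worked_on += 1;
    }
    worked_on
}

// Run both filters over the same batch in parallel.
fn process(packets: &[Packet], budget: usize) -> usize {
    thread::scope(|scope| {
        scope.spawn(|| rate_limit(packets, budget));
        let compressed = scope.spawn(|| compress(packets));
        compressed.join().unwrap()
    })
}
```

In the best case `compress` never touches packets the rate limiter rejects; in the worst case it does some wasted work, but never more than the sequential (compress, rate-limit) ordering would.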

markmandel commented 2 years ago

but in the abstract running in parallel should reduce work for larger filter chains by allowing earlier termination of the chain.

Aaah yes. That makes a lot of sense 👍🏻 Awesome, that clarifies things for me nicely. Thanks!

We should probably write some of our own benchmarks (though from review, it does sound like Sparse Storage in Bevy ECS seems like a winner -- and having pluggable storage seems like something we could tweak over time if we have specific needs based on our use cases). Bevy's documentation isn't as nice as legion's, but the docs.rs pages aren't too bad, so we can work it out.

One thing I'm trying to work out when digging into Bevy's ECS is how to manually run the series of systems over entities and components?

Also, for my own edification, how would the flow work within Quilkin? A traditional ECS would execute on each frame tick, and then match that tick to systems depending on how often they should run -- but we don't have a frame tick. It feels to me like each system should run as a packet is received (maybe???), but I'm not quite sure how that would work. What are your thoughts there?

XAMPPRocky commented 2 years ago

We should probably write some of our own benchmarks (though from review, it does sound like Sparse Storage in Bevy ECS seems like a winner -- and having pluggable storage seems like something we could tweak over time if we have specific needs based on our use cases). Bevy's documentation isn't as nice as legion's, but the docs.rs pages aren't too bad, so we can work it out.

One thing I'm trying to work out when digging into Bevy's ECS is how to manually run the series of systems over entities and components?

There’s more documentation on the master branch (a bunch was merged in recently). I’ve been talking with some of the people involved in making each of the frameworks, and right now I think we should be looking at shipyard. As it was explained to me, while Bevy’s pluggable storage provides sparse set storage, it doesn’t have the same properties as a fully sparse-set-backed ECS, namely being able to add/remove components and delete entities without blocking the world.

Also, for my own edification, how would the flow work within Quilkin? A traditional ECS would execute on each frame tick, and then match that tick to systems depending on how often they should run -- but we don't have a frame tick. It feels to me like each system should run as a packet is received (maybe???), but I'm not quite sure how that would work. What are your thoughts there?

Using shipyard’s API you would call world.run or world.run_with_data on receiving the packet, but I also want to point out that a tick-based approach would also work here without async; you would just have a much higher tick rate, because you’re not trying to match a display or a game simulation. It’s really no different from what tokio is doing under the hood at the moment, since the async I/O APIs are poll- rather than push-based anyway.
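As a rough illustration of the tick-based alternative (plain Rust, not shipyard's actual API, and all names are made up): packets queue up as they arrive, and each tick runs the systems over whatever has accumulated since the last tick.

```rust
use std::collections::VecDeque;

// Sketch of a tick-based pipeline: instead of running the systems once per
// packet, each tick drains the queue of packets received since the last
// tick and runs the filter systems over the whole batch.
struct World {
    inbox: VecDeque<Vec<u8>>,
    processed: Vec<Vec<u8>>,
}

impl World {
    fn new() -> Self {
        World { inbox: VecDeque::new(), processed: Vec::new() }
    }

    // Called from the network side as packets arrive.
    fn receive(&mut self, packet: Vec<u8>) {
        self.inbox.push_back(packet);
    }

    // One tick: run every system over all packets received so far.
    fn tick(&mut self) {
        while let Some(packet) = self.inbox.pop_front() {
            // Filter systems would run here; this sketch passes packets through.
            self.processed.push(packet);
        }
    }
}
```

With no display or simulation to pace against, the loop driving `tick` could simply spin as fast as packets arrive, which is what makes it comparable to tokio's polling.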

XAMPPRocky commented 2 years ago

Also worth mentioning as some prior art is VPP, which @gilesheron brought up in another thread. Just from reading the documentation, it's essentially what we'd be doing here AFAICT.

https://wiki.fd.io/view/VPP/What_is_VPP%3F

markmandel commented 2 years ago

Since we also have people that aren't familiar with ECS, this is the article series from which I first learned about Entity Component Systems: http://t-machine.org/index.php/2007/09/03/entity-systems-are-the-future-of-mmog-development-part-1/

Googling shows me there is a lot more content out there explaining the concept than back when I first started reading about the pattern 😄

This seems a bit more recent: https://ianjk.com/ecs-in-rust/ , and is likely easier to follow.

markmandel commented 2 years ago

Using shipyard’s API you would call world.run or world.run_with_data on receiving the packet

I saw that 👍🏻 Shipyard's documentation is really good 👍🏻 (I could never find the actual function in bevy-ecs to run the series of systems).

a tick based approach would also work here without async

Sounds like something we should test and compare. But sounds like a definite path through.

The other thing I'm thinking through: how do you see us splitting up Worlds? I don't think we can do one big World instance -- maybe we would do one (or several) worker Worlds for the local port, and then a World per Session? 🤔 This seems like something we might want to diagram up a bit before diving into code.

XAMPPRocky commented 2 years ago

This seems like something we might want to diagram up a bit before diving into code.

FWIW I'm still mostly using this diagram from #339 as my mental model.

(architecture diagram image from #339)

The other thing I'm thinking through - how do you see us splitting up Worlds? I don't think we can do one big World instance - maybe we would do one (or several workers Worlds) for the local port, and then a World per Session?

Well, other than for logical complexity, I do think we could do one world, because with sparse storage a mutable query on T would only block other systems querying T. So I think we should separate them logically and have two worlds (one for read, and one for write). That way we can have components that are shared between the two worlds but carry slightly different meanings based on the context. For example, you could have SocketAddr refer to the downstream client in read and to the upstream server in write (not saying we should do that specifically, just an example of what could be done).

markmandel commented 2 years ago

So I think we should separate them logically and have two worlds (one for read, and one for write)

That makes sense. I was also debating having one world, with a component that indicated whether something was read/write, so that Filters could batch both operations in a single loop, but I think your point re: components having different meanings depending on whether it is read or write makes a lot of sense 👍🏻

This also leans me towards a tick-based model, so that Filter operations can be batched in a loop, rather than an invocation for every packet.

The only other thing I still can't quite work out is that a world can only have one world.run happening at any given point in time. Do we want to be processing packets concurrently? Each World can process Filters in parallel within the ECS system, but that means we are only able to process one packet at a time for read, and the same for write, since each is only a single world.

🤔 Does it make sense to have multiple worlds and distribute packets among them, so each one also runs concurrently? Just thinking: if someone has fewer filters than the number of cores on a machine, it seems like we might be leaving some processing power behind?

WDYT?

XAMPPRocky commented 2 years ago

The only other thing I still can't quite work out is that a world can only have one world.run happening at any given point in time. Do we want to be processing packets concurrently? Each World can process Filters in parallel within the ECS system, but that means we are only able to process one packet at a time for read, and the same for write, since each is only a single world.

Are you sure about that? world.run's signature is &self, and in shipyard's documentation it says it will borrow whatever "View" you provide, which to me says that if you are sharing a world across multiple threads, as long as the views don't require exclusive access to a shared resource, they should run concurrently.

Either way though, I think the important step in having it be concurrent and parallel is separating the "processing" of the packets from packets entering and exiting. For example, we could have a Status component that determines the current state of the packet. When a packet enters the world it has the Fresh status, we then have a system that collects all of the fresh packets and marks them as Processing. The filter chain runs over all packets marked as Processing, either marking them as Drop or Pass, which determines whether the packet continues, and we have a final system that acts as the sink discarding Drop packets, and forwarding Pass packets.

```rust
enum Status {
    Fresh,
    Processing,
    Drop,
    Pass,
}
```
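A minimal sketch of that lifecycle as plain functions over a packet table (illustrative only; a real ECS would register these as systems, and Status is re-declared with derives so the sketch is self-contained):

```rust
#[derive(Clone, Copy, PartialEq)]
enum Status {
    Fresh,
    Processing,
    Drop,
    Pass,
}

struct PacketEntity {
    contents: Vec<u8>,
    status: Status,
}

// System 1: collect fresh packets and mark them as Processing.
fn intake(packets: &mut [PacketEntity]) {
    for packet in packets.iter_mut().filter(|p| p.status == Status::Fresh) {
        packet.status = Status::Processing;
    }
}

// System 2: the filter chain marks each Processing packet Pass or Drop.
// Here an empty payload stands in for a packet a real filter would reject.
fn filter_chain(packets: &mut [PacketEntity]) {
    for packet in packets.iter_mut().filter(|p| p.status == Status::Processing) {
        packet.status = if packet.contents.is_empty() {
            Status::Drop
        } else {
            Status::Pass
        };
    }
}

// System 3: the sink discards Drop packets and forwards Pass packets.
fn sink(packets: Vec<PacketEntity>) -> Vec<Vec<u8>> {
    packets
        .into_iter()
        .filter(|p| p.status == Status::Pass)
        .map(|p| p.contents)
        .collect()
}
```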

markmandel commented 2 years ago

Are you sure about that?

Not 100% 😄 This was an assumption based on how I've treated ECS systems in the past.

world.run's signature is &self, and in shipyard's documentation it says it will borrow whatever "View" you provide, which to me says that if you are sharing a world across multiple threads, as long as the views don't require exclusive access to a shared resource, they should run concurrently.

Ooh, that's a good point, and easy to test 👍🏻 SGTM.

markmandel commented 2 years ago

Oh to be clear, this looks good to me, and the plan for implementation looks good to me too. The only outstanding question is one of performance comparison, which we can test once we have the initial implementation in place.

@iffyio I know you said you weren't as familiar with ECS, did you have any outstanding thoughts/questions/etc?

iffyio commented 2 years ago

SGTM! nothing to add atm