XAMPPRocky opened this issue 3 years ago
This is super interesting, and I'm also finding the idea of applying ECS here a quite fascinating use case.
One thing I've never been 100% clear on:
To a parallel model, where filters run through all available packets at once, with apparent order
What benefit are we aiming at getting with parallel processing of Filters over a byte packet? Are we expecting higher throughput through this model? Something else?
My original theory was that if you have multiple packets flowing through the proxy simultaneously, there's no net performance benefit to being able to run Filters in parallel, because the system is already using every core to process packets in parallel (as opposed to Filters). But I could well be missing something here, or the aim of this design is to optimise for something else (it does seem to me like it would reduce mutex locking?).
What benefit are we aiming at getting with parallel processing of Filters over a byte packet? Are we expecting higher throughput through this model? Something else?
Well, we get parallelism mostly for free with all ECS frameworks. We'll of course have to measure for performance, but in the abstract, running in parallel should reduce work for larger filter chains by allowing earlier termination of the chain.
Imagine a filter chain of a compress filter followed by a rate limit filter. With parallelism, the rate limit filter can flag the packet for deletion while the compress filter is still running. Obviously you wouldn't want to put those two filters in that order, but it's an example of different filters in the chain having different amounts of work. So while it might provide minimal to no performance benefit for an already optimised configuration, it provides a more intuitive user experience and better performance for naive configurations, because there will be no performance difference between filter chains that order the same work differently, e.g. (compress, rate-limit) and (rate-limit, compress).
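To make the early-termination idea concrete, here is a toy sketch in plain Rust (no ECS framework; the filter names, the drop-flag mechanism, and `run_chain` are all illustrative, not Quilkin code): a cheap rate-limit "filter" flags the packet for deletion while an expensive compress "filter" is mid-work, and the compress filter bails out between chunks as soon as it notices the flag.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Runs the two "filters" in parallel and returns how many chunks the
// compress filter actually processed before noticing the drop flag.
fn run_chain(total_chunks: usize) -> usize {
    let drop_flag = Arc::new(AtomicBool::new(false));

    // "Rate limit" filter: decides almost immediately to drop the packet.
    let flag = Arc::clone(&drop_flag);
    let rate_limit = thread::spawn(move || {
        flag.store(true, Ordering::SeqCst); // flag the packet for deletion
    });

    // "Compress" filter: works in chunks, checking the flag between chunks
    // so it can terminate early once the packet is flagged.
    let flag = Arc::clone(&drop_flag);
    let compress = thread::spawn(move || {
        let mut chunks_done = 0;
        for _ in 0..total_chunks {
            if flag.load(Ordering::SeqCst) {
                break; // packet was dropped, stop compressing
            }
            chunks_done += 1;
            thread::yield_now(); // stand-in for real compression work
        }
        chunks_done
    });

    rate_limit.join().unwrap();
    compress.join().unwrap()
}

fn main() {
    let done = run_chain(1_000);
    // Early termination means the compress filter usually does far less
    // than the full amount of work.
    assert!(done <= 1_000);
    println!("compress filter processed {done} of 1000 chunks");
}
```

How much work is skipped depends on thread scheduling, which is exactly the point: the chain's cost no longer depends on the order the filters were listed in.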
but in the abstract running in parallel should reduce work for larger filter chains by allowing earlier termination of the chain.
Aaah yes. That makes a lot of sense 👍🏻 Awesome, that clarifies things for me nicely. Thanks!
We should probably write some of our own benchmarks (but from review, it does sound like Sparse Storage in Bevy ECS seems like a winner, and having pluggable storage seems like something that we could tweak over time if we have specific needs of our own based on our use cases). Bevy's documentation isn't as nice as legion's, but the docs.rs pages are good enough that we can work it out.
One thing I'm trying to work out when digging into Bevy's ECS is how to manually run the series of systems over entities and components?
Also, for my own edification, how would the flow work within Quilkin? A traditional ECS would execute on each frame tick, and then match that tick to systems depending on how often they should run. But we don't have a frame tick, so it feels to me like each system should run as a packet is received (maybe?). I'm not quite sure how that would work. What are your thoughts there?
We should probably write some of our own benchmarks (but from review, it does sound like Sparse Storage in Bevy ECS seems like a winner, and having pluggable storage seems like something that we could tweak over time if we have specific needs of our own based on our use cases). Bevy's documentation isn't as nice as legion's, but the docs.rs pages are good enough that we can work it out.
One thing I'm trying to work out when digging into Bevy's ECS is how to manually run the series of systems over entities and components?
There’s more documentation on the master branch (a bunch was merged in recently). I’ve been talking with some of the people involved in making each of the frameworks, and right now I think we should be looking at shipyard. As was explained to me, while Bevy’s pluggable storage provides sparse set storage, it doesn’t have the same properties as a fully sparse set backed ECS, namely being able to add/remove components and delete entities without blocking the world.
One thing I'm trying to work out when digging into Bevy's ECS is how to manually run the series of systems over entities and components?
Also, for my own edification, how would the flow work within Quilkin? A traditional ECS would execute on each frame tick, and then match that tick to systems depending on how often they should run. But we don't have a frame tick, so it feels to me like each system should run as a packet is received (maybe?). I'm not quite sure how that would work. What are your thoughts there?
Using shipyard’s API you would call world.run or world.run_with_data on receiving the packet, but I also want to point out that a tick based approach would also work here without async; you would just have a much higher tick rate because you’re not trying to match a display or a game simulation. It’s really no different from what tokio is doing under the hood at the moment, since the async I/O APIs are poll rather than push based anyway.
Also worth mentioning as some prior art is VPP, which @gilesheron brought up in another thread. Just from reading the documentation, it's essentially what we'd be doing here AFAICT.
Since we also have people that aren't familiar with ECS, this is one of the article series from which I first learned about Entity Component Systems: http://t-machine.org/index.php/2007/09/03/entity-systems-are-the-future-of-mmog-development-part-1/
Googling shows there is a lot more content out there explaining the concept from back when I first started reading about the pattern 😄
This seems a bit more recent: https://ianjk.com/ecs-in-rust/ , and is likely easier to follow.
Using shipyard’s API you would call world.run or world.run_with_data on receiving the packet
I saw that 👍🏻 Shipyard's documentation is really good 👍🏻 (I could never find the actual function in bevy-ecs to run the series of systems.)
a tick based approach would also work here without async
Sounds like something we should test and compare, but it sounds like a definite path forward.
The other thing I'm thinking through - how do you see us splitting up Worlds? I don't think we can do one big World instance - maybe we would do one (or several worker Worlds) for the local port, and then a World per Session? 🤔 This seems like something we might want to diagram up a bit before diving into code.
This seems like something we might want to diagram up a bit before diving into code.
FWIW I'm still mostly using this diagram from #339 as my mental model.
The other thing I'm thinking through - how do you see us splitting up Worlds? I don't think we can do one big World instance - maybe we would do one (or several workers Worlds) for the local port, and then a World per Session?
Well, other than for logical complexity, I do think we could do one world, because with sparse storage a mutable query on T would only block other systems querying T. So I think we should separate them logically and have two worlds (one for read, and one for write). That way we can have components that are shared between the two worlds that could have slightly different meanings based on the context. For example, you could have SocketAddr refer to the downstream clients in read and refer to the upstream server in write (not saying we should do that specifically, just an example of what could be done).
So I think we should separate them logically and have two worlds (one for read, and one for write)
That makes sense. I was also debating having one world, with a component that indicated whether something was read/write, so that Filters could batch both operations in a single loop. But I think your point re: components having different meanings depending on whether it is read or write makes a lot of sense 👍🏻
This also leans me towards a tick based model, so that Filter operations can be batched in a loop, rather than an invocation for every packet.
The only other thing I still can't quite work out is that a world can only have one world.run happening at any given point in time. Do we want to be processing packets concurrently? Each World can process Filters in parallel within the ECS system, but that means we are only able to process one packet at a time for read, and the same for write, since they are each only a single world.
🤔 Does it make sense to have multiple worlds and distribute packets among them so each one also runs concurrently? Just thinking that if someone has fewer filters than the number of cores on a machine, it seems like we might be leaving some processing power behind?
WDYT?
The only other thing I still can't quite work out is that a world can only have one world.run happening at any given point in time. Do we want to be processing packets concurrently? Each World can process Filters in parallel within the ECS system, but that means we are only able to process one packet at a time for read, and the same for write, since they are each only a single world.
Are you sure about that? world.run's signature is &self, and in shipyard's documentation it says it will borrow whatever "View" you provide, which to me says that if you are sharing a world across multiple threads, as long as the views don't require exclusive access on a shared resource, they should run concurrently.
Either way though, I think the important step in making it concurrent and parallel is separating the "processing" of the packets from packets entering and exiting. For example, we could have a Status component that determines the current state of the packet. When a packet enters the world it has the Fresh status; we then have a system that collects all of the fresh packets and marks them as Processing. The filter chain runs over all packets marked as Processing, marking each as either Drop or Pass, which determines whether the packet continues, and we have a final system that acts as the sink, discarding Drop packets and forwarding Pass packets.
```rust
enum Status {
    Fresh,
    Processing,
    Drop,
    Pass,
}
```
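The three stages described above can be sketched end to end in plain Rust (single-threaded for clarity; in the real design each function would be an ECS system running over the world, and the `Packet` struct and toy drop rule are invented for illustration):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Status {
    Fresh,
    Processing,
    Drop,
    Pass,
}

struct Packet {
    payload: Vec<u8>,
    status: Status,
}

// System 1: collect all fresh packets and mark them as Processing.
fn mark_processing(packets: &mut [Packet]) {
    for p in packets.iter_mut().filter(|p| p.status == Status::Fresh) {
        p.status = Status::Processing;
    }
}

// System 2: the filter chain, run over everything marked Processing.
// Toy rule: drop packets with empty payloads.
fn filter_chain(packets: &mut [Packet]) {
    for p in packets.iter_mut().filter(|p| p.status == Status::Processing) {
        p.status = if p.payload.is_empty() {
            Status::Drop
        } else {
            Status::Pass
        };
    }
}

// System 3: the sink, discarding Drop packets and forwarding Pass packets.
fn sink(packets: Vec<Packet>) -> Vec<Packet> {
    packets
        .into_iter()
        .filter(|p| p.status == Status::Pass)
        .collect()
}

fn main() {
    let mut packets = vec![
        Packet { payload: vec![1, 2], status: Status::Fresh },
        Packet { payload: vec![], status: Status::Fresh },
    ];
    mark_processing(&mut packets);
    filter_chain(&mut packets);
    let forwarded = sink(packets);
    assert_eq!(forwarded.len(), 1); // the empty packet was discarded
}
```

Because each stage only selects packets in a particular state, ingress can keep adding Fresh packets while the chain works on the Processing batch, which is exactly the decoupling being proposed.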
Are you sure about that? Not 100% 😄 This was an assumption based on how I've treated ECS systems in the past.
world.run's signature is &self, and in shipyard's documentation it says it will borrow whatever "View" you provide, which to me says that if you are sharing a world across multiple threads, as long as the views don't require exclusive access on a shared resource, they should run concurrently.
Ooh, that's a good point, and easy to test 👍🏻 SGTM.
Oh to be clear, this looks good to me, and the plan for implementation looks good to me too. The only outstanding question is one of performance comparison, which we can test once we have the initial implementation in place.
@iffyio I know you said you weren't as familiar with ECS, did you have any outstanding thoughts/questions/etc?
SGTM! nothing to add atm
We've been discussing over the past couple of months moving Quilkin's internal architecture from our current iterative model (one context is created per packet, and we run the context through the filter chains until we have the result) to a parallel model, where filters run through all available packets at once, with apparent order (meaning if &mut X is needed by both the first and second filter in the chain, the system will give access to the first filter, then the second, etc.) when components are shared.
I initially proposed going with an actor model, however I now think that the best approach to make this parallel is to use an "Entity Component System" (ECS) to represent the processing pipeline. There are a number of benefits that are specific to ECS that I think we should consider as the main proposal.
Benefits
- No more ReadContext and WriteContext structs. Instead we'd have Read and Write "worlds" which contain the context as components associated with each entity which is currently "in-flight".
- No more Arc<RwLock<T>>: each filter would declare a query of what component types they want access to with mutability rules, and the ECS framework will resolve the order for us. This also resolves #104 without needing a new trait, as filters whose queries don't conflict will just run in parallel.
- No more dynamic_metadata abstraction (closing #305), as filters could insert new types as components, and other filters could query for those filter specific types, rather than needing to do a lookup and cast from dyn Any.
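The last point can be illustrated in plain Rust (this is not Quilkin's actual dynamic_metadata API; the `TokenBucket` type, map key, and the "ECS style" storage are invented for the comparison): typed components replace a string lookup plus a downcast from `dyn Any`.

```rust
use std::any::Any;
use std::collections::HashMap;

// A filter-specific type that another filter wants to read.
#[derive(Debug, PartialEq)]
struct TokenBucket {
    remaining: u32,
}

fn main() {
    // dynamic_metadata style: a stringly-keyed map of type-erased values.
    let mut metadata: HashMap<String, Box<dyn Any>> = HashMap::new();
    metadata.insert("rate_limit".into(), Box::new(TokenBucket { remaining: 3 }));

    // The consumer must know both the key and the concrete type, and must
    // handle a missing key as well as a failed downcast.
    let bucket = metadata
        .get("rate_limit")
        .and_then(|v| v.downcast_ref::<TokenBucket>());
    assert_eq!(bucket, Some(&TokenBucket { remaining: 3 }));

    // ECS style (sketched): components are stored by type, so a filter
    // simply queries for TokenBucket and the type system does the rest.
    let components: Vec<TokenBucket> = vec![TokenBucket { remaining: 3 }];
    assert_eq!(components[0].remaining, 3);
}
```

The real win is that a typo'd key or a wrong type parameter becomes a compile error (no matching query) instead of a silent `None` at runtime.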
Drawbacks
- Requires redesigning the Filter trait, which will subsequently require updating the existing filter implementations, as well as some of the proxy server's logic.
Implementation Plan
If we go with this architecture, I believe there is an approach to implementing this that will help offset some of the drawbacks mentioned.
- Land the quilkin test changes in their own PR.
- Start an ecs branch with the initial changes that work for a single filter.
- Port the remaining filters over to the ecs branch.
- Once everything is working on the ecs branch, we'd merge that into main.
Unresolved Questions
- Which ECS framework to use: the main candidates are bevy-ecs, legion, and shipyard. In my current research it seems like Bevy's ECS with sparse set storage is the fastest for the type of workload we want, where we want to create, process, and destroy entities as soon as possible (see https://github.com/bevyengine/bevy/pull/1525 for some benchmark comparisons).