Open Geal opened 5 years ago
For the record, @Geal and I talked a bit out-of-band, about possible optimizations:
cranelift
to compile routing rules to (specialized) machine code.
This wouldn't supplant the trie vs. FSM discussion, as the code would be implementing essentially a datastructure, and FSMs and tries are suited to different optimizations.(2) is pretty exciting, because it means that additional, low-volume hostnames create very little overhead. For instance if a Sozu instance has hosts patate.com
, patate.io
, patate.sozu.io
, patatate.sozu.io
, patatatate.sozu.io
, patatatatate.sozu.io
, ... , with the vast majority of its traffic going to the pa(ta){n}te.sozu.io
(patate
with ta
repeated N times), we would do:
.io
, then sozu
, then pa(ta){N}te
.It's a pretty pathological example, but I made it up to showcase how bad slowdown in tries can be. :3
@Geal I think it would make sense to discuss first what's the expressive power that the configuration language needs to support.
For instance, if you want regexps as a first-class citizen (rather than as “after trie” rules), compiling to FSMs will likely be the happy path. On the other hand, tries support some useful optimizations that might not be so easy with FSMs, like “vectorized” matching (i.e. represent a string prefix as a u64
, compare up to 8 characters at a time).
Same for “wildcard domain with a path constraint”: it's doable in both FSM and tries, but there's some hairy logic involved and it might be expensive (imagine we check /.well-known/acme-challenges
first thing on every single query). OTOH, we might be able to optimize the overhead away if we know that path is very seldomly taken?
@Geal OK, now I got nerdsniped into reading a whole bunch of automata theory papers to see if people have done multi-character, sublinear matching. (i.e. can we have all the optimizations that work on tries, on FSMs?)
I'm not fixed on the expressive power we want here, there will be a tradeoff between how flexible it can be, the speed it can run at (I'm not too worried about memory usage, considering how many applications we're currently handling at Clever Cloud), and the cost of modifying the data structure or code: workers are single threaded, so any time spent changing the structure is time spent not handling request, and the routing structure is not something that I would keep in an Arc<Mutex<>>
because of synchronization costs.
@nbraud I'm quite interested in those papers as well, if you see something interesting, could you post it here?
@Geal I would have expected that the routing structure can be kept in an RwLock<RefCell<T>>
, and the thread that computes the new structure can keep read-only access (concurrently to all threads handling queries) while computing the new one, and then go routing.write()?.replace(new_routing)
or somesuch, so the service interruption is only the time to write a pointer (plus sync overhead)
I spent a bit of time to clean up the trie benchmarks and make it easier to compare them.
As a side note to my original message: the routing trie matches on the domain, and returns as value a Vec
of HttpFront
. We then test path prefixes in this Vec
one by one. This is generally fast enough because there are not that many path prefix rules to test.
It could make sense for the new design to reuse this idea, ie having an optimized algorithm to reduce the search space, then apply some rules linearly. (the annoying part is when one of those rules is shared by a lot of applications, like ACME challenges)
another parameter on which we could play: I'd be ok with generating something in the master process, then sending it to the workers, if the generation process takes some time and we want to do it on each update.
There will be another important detail with replacement solutions like this one: if we want to keep an intermediate step (example: we already evaluated the URI and host but are waiting for other headers), then the struct would need to be kept in a Rc
or something like that, to make sure we're not replacing a structure that's still in use by one session.
@Geal Would if make sense to have the trie (or fst) match on the domain and return a trie/fst that matches the path and returns an HttpFront
? That might make things easier if/when adding regexp conditions.
Also, the promised braindump on possible optimizations and related work (caveat lector: it's 7.30am here, and I've been up since 2am):
u64
containing 8 packed u8
s): A hybrid multiple-character transition finite-automaton for string matching enginefst
(or publishable as an auxiliary crate, I guess).Not sure the linked papers are the best or most useful, but that's what I found when searching for the ideas I had in mind; feel free to link other things if you fear I will run out of reading material ;) (I had other papers opened, but it was more about bit-parallel evaluation of NFAs and I can't recall offhand why I thought that would be relevant.)
I'm interested in a way to represent a trie a bit like fst
is doing. Right now there are allocations in each node, but fst
has a compact representation in a byte stream, with offsets indicating jumps i the stream.
It's slower to edit (it requires regenerating part or all of the stream) but it's very small and fast to go through.
I added an example with fst
in https://github.com/sozu-proxy/trie_benches
The result is that generating the state machine takes x4 more time than generating a trie, and we would have to generate a new fst every time we change the routing.
I've seen it take 4 to 6 times longer than the trie when looking up existing domains, but it gets slightly faster when the domain is not in the list.
I was talking with Guillaume Logerot, who works in the same coworking space I'm in and knows a bit about text search and FSM. Here's what he suggested:
<reversed domain><SEP><path prefix>
. We would probably be able to express more interesting matches like alternatives or globs, and that would put path matching in the same routing structureI have spent some time testing various solutions, the results are available in the readme of https://github.com/sozu-proxy/trie_benches
To sum up:
There's one solution that emerged as simple enough to maintain, fast enough and flexible enough, but it's an ugly idea compared to the others 😄
We separate the domain in labels (www.example.com
-> [com, example, www]
) and build a tree of hashmaps: each node contains a hashmap of the next level's labels. Regex and wildcard matching can be done easily at the same level: if the hashmap does not have the entry, test regexps or wildcard if they're available (no backtracking like the specialized trie solution). URL matching can be done in a separate way:
To accomodate other use cases, I also plan to add "pre" and "post" rules that could be tested one by one, and might have more flexible routing rules than what is allowed in the main routing structure.
This way, we will be able to handle common use cases (like the ACME protocol one), allow for sites with high traffic but few rules (linear rules in "pre"), handle the hosting use case (lots of domains, few URL matching per domain), and the microservices use case (few domains, lots of URL matches). That last one will take a small hit compared to matching the whole domaine at once due to the hashmap tree, but it's manageable.
/cdn[0-9]+/.example.com
. Restricting them to domain labels should be flexible enough~
? We could also modify the Frontend
structure to have either a path prefix or a path regexpRouter
structure that will take hostname and URL as arguments and test "pre" rules, routing tree matching, and "post" rulesFrontend
structure to reject if matched: we might want to reject some requests directly at the load balancer instead forwarding them to the backend applicationI am not decided yet on how the "pre" and "post" rules can apply, especially considering that matching on headers is tricky in sozu: we do not assume we will get the whole request in on go, and do not build an internal representation of the request. Instead, sozu tries to match the route as soon as it has the URL and host, and for the rst of the headers, keep as few information as possible, and just knows if the data should be transmitted directly to the backend or not.
pending work on #585 and #586
the routing rewrite is done, available in the routing
and 0.12
branches.
this will go in the next release
The current trie based router is useful and reasonably fast, but is not flexible enough. Let's think about where we can go from here, what new features we want, etc.
current solution
we have two tries, one to match a reversed domain (so that we can build a suffix tree) to a
Rc<Certificate>
(because a certificate can match multiple domains, wildcards, etc) for SNI, and one to match a reversed domain to an application ID (an application can match multiple domains, wildcards, and we can also match on a prefix of the URI). We have different tries because there is not a 1<->1 match between certificates and applications, and they can be modified separately.Those tries have been optimized a bit to branch on specific characters (various implementations are available at https://github.com/sozu-proxy/trie_benches ) and incremental updates (add/remove, coalescing nodes, etc).
current issues
it matches on the domain first
In the case of ACME support, we need to add an application matching the
/.well-known/acme-challenge
path for every domain. With the current sozu routing, we need to add a new frontend for each domain.only way to match are on domain, domain suffix or path prefix
We might want more flexible matching options, like subdomain element matching a regex, or matching on the presence or value of a Header, source IP, etc.
one trie per
TcpListener
so if we listen on multiple HTTP ports, or multiple HTTPS ports, we have one trie per port. It's not necessarily a big issue, since often we'll only have port 80 and 443, and they do not always have the same routes.
Additional needs
blacklists/whitelists
The load balancer is in a good position to filter requests, along with routing. It could prevent requests to unknown routes from going to the backend server.
proposals
I am not decided yet on how to proceed, so let's propose anything that could be interesting there.
pre and post list of rules
Having a list of rules that are very flexible, that can be tried before or after the trie is tested.
representing the router as a finite state machine
The fst crate got some interesting results, maybe there's something we could reuse there? https://blog.burntsushi.net/transducers/