plabayo / rama

modular service framework to move and transform network packets
https://ramaproxy.org
Apache License 2.0

add benchmark tooling for rama-cli and profile rama #340

Open GlenDC opened 1 month ago

GlenDC commented 1 month ago

Current request throughput is pretty poor... Some benchmarks that came in measure at < 400 req/sec... That's embarrassingly slow.

GlenDC commented 1 month ago

For now we can just benchmark:

as these scenarios are probably going to be the best ones supported for v0.2.
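
To keep track of this it would help if the benchmark tooling includes a small load generator of its own. A rough sketch of what such a measurement could look like (purely hypothetical: it assumes reqwest + tokio on the client side, which is not part of rama, and the proxy/upstream addresses are made up):

```rust
use std::time::Instant;

// hypothetical addresses: a local rama proxy and a local upstream to hit through it
const PROXY_ADDR: &str = "http://127.0.0.1:62001";
const TARGET_URL: &str = "http://127.0.0.1:62002/";
const CONCURRENCY: usize = 32;
const REQUESTS_PER_WORKER: usize = 500;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // one shared client, routed through the proxy under test
    let client = reqwest::Client::builder()
        .proxy(reqwest::Proxy::http(PROXY_ADDR)?)
        .build()?;

    let start = Instant::now();
    let mut handles = Vec::with_capacity(CONCURRENCY);
    for _ in 0..CONCURRENCY {
        let client = client.clone();
        handles.push(tokio::spawn(async move {
            for _ in 0..REQUESTS_PER_WORKER {
                // ignore individual failures here; we only care about throughput
                let _ = client.get(TARGET_URL).send().await;
            }
        }));
    }
    for handle in handles {
        handle.await?;
    }

    let total = CONCURRENCY * REQUESTS_PER_WORKER;
    let elapsed = start.elapsed().as_secs_f64();
    println!(
        "{total} requests in {elapsed:.2}s -> {:.0} req/sec",
        total as f64 / elapsed
    );
    Ok(())
}
```

An existing load generator (wrk, oha, ...) can serve as a cross-check for whatever ends up in rama-cli.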

GlenDC commented 1 month ago

I forgot that in the current http backend we require a Mutex around the http client, e.g.:

#[derive(Debug)]
// TODO: once we have hyper embedded in `rama_core` we can
// drop this mutex, as there is no inherent reason for `sender` to be mutable...
pub(super) enum SendRequest<Body> {
    Http1(Mutex<hyper::client::conn::http1::SendRequest<Body>>),
    Http2(Mutex<hyper::client::conn::http2::SendRequest<Body>>),
}

I bet this already explains a big part of why it is so slow. For sure there is other stuff that can be improved as well (haven't profiled yet), but let's first get the fork+embed work of hyper done, so that we can start from a benchmark without this mutex in place, as it will then no longer be required.
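
For reference, a rough sketch (not the actual rama code) of what sending a request through that enum ends up looking like while the Mutex is there; a tokio Mutex is assumed here purely for the sketch. Since hyper's `send_request` takes `&mut self`, every request over a shared connection has to go through that lock, so concurrent requests serialize behind it:

```rust
use hyper::body::Incoming;
use hyper::{Request, Response};
use tokio::sync::Mutex;

enum SendRequest<Body> {
    Http1(Mutex<hyper::client::conn::http1::SendRequest<Body>>),
    Http2(Mutex<hyper::client::conn::http2::SendRequest<Body>>),
}

impl<Body> SendRequest<Body>
where
    Body: hyper::body::Body + 'static,
{
    async fn send_request(&self, req: Request<Body>) -> hyper::Result<Response<Incoming>> {
        match self {
            // hyper's `send_request` takes `&mut self`, so every request first
            // has to acquire the lock; concurrent requests over the same
            // connection serialize behind it
            SendRequest::Http1(sender) => sender.lock().await.send_request(req).await,
            SendRequest::Http2(sender) => sender.lock().await.send_request(req).await,
        }
    }
}
```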

GlenDC commented 1 month ago

To test the theory I ran the benchmark against a rama-based http server. And yeah, that is a lot faster... Still not as fast as I would hope, but it is better. We can circle back to this issue after the hyper migration has happened.

GlenDC commented 2 weeks ago

Started doing some profiling. It seems to have less to do with the Mutex than expected (we no longer use one for h2, only for h1). Instead, a lot of time is spent because we use a connection for only a single request, which is costly as it means setting up the whole TLS handshake again every time...

Connection pooling is gonna have to be done in 0.3 for sure, and done decently. After that is in place we can also see what can be improved around the TLS usage.
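
For the record, the pooling idea boils down to something like this very rough sketch (hypothetical names, not a proposal for the actual 0.3 API): key established connections by authority and hand out idle ones, so the TCP + TLS handshake is paid once per connection instead of once per request.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical pool key: one bucket per target authority.
#[derive(Clone, PartialEq, Eq, Hash)]
struct PoolKey {
    scheme: String,
    host: String,
    port: u16,
}

/// Minimal pool over some established connection type `C`
/// (e.g. an already TLS-wrapped http/1.1 connection).
struct Pool<C> {
    idle: Arc<Mutex<HashMap<PoolKey, Vec<C>>>>,
}

impl<C> Pool<C> {
    fn new() -> Self {
        Self {
            idle: Arc::new(Mutex::new(HashMap::new())),
        }
    }

    /// Take an idle connection for this key if there is one; otherwise the
    /// caller has to dial + handshake a fresh one (the expensive path).
    fn checkout(&self, key: &PoolKey) -> Option<C> {
        self.idle.lock().unwrap().get_mut(key).and_then(|conns| conns.pop())
    }

    /// Put a still-healthy connection back so the next request to the same
    /// authority can skip the TCP + TLS setup entirely.
    fn checkin(&self, key: PoolKey, conn: C) {
        self.idle.lock().unwrap().entry(key).or_default().push(conn);
    }
}
```

A real pool also needs idle timeouts, per-key limits and health checks, and for h2 it is more a matter of sharing one multiplexed connection per authority than keeping a list of idle ones.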