proto/h1: spurious copy-allocation in request parsing hotpath

hyperium / hyper

An HTTP library for Rust

https://hyper.rs

MIT License

14.08k stars 1.55k forks source link

proto/h1: spurious copy-allocation in request parsing hotpath #3574

Closed lucab closed 4 months ago

lucab commented 4 months ago

Version hyper-1.1.0

Platform Linux x86_64

Description

I was looking at the memory allocation patterns of a hyper-1.0 HTTP1 server with a fastwebsocket endpoint (basically equivalent to this example) and I noticed that there is a spurious memory copy-allocation in the uri.parse()? URI parsing logic here: https://github.com/hyperium/hyper/blob/00a703a9ef268266f8a8f78540253cbb2dcc6a55/src/proto/h1/role.rs#L157-L168

This is visible in memory profiles as a Bytes::copy_from_slice() coming from a Uri::from_str():

uri-parse

I suspect this is the root-cause of some performance impact that has been noticed in the CPU profiles at https://github.com/hyperium/hyper/issues/3258#issuecomment-1623809448.

lucab commented 4 months ago

After looking a bit into this request-parsing logic, it seems to me that the underlying problem is that the httparse crate does not make use of memory sharing through the bytes crate. It returns a fully parsed &str instead. So this codepath is not constructing an Uri from a shared Bytes, but doing a full memory copy instead.

I suspect this may require porting httparse to use the bytes crate for sharing the existing memory content.

seanmonstar commented 4 months ago

Ah, this was change actually was a performance improvement back when I did it: Bytes back then had an inline variant, and fitting / into that variant was faster than an atomic clone. Of course, Bytes no longer has that variant.

Changing it back isn't impossible, it'd be similar to how the headers are cloned: the &str can be turned into indices into the original buffer, and then we can slice from the Bytes.

seanmonstar commented 4 months ago

Would also be easy-ish to see if the pipeline benchmarks improve with the change.

lucab commented 4 months ago

@seanmonstar thanks for the hint, I didn't notice there was already some &str -> Bytes dance. I've followed the same approach for the path URI, PR at https://github.com/hyperium/hyper/pull/3575.

There are both a pipeline/hello_world_16 macro-benchmark and a bench_parse_incoming micro-benchmark related to this logic. I run several master-vs-PR benches and the numbers are comparable, here are the runs with the smallest variation intervals observed on my workstation:

# master
test hello_world_16                               ... bench: 12,762 ns/iter (+/- 1,526) = 155 MB/s
test proto::h1::role::tests::bench_parse_incoming ... bench: 1,558 ns/iter (+/- 56) = 563 MB/s

# PR-3575
test hello_world_16                               ... bench: 12,733 ns/iter (+/- 1,606) = 155 MB/s
test proto::h1::role::tests::bench_parse_incoming ... bench: 1,546 ns/iter (+/- 9) = 567 MB/s