@bgreni this is very impressive!! Love seeing how the ~940-line header implementation is condensed to ~80 lines. And the code now feels much more Pythonic, too.
Definitely on board with moving towards a more "idiomatic Mojo" approach; gojo was helpful to move fast, but ideally I'd like to gradually refactor the whole codebase to a more Mojo-native code style. Although how "idiomatic" is defined in Mojo is still uncertain, I think the stdlib code and the Python ecosystem could be a good reference.
Also good point about the importance of control; I'm a bit surprised by the performance increase though, intuitively I'd expect a Dict to be slower than directly accessing a struct. Do you think the performance gain is due to switching to a different Reader implementation, or am I missing something?
> Although how "idiomatic" is defined in Mojo is still uncertain
Yeah, the idiomatic way of doing things in Mojo is unfortunately still evolving -- an example being that using Formatter is currently the idiomatic way to write out long strings like we do during request encoding, but frankly it might not stay that way. Personally, from my experience contributing to the stdlib, I think staying as Pythonic as possible is desirable since that's what Mojo's target audience is expecting/used to.
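Roughly, the pattern looks like this with the current stdlib (hypothetical StatusLine type just for illustration; the trait and method names have been churning between releases, so treat this as a sketch rather than the exact API):

```mojo
from utils import Formattable, Formatter


@value
struct StatusLine(Formattable):
    var protocol: String
    var status_code: Int

    fn format_to(self, inout writer: Formatter):
        # Stream the pieces straight into the writer instead of building
        # up intermediate Strings through repeated concatenation.
        writer.write(self.protocol, " ", self.status_code, "\r\n")
```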
> Also good point about the importance of control; I'm a bit surprised by the performance increase though, intuitively I'd expect a Dict to be slower than directly accessing a struct. Do you think the performance gain is due to switching to a different Reader implementation, or am I missing something?
I think it comes down to skipping the work of actually parsing the fields: the original code does a lot of string comparisons, whereas now we simply split on the colon char and dump the key/value into a dict. And note that in this case we're measuring in nanoseconds, so the difference isn't that big, but it could be meaningful for server-side performance.
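For illustration, the per-line handling amounts to something like this (a rough sketch of the idea, not the exact code in the PR):

```mojo
from collections import Dict


fn parse_header_line(inout headers: Dict[String, String], line: String):
    # Find the first colon and store the pair straight into the Dict --
    # no per-field string comparisons like the old header structs did.
    var idx = line.find(":")
    if idx == -1:
        return
    var key = line[:idx].lower()
    var value = line[idx + 1:].strip()
    headers[key] = value
```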
@saviorand I've just pushed some new changes, let me know what you think!
I've reworked URI as well, and done away with most of the String/Bytes conversions since String has some fun optimizations like using vectorized operations for string comparison, and it heavily simplifies the interfaces.
I haven't touched PythonClient yet since I'm actually not sure I understand what it's for?
@bgreni I added both the PythonClient and the PythonServer to Lightbug initially to have a working implementation quickly; they both use Python libraries for e.g. socket operations instead of C calls. Still keeping them around in case someone would prefer to use a Python-based client/server instead of the default one, but they're already a bit outdated. Feel free to ignore them!
Re: the string vs bytes, I understand there's a lot of work happening in the stdlib around string optimizations; curious what the performance difference is for the changes you made compared to the previous implementation. Previously, in an end-to-end benchmark with wrk (sending a bunch of requests at a server within a given timeframe), working with bytes directly was faster, but it could well be that this has changed now. I can try benchmarking myself so as not to distract you. Thanks for the updates!
@saviorand Cool, thanks for bringing wrk to my attention. I'm getting similar req/s numbers it seems, but the latency numbers with my changes are a bit concerning; let me see if I can work those down a bit.
Using wrk -t12 -c12 -d10s http://127.0.0.1:8080/ against the server in lightbug.🔥
Before:
12 threads and 12 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 0.95ms 5.71ms 100.00ms 99.13%
Req/Sec 1.44k 768.31 1.97k 73.68%
16353 requests in 10.09s, 24.94MB read
Requests/sec: 1620.35
Transfer/sec: 2.47MB
After:
Running 10s test @ http://127.0.0.1:8080/
12 threads and 12 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 7.59ms 43.30ms 400.83ms 96.88%
Req/Sec 1.63k 735.81 2.21k 83.84%
16342 requests in 10.09s, 24.92MB read
Requests/sec: 1619.67
Transfer/sec: 2.47MB
EDIT: Ok, upon further testing I'm sometimes getting results like this, so I'm not sure what the deal is, but it could be reasonable to claim the performance before/after is comparable?
Running 10s test @ http://127.0.0.1:8080/
12 threads and 12 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 430.73us 78.15us 2.75ms 97.75%
Req/Sec 1.86k 521.93 2.17k 81.82%
16344 requests in 10.09s, 24.92MB read
Requests/sec: 1619.60
Transfer/sec: 2.47MB
> @bgreni I added both the PythonClient and the PythonServer to Lightbug initially to have a working implementation quickly; they both use Python libraries for e.g. socket operations instead of C calls. Still keeping them around in case someone would prefer to use a Python-based client/server instead of the default one, but they're already a bit outdated. Feel free to ignore them!
With these changes they don't build anymore so in that case maybe we have to remove them entirely?
@bgreni sure, I can remove the Python stuff, will push some changes here tomorrow.
For benchmarking I could also recommend running against bench.mojo, since lightbug.🔥 serves an image and is a bit less representative of the usual plaintext/JSON output one would expect from a web server. The numbers there are a bit higher overall though; for example, when we were running the benchmark against the handler in the version of Lightbug from June this year, the results were:
wrk -t1 -c1 -d1s http://0.0.0.0:8080/
Running 1s test @ http://0.0.0.0:8080/
1 threads and 1 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 33.68us 11.31us 368.00us 96.26%
Req/Sec 29.48k 2.38k 30.59k 90.91%
32251 requests in 1.10s, 5.04MB read
Requests/sec: 29320.58
Transfer/sec: 4.59MB
Could be a good baseline to compare against, or we can just try running the same against main. The -t1 -c1 is because there is no concurrency in Lightbug yet, so it's just one connection at a time.
@saviorand Ah, looks like part of the problem was that I didn't notice the default Connection header for HTTPResponse is expected to be keep-alive in the server, so tcp_keep_alive wasn't having any effect with my changes. After fixing that, here are some preliminary results on my machine. My results with the existing implementation were similar to what you posted.
Running 1s test @ http://0.0.0.0:8080/
1 threads and 1 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 28.67us 4.61us 199.00us 87.30%
Req/Sec 33.96k 1.53k 35.05k 90.91%
37186 requests in 1.10s, 5.82MB read
Requests/sec: 33807.79
Transfer/sec: 5.29MB
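For anyone following along, the keep-alive point above boils down to a check along these lines (hypothetical helper, not Lightbug's actual server loop):

```mojo
fn should_keep_alive(connection_header: String) -> Bool:
    # Hypothetical helper: if the response's Connection header defaults to
    # "close" rather than "keep-alive", the server closes the socket after
    # every response and tcp_keep_alive never gets a chance to matter.
    return connection_header.lower() == "keep-alive"
```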
@bgreni that's great! Marked increase over the previous performance 👏 👏 I think changing the logic to not go through the whole parsing is also completely justified.
@saviorand Ok awesome, I'll do some housekeeping (proper comments and such) and undraft the PR when it's ready
Added bench_server.mojo and magic tasks for running tests and benchmarks.

@bgreni everything looks great, I can feel the quality of this codebase improving with every commit you're making! Looks like you were faster than me with removing the Python Client/Server, thanks for that.
Pushed a commit to enable the client test which is just a simple request to httpbin; related code wasn't compiling previously due to the PythonClient being outdated.
Thinking I'll also run mojo format on all files now and see if there are any changes.
> @bgreni everything looks great, I can feel the quality of this codebase improving with every commit you're making!
Thank you for indulging me on this refactor, I'm glad you like the result!
@bgreni pushed one more fix for client.mojo and updated the README, ready to merge.
Merged! Huge thanks again for the contribution 🔥
@saviorand I figured I might post a draft PR for this so I can see what you think about my approach before I get too much deeper into applying the changes and doing more thorough unit testing
Main points to review:
- Replaced RequestHeader and ResponseHeader with a common Headers struct
- Headers is essentially a Dict wrapper with parsing logic included (rough sketch below)
- HTTPRequest and HTTPResponse replace the gojo Reader with an internal implementation. I have opinions about the usage of gojo in general in terms of mixing Go-style semantics into a Mojo codebase, but also, for something as performance-sensitive as request parsing, we probably want full control of it to make specific optimizations later?

Obviously this isn't done yet so it doesn't mean much, but so far some light benchmarking I've done shows significant performance improvements.
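A minimal sketch of what that common Headers shape could look like (hypothetical, for illustration only, not the exact code in this PR):

```mojo
from collections import Dict


struct Headers:
    # Thin wrapper around a Dict keyed by lower-cased header names,
    # shared by both requests and responses.
    var _entries: Dict[String, String]

    fn __init__(inout self):
        self._entries = Dict[String, String]()

    fn __contains__(self, key: String) -> Bool:
        return key.lower() in self._entries

    fn __getitem__(self, key: String) raises -> String:
        return self._entries[key.lower()]

    fn __setitem__(inout self, key: String, value: String):
        self._entries[key.lower()] = value
```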
Fixes: #44