@bgreni this is very impressive!! Love seeing how the ~940-line header implementation is condensed to ~80 lines. And the code now feels much more Pythonic, too.
Definitely on board with moving towards a more "idiomatic Mojo" approach; gojo was helpful to move fast, but ideally I'd like to gradually refactor the whole codebase to a more Mojo-native code style. Although how "idiomatic" is defined in Mojo is still uncertain, I think the stdlib code and the Python ecosystem could be a good reference.
Also good point about the importance of control; I'm a bit surprised by the performance increase though, intuitively I'd expect a Dict to be slower than directly accessing a struct. Do you think the performance gain is due to switching to a different Reader implementation, or am I missing something?
> Although how "idiomatic" is defined in Mojo is still uncertain
Yeah, the idiomatic way of doing things in Mojo is unfortunately still evolving -- an example being that using Formatter is currently the idiomatic way to write out long strings like we do during request encoding, but frankly it might not stay that way. Personally, from my experience contributing to the stdlib, I think staying as Pythonic as possible is desirable since that's what Mojo's target audience is expecting/used to.
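Roughly, the pattern looks like this with the current stdlib (hypothetical StatusLine type just for illustration; the trait and method names have been churning between releases, so treat this as a sketch rather than the exact API):

```mojo
from utils import Formattable, Formatter


@value
struct StatusLine(Formattable):
    var protocol: String
    var status_code: Int

    fn format_to(self, inout writer: Formatter):
        # Stream the pieces straight into the writer instead of building
        # up intermediate Strings through repeated concatenation.
        writer.write(self.protocol, " ", self.status_code, "\r\n")
```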
> Also good point about the importance of control; I'm a bit surprised by the performance increase though, intuitively I'd expect a Dict to be slower than directly accessing a struct. Do you think the performance gain is due to switching to a different Reader implementation, or am I missing something?
I think it comes down to skipping the work of actually parsing the fields: the original code does a lot of string comparisons, whereas now we simply split on the colon char and dump the key/value into a dict. And note that in this case we're measuring in nanoseconds, so the difference isn't that big, but it could be meaningful for server-side performance.
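For illustration, the per-line handling amounts to something like this (a rough sketch of the idea, not the exact code in the PR):

```mojo
from collections import Dict


fn parse_header_line(inout headers: Dict[String, String], line: String):
    # Find the first colon and store the pair straight into the Dict --
    # no per-field string comparisons like the old header structs did.
    var idx = line.find(":")
    if idx == -1:
        return
    var key = line[:idx].lower()
    var value = line[idx + 1:].strip()
    headers[key] = value
```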
@saviorand I've just pushed some new changes, let me know what you think!
I've reworked URI as well, and done away with most of the String/Bytes conversions since String has some fun optimizations like using vectorized operations for string comparison, and it heavily simplifies the interfaces.
I haven't touched PythonClient yet since I'm actually not sure I understand what it's for?
@bgreni I added both the PythonClient and the PythonServer to Lightbug initially to have a working implementation quickly; they both use Python libraries for e.g. socket operations instead of C calls. Still keeping them around in case someone would prefer to use a Python-based client/server instead of the default one, but they're already a bit outdated. Feel free to ignore them!
Re: the string vs bytes, I understand there's a lot of work happening in the stdlib around string optimizations; curious what the performance difference is for the changes you made compared to the previous implementation. Previously, in an end-to-end benchmark with wrk (sending a bunch of requests at a server within a given timeframe), working with bytes directly was faster, but it could well be that this has changed now. I can try benchmarking myself so as not to distract you. Thanks for the updates!
@saviorand Cool, thanks for bringing wrk to my attention. I'm getting similar req/s numbers it seems, but the latency numbers with my changes are a bit concerning; let me see if I can work those down a bit.
Using wrk -t12 -c12 -d10s http://127.0.0.1:8080/ against the server in lightbug.🔥
Before:
12 threads and 12 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 0.95ms 5.71ms 100.00ms 99.13%
Req/Sec 1.44k 768.31 1.97k 73.68%
16353 requests in 10.09s, 24.94MB read
Requests/sec: 1620.35
Transfer/sec: 2.47MB
After:
Running 10s test @ http://127.0.0.1:8080/
12 threads and 12 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 7.59ms 43.30ms 400.83ms 96.88%
Req/Sec 1.63k 735.81 2.21k 83.84%
16342 requests in 10.09s, 24.92MB read
Requests/sec: 1619.67
Transfer/sec: 2.47MB
EDIT: Ok, upon further testing I'm sometimes getting results like this, so I'm not sure what the deal is, but it could be reasonable to claim the performance before/after is comparable?
Running 10s test @ http://127.0.0.1:8080/
12 threads and 12 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 430.73us 78.15us 2.75ms 97.75%
Req/Sec 1.86k 521.93 2.17k 81.82%
16344 requests in 10.09s, 24.92MB read
Requests/sec: 1619.60
Transfer/sec: 2.47MB
> @bgreni I added both the PythonClient and the PythonServer to Lightbug initially to have a working implementation quickly; they both use Python libraries for e.g. socket operations instead of C calls. Still keeping them around in case someone would prefer to use a Python-based client/server instead of the default one, but they're already a bit outdated. Feel free to ignore them!
With these changes they don't build anymore so in that case maybe we have to remove them entirely?
@bgreni sure, I can remove the Python stuff, will push some changes here tomorrow.
For benchmarking I could also recommend running against bench.mojo, since lightbug.🔥 serves an image and is a bit less representative of the usual plaintext/JSON output one would expect from a web server. The numbers there are a bit higher overall though; for example, when we were running the benchmark against the handler in the version of Lightbug from June this year, the results were:
wrk -t1 -c1 -d1s http://0.0.0.0:8080/
Running 1s test @ http://0.0.0.0:8080/
1 threads and 1 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 33.68us 11.31us 368.00us 96.26%
Req/Sec 29.48k 2.38k 30.59k 90.91%
32251 requests in 1.10s, 5.04MB read
Requests/sec: 29320.58
Transfer/sec: 4.59MB
Could be a good baseline to compare against, or we can just try running the same against main. The -t1 -c1 is because there is no concurrency in Lightbug yet, so it's just one connection at a time.
@saviorand Ah, looks like part of the problem was that I didn't notice the default Connection header for HTTPResponse is expected to be keep-alive in the server, so tcp_keep_alive wasn't having any effect with my changes. After fixing that, here are some preliminary results on my machine. My results with the existing implementation were similar to what you posted.
Running 1s test @ http://0.0.0.0:8080/
1 threads and 1 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 28.67us 4.61us 199.00us 87.30%
Req/Sec 33.96k 1.53k 35.05k 90.91%
37186 requests in 1.10s, 5.82MB read
Requests/sec: 33807.79
Transfer/sec: 5.29MB
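For anyone following along, the keep-alive point above boils down to a check along these lines (hypothetical helper, not Lightbug's actual server loop):

```mojo
fn should_keep_alive(connection_header: String) -> Bool:
    # Hypothetical helper: if the response's Connection header defaults to
    # "close" rather than "keep-alive", the server closes the socket after
    # every response and tcp_keep_alive never gets a chance to matter.
    return connection_header.lower() == "keep-alive"
```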
@bgreni that's great! Marked increase over the previous performance 👏 👏 I think changing the logic to not go through the whole parsing is also completely justified.
@saviorand Ok awesome, I'll do some housekeeping (proper comments and such) and undraft the PR when it's ready
Added bench_server.mojo and magic tasks for running tests and benchmarks.

@bgreni everything looks great, I can feel the quality of this codebase improving with every commit you're making! Looks like you were faster than me with removing the Python Client/Server, thanks for that.
Pushed a commit to enable the client test which is just a simple request to httpbin; related code wasn't compiling previously due to the PythonClient being outdated.
Thinking I'll also run mojo format on all files now and see if there are any changes.
> @bgreni everything looks great, I can feel the quality of this codebase improving with every commit you're making!
Thank you for indulging me on this refactor, I'm glad you like the result!
@bgreni pushed one more fix for client.mojo and updated the README, ready to merge.
Merged! Huge thanks again for the contribution 🔥
@saviorand I figured I might post a draft PR for this so I can see what you think about my approach before I get too much deeper into applying the changes and doing more thorough unit testing
Main points to review:
- Replaced RequestHeader and ResponseHeader with a common Headers struct
- Headers is essentially a Dict wrapper with parsing logic included (rough sketch below)
- HTTPRequest and HTTPResponse replace the gojo Reader with an internal implementation. I have opinions about the usage of gojo in general in terms of mixing Go-style semantics into a Mojo codebase, but also, for something as performance-sensitive as request parsing, we probably want full control of it to make specific optimizations later?

Obviously this isn't done yet so it doesn't mean much, but so far some light benchmarking I've done shows significant performance improvements.
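A minimal sketch of what that common Headers shape could look like (hypothetical, for illustration only, not the exact code in this PR):

```mojo
from collections import Dict


struct Headers:
    # Thin wrapper around a Dict keyed by lower-cased header names,
    # shared by both requests and responses.
    var _entries: Dict[String, String]

    fn __init__(inout self):
        self._entries = Dict[String, String]()

    fn __contains__(self, key: String) -> Bool:
        return key.lower() in self._entries

    fn __getitem__(self, key: String) raises -> String:
        return self._entries[key.lower()]

    fn __setitem__(inout self, key: String, value: String):
        self._entries[key.lower()] = value
```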
Fixes: #44