liamwhite closed this issue 3 months ago
eprof pointed me to this line. Some logging suggests there are massive amounts of binaries being created when uploading large frames. The surrounding logic (in this file and in Frame) is suboptimal. Ideally it should trickle in just enough bytes to construct the header, then collect the payload as iodata up to the indicated payload length, and do a single conversion at the end to avoid wasting memory.
Bandit will receive 1460 bytes in the handler on each transmission, the typical maximum segment size for TCP over Ethernet. Given a frame payload size of 1048576 bytes (1 MiB), this results in 1048576 / 1460 ≈ 719 calls. Each call allocates a binary that is 1460 bytes longer than the previous one. Applying the formula for the sum of the first n natural numbers, that works out to (1460)(719)(719 + 1) / 2 = 377906400 bytes of binaries allocated and garbage collected for receiving a 1 MiB frame (though not necessarily all live at the same time).
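To make the shape of the fix concrete, here is a minimal sketch of the iodata approach (a hypothetical `FrameSketch` module with made-up `push/2` and `finish/1` functions, not Bandit's actual `Frame` code), assuming the header has already been parsed and the payload length is known. The quadratic cost estimated above comes from instead rebuilding a single binary buffer on every packet.

```elixir
defmodule FrameSketch do
  # Hypothetical accumulator: once the payload length is known from the
  # header, each incoming packet is added to an iodata accumulator, and
  # the payload is converted to a binary exactly once at the end.
  defstruct payload_length: 0, received: 0, chunks: []

  # Accumulate a packet as iodata; cheap per packet, no intermediate binaries.
  def push(%__MODULE__{} = state, packet) when is_binary(packet) do
    %{state | received: state.received + byte_size(packet), chunks: [state.chunks | packet]}
  end

  # When the full payload has arrived, do the single conversion.
  def finish(%__MODULE__{payload_length: len, received: len, chunks: chunks}) do
    {:ok, IO.iodata_to_binary(chunks)}
  end

  def finish(%__MODULE__{}), do: :more
end

# Usage: 719 segments of 1460 bytes arriving one at a time (~1 MiB payload)
packets = List.duplicate(:binary.copy(<<0>>, 1460), 719)
state = %FrameSketch{payload_length: 719 * 1460}

final = Enum.reduce(packets, state, &FrameSketch.push(&2, &1))
{:ok, payload} = FrameSketch.finish(final)
byte_size(payload)
#=> 1049740
```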
I built a demo of the iodata idea with very poor code quality - hopefully you can improve it. https://github.com/liamwhite/bandit/commit/2afa7d563297b834c841da8e32bf2b3a62dba5d3
With that I can get `Last request: 5ms, 210 MB/s` in Firefox and `Last request: 4ms, 291 MB/s` in Chromium. So this is an all-around speedup, and Bandit can easily beat Cowboy with this change.
I also investigated a bottleneck with unmasking large frames -- while less severe, unfortunately I think it may not be solvable without help from a NIF.
Right now `Frame.mask/2` uses a tail-recursive function to perform the mask. A binary comprehension,

`for <<h::32 <- words>>, into: <<>>, do: <<Bitwise.bxor(h, mask)::32>>`

improved throughput a little bit over the tail-recursive solution, and

`:crypto.exor(words, :binary.copy(<<mask::32>>, size_in_words))`

improved throughput a little bit over the generator solution, at the expense of wasting a lot of memory. I cannot reliably get more than 300 MB/s of throughput and suspect it probably isn't possible without dropping to native code.
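For reference, a tail-recursive masking loop along these lines is an illustrative sketch only, not the actual `Frame.mask/2`; `words` and `mask` in the snippets above are assumed to be the payload (whole 32-bit words) and the 32-bit masking key. It XORs the payload one word at a time while accumulating the result binary:

```elixir
defmodule MaskDemo do
  import Bitwise

  # Sketch of a tail-recursive mask: XOR the payload against the 32-bit
  # masking key, four bytes at a time, with the result binary as the
  # accumulator. Not Bandit's actual Frame.mask/2.
  def mask(payload, key) when is_binary(payload) and is_integer(key) do
    do_mask(payload, key, <<>>)
  end

  defp do_mask(<<word::32, rest::binary>>, key, acc),
    do: do_mask(rest, key, <<acc::binary, bxor(word, key)::32>>)

  defp do_mask(<<>>, _key, acc), do: acc

  # Final 1-3 bytes: XOR against the leading bytes of the masking key.
  defp do_mask(rest, key, acc) do
    bits = bit_size(rest)
    <<tail::size(bits)>> = rest
    <<key_head::size(bits), _::bits>> = <<key::32>>
    <<acc::binary, bxor(tail, key_head)::size(bits)>>
  end
end

# Masking is its own inverse, so round-tripping returns the original payload.
payload = :binary.copy(<<1, 2, 3, 4, 5>>, 3)
^payload = payload |> MaskDemo.mask(0x12345678) |> MaskDemo.mask(0x12345678)
```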
The need to work with binaries when pattern matching against frames is a known performance cost here; in my experience to date it hasn't been too bad, mostly since the underlying system calls return more than a single packet's worth of data per recv call. I'm guessing that there's something about how FF sends on the client side that messes with that assumption, leading to exactly the performance issue you identify.
I like your fix! Could you open a PR for it and we can clean it up a bit & merge? Something similar would need to be done for HTTP/2 frames as well, but I think it makes sense to establish a pattern with WebSockets before copying it over there.
We've also mused about the idea of folding all of this upstream into ThousandIsland, whereby it could be implemented as a runtime configurable 'framer' module on a connection. More of a 2.0 thing though.
We've spent a bunch of effort golfing the performance of masking to about as good as we can make it at this point, across all sizes and shapes of frames. I don't think that's much of the issue here.
My benchmarking suggests frame masking is by far the most significant remaining bottleneck to reaching higher throughput with socket frames. I'm open to data that would change my mind about that, of course.
I think we should focus on getting the iolist work in #390 landed, then we can look at remaining perf issues. Truthfully, it's been a while since I looked at the masking work. Maybe @moogle19 has some recollection?
Test machine: Ryzen 9 5950X, Elixir 1.15.7, Erlang/OTP 26, Linux 6.8.11.
Cowboy is slightly slower than Bandit on Chromium; Bandit is much slower than Cowboy on Firefox. 100% repeatable. Follow-up of https://github.com/phoenixframework/phoenix_live_view/pull/3386.
Demo program