liamwhite closed this issue 3 months ago
eprof pointed me to this line. Some logging suggests there are massive amounts of binaries being created when uploading large frames. The surrounding logic (in this file and in Frame) is suboptimal. Ideally it should trickle in just enough bytes to construct the header, then collect the payload as iodata up to the indicated payload length, and do a single conversion at the end to avoid wasting memory.
Bandit will receive 1460 bytes in the handler on each transmission, the typical maximum segment size for TCP over Ethernet. Given a frame payload size of 1048576 bytes (1 MiB), this results in 1048576 / 1460 ≈ 719 calls. Each call allocates a binary that is 1460 bytes longer than the previous one. Applying the formula for the sum of the first n natural numbers, that works out to (1460)(719)(719 + 1) / 2 = 377906400 bytes of binaries allocated and garbage collected for receiving a 1 MiB frame (though not necessarily all live at the same time).
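To make the shape of the fix concrete, here is a minimal sketch of the iodata approach (a hypothetical `FrameSketch` module with made-up `push/2` and `finish/1` functions, not Bandit's actual `Frame` code), assuming the header has already been parsed and the payload length is known. The quadratic cost estimated above comes from instead rebuilding a single binary buffer on every packet.

```elixir
defmodule FrameSketch do
  # Hypothetical accumulator: once the payload length is known from the
  # header, each incoming packet is added to an iodata accumulator, and
  # the payload is converted to a binary exactly once at the end.
  defstruct payload_length: 0, received: 0, chunks: []

  # Accumulate a packet as iodata; cheap per packet, no intermediate binaries.
  def push(%__MODULE__{} = state, packet) when is_binary(packet) do
    %{state | received: state.received + byte_size(packet), chunks: [state.chunks | packet]}
  end

  # When the full payload has arrived, do the single conversion.
  def finish(%__MODULE__{payload_length: len, received: len, chunks: chunks}) do
    {:ok, IO.iodata_to_binary(chunks)}
  end

  def finish(%__MODULE__{}), do: :more
end

# Usage: 719 segments of 1460 bytes arriving one at a time (~1 MiB payload)
packets = List.duplicate(:binary.copy(<<0>>, 1460), 719)
state = %FrameSketch{payload_length: 719 * 1460}

final = Enum.reduce(packets, state, &FrameSketch.push(&2, &1))
{:ok, payload} = FrameSketch.finish(final)
byte_size(payload)
#=> 1049740
```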
I built a demo of the iodata idea with very poor code quality - hopefully you can improve it. https://github.com/liamwhite/bandit/commit/2afa7d563297b834c841da8e32bf2b3a62dba5d3
With that I can get `Last request: 5ms, 210 MB/s` in Firefox and `Last request: 4ms, 291 MB/s` in Chromium. So this is an all-around speedup, and Bandit can easily beat Cowboy with this change.
I also investigated a bottleneck with unmasking large frames -- while less severe, unfortunately I think it may not be solvable without help from a NIF.
Right now `Frame.mask/2` uses a tail-recursive function to perform the mask. A binary comprehension,

`for <<h::32 <- words>>, into: <<>>, do: <<Bitwise.bxor(h, mask)::32>>`

improved throughput a little bit over the tail-recursive solution, and

`:crypto.exor(words, :binary.copy(<<mask::32>>, size_in_words))`

improved throughput a little bit over the generator solution, at the expense of wasting a lot of memory. I cannot reliably get more than 300 MB/s of throughput and suspect it probably isn't possible without dropping to native code.
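For reference, a tail-recursive masking loop along these lines is an illustrative sketch only, not the actual `Frame.mask/2`; `words` and `mask` in the snippets above are assumed to be the payload (whole 32-bit words) and the 32-bit masking key. It XORs the payload one word at a time while accumulating the result binary:

```elixir
defmodule MaskDemo do
  import Bitwise

  # Sketch of a tail-recursive mask: XOR the payload against the 32-bit
  # masking key, four bytes at a time, with the result binary as the
  # accumulator. Not Bandit's actual Frame.mask/2.
  def mask(payload, key) when is_binary(payload) and is_integer(key) do
    do_mask(payload, key, <<>>)
  end

  defp do_mask(<<word::32, rest::binary>>, key, acc),
    do: do_mask(rest, key, <<acc::binary, bxor(word, key)::32>>)

  defp do_mask(<<>>, _key, acc), do: acc

  # Final 1-3 bytes: XOR against the leading bytes of the masking key.
  defp do_mask(rest, key, acc) do
    bits = bit_size(rest)
    <<tail::size(bits)>> = rest
    <<key_head::size(bits), _::bits>> = <<key::32>>
    <<acc::binary, bxor(tail, key_head)::size(bits)>>
  end
end

# Masking is its own inverse, so round-tripping returns the original payload.
payload = :binary.copy(<<1, 2, 3, 4, 5>>, 3)
^payload = payload |> MaskDemo.mask(0x12345678) |> MaskDemo.mask(0x12345678)
```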
The need to work with binaries when pattern matching against frames is a known performance cost here; in my experience to date it hasn't been too bad, mostly since the underlying system calls return more than a single packet's worth of data per recv call. I'm guessing that there's something about how FF sends on the client side that messes with that assumption, leading to exactly the performance issue you identify.
I like your fix! Could you open a PR for it and we can clean it up a bit & merge? Something similar would need to be done for HTTP/2 frames as well, but I think it makes sense to establish a pattern with WebSockets before copying it over there.
We've also mused about the idea of folding all of this upstream into ThousandIsland, whereby it could be implemented as a runtime configurable 'framer' module on a connection. More of a 2.0 thing though.
We've spent a bunch of effort golfing the performance of masking to about as good as we can make it at this point, across all sizes and shapes of frames. I don't think that's much of the issue here.
My benchmarking suggests frame masking is by far the most significant remaining bottleneck to reaching higher throughput with socket frames. I'm open to data that would change my mind about that, of course.
I think we should focus on getting the iolist work in #390 landed, then we can look at remaining perf issues. Truthfully, it's been a while since I looked at the masking work. Maybe @moogle19 has some recollection?
Test machine: Ryzen 9 5950X, Elixir 1.15.7, Erlang/OTP 26, Linux 6.8.11.
Cowboy is slightly slower than Bandit on Chromium; Bandit is much slower than Cowboy on Firefox. 100% repeatable. Follow-up of https://github.com/phoenixframework/phoenix_live_view/pull/3386.
Demo program