Improve use of iodata - Githubissues

mtrudel commented 2 years ago

There are a number of places (particularly in the h2 stack) where we're building up buffers based on iodata input and end up needing to flatten them to binaries earlier than is optimal. Mostly this comes down to the fact that there is no way to grab the first n bytes of an iodata without resorting to pattern matching:

# Given data like...
data = ["a", ["b", [[], "c",......,"z"]

# Instead of this...
<first_three_bytes::binary-size(3), rest::binary> = IO.iodata_to_binary(data)

# It would be good to do this (or something similar)....
{first_three_bytes, rest} = IO.split_iodata(data, 3)

Some provisional work for this is at https://github.com/mtrudel/iolist_test, but it's something around 8x slower than binary approaches. I'm assuming that this is mostly because binary conversion / matching is done in C within the BEAM, while my naive iolist splitting heuristic is being done in Elixir. To my eye there's strictly less work to be done in the iolist splitting scenario, so if it were implemented in C I imagine it would be faster / more memory efficient than the binary conversion / matching approach.

It would be useful to understand if / to what extent this is the case, and whether improved iolist primitives in the BEAM would be worth it.

sabiwara commented 2 years ago

Hi! I had a look into the test repo, and tried to code an alternative implementation that "just" recursively walks the IO list and doesn't build intermediate tuples (and probably reduces function calls a bit). It seems to speed up things compared to the current implementation, but is still slower than the BIF :sweat_smile:

To my eye there's strictly less work to be done in the iolist splitting scenario, so if it were implemented in C I imagine it would be faster / more memory efficient than the binary conversion / matching approach.

I think in this scenario we are paying the IO data => binary conversion only once for the first chunk, and every subsequent split is very efficient. If we use iolist splitting, we need to keep working with iolists in the rest of the loop. From a very quick trial with (much) smaller iolists, the benchmarks seem to favor iolist splitting over the BIF, but this would need to be studied into more detail.

mtrudel commented 2 years ago

I like your approach - you're trading off building intermediate tuples with more stack work, but that's likely a good compromise to make.

In terms of converting to a binary, I'm pretty sure we have to do that for at most one binary per call. To wit:

The intent here is to allow bandit's http2 implementation to 'peel off' some leading number of bytes from a Plug-generated iolist response (as at https://github.com/mtrudel/bandit/blob/main/lib/bandit/http2/connection.ex#L420). The way this function is implemented now, any response which exceeds the size of a data frame or the send window size is flattened to a binary for the sole purpose of getting the first n bytes. This is obviously sub-par as it precludes using iolists for the entirety of the path from Plug logic through to :gen_tcp/:ssl calls. As a consequence, it's fine for split_binary to return an iolist for both parts of the split (ie: the 'head' and the 'rest').
This experiment tries to model different ways of accomplishing this. Ultimately though, it's about moving parts of the starting input into 'head' until it would otherwise be longer than required, and then splitting that last iolist entry which would cause that size to be exceeded into two such that 'head' is exactly the length required.
The only binary 'flattening' which would be required here is with whatever binary in the iolist causes this length to be exceeded. Everything else is just a matter of juggling iolist entries around (the exact way to do this juggling is up for debate I think; I'm not well versed enough in how ERTS handles list copies to have a deep opinion here).

mtrudel commented 2 years ago

I've pushed a revised version of my split algorithm at https://github.com/mtrudel/iolist_test/commit/b6dc78593678404725d4385c305cb26d31fc43c2 which is slightly faster but still slower than binary splitting. Comments welcome.

sabiwara commented 2 years ago

The intent here is to allow bandit's http2 implementation to 'peel off' some leading number of bytes from a Plug-generated iolist response

Oh OK, I was missing some context. Due to this comment I was under the impression we wanted to split a big IO list into chunks, with iolist_length >>> chunk_size, hence my reflexion. I suppose the first chunk will be slower in the binary case, but all subsequent chunks would be very fast. If we just want the first chunk + the rest as an iolist though, I suppose split_iolist has a good chance to be faster?

Maybe we should try to bench HTTP2 using it to get a more real-world comparison?

I've pushed a revised version of my split algorithm at mtrudel/iolist_test@b6dc785 which is slightly faster but still slower than binary splitting. Comments welcome.

Looks neat, pretty easy to understand! I like how you use IO.iodata_length to check the nested iolist and add the whole thing without a need to loop on if small. I can imagine some pathological cases like [[[[[big_iolist | ...] | ...] |... where you might end up doing more work, calling iodata_length (in C, but linear) more than needed. This might happen if an iolist is being built using iolist = [iolist | "foo"]. But if this is not a thing with real-world payloads, your approach is probably much faster in practice.

mtrudel commented 2 years ago

This is the thing - obviously the intent here is to make real-world iolists faster; I suspect I could get some good sample data by grabbing a rendered page from EEx as it gets sent via Bandit (to see it in its original iolist form). I'm going to wire that up into a couple of tests.

wojtekmach commented 2 years ago

regarding iodata splitting, this might be worth checking out too: https://github.com/ninenines/cowlib/blob/0f5c2f8922c89c58f51696cce690245cbdc5f327/src/cow_iolists.erl.

chrooti commented 2 years ago

Hello! I tried the following approach and it seems to be often faster than the native one. I hope I haven't missed any pathological case. It seems to consume a lot less memory, too, but I'm afraid it's a bug somewhere in the calculation.

defmodule IOListSplit3 do
  @doc """
  Returns an iolist consisting of the first `length` bytes of the given list, along with the
  remainder of the iolist beyond it as a tuple. Returns `{:error, :length_exceeded}` if there are
  not `length` bytes remaining in the iolist.
  """
  def split(iodata, length) do
    if IO.iodata_length(iodata) >= length do
      do_split(iodata, length)
    else
      {:error, :length_exceeded}
    end
  end

  defp do_split([head | tail], length)
    when is_binary(head) and length != 0
  do
    head_size = byte_size(head)
    if head_size <= length do
      {tail_head, tail_tail} = do_split(tail, length - head_size)
      {[head | tail_head], tail_tail}
    else
      <<head::binary-size(length), rest::binary>> = head
      {head, [rest | tail]}
    end
  end

  defp do_split(iodata, length)
    when is_binary(iodata) and length != 0
  do
    <<head::binary-size(length), rest::binary>> = iodata
    {head, rest}
  end

  defp do_split(iodata, 0) do
    {iodata, []}
  end

  defp do_split([[head_head | head_tail] | tail], length) do
    case head_tail do
      [] -> do_split([head_head | tail], length)
      _ -> do_split([head_head | [head_tail | tail]], length)
    end
  end
end

mtrudel commented 2 years ago

@chrooti - I copied your implementation into my iolist_test benchmarking harness at https://github.com/mtrudel/iolist_test/commit/4f6aa0e373081f7df9d29ffd5dc038f762d60955, and it seems to come out roughly in line with my current best take on the algorithm. Have a look over there and see if I missed anything!

chrooti commented 2 years ago

So I checked a bit your implementation (thanks for reworking it into something actually competitive) and a few things struck me:

unless I imagined things, with your previous iteration of the benchmark (not sure what changed) I was actually getting decent numbers with my very naive version, but I must have messed something up
with your rework I see IO.iodata_length called more than 1 time for iteration, while my idea was to reduce the calls to that function to 1 (I wanted to set up some tracing but the results were sooo promising)

I have not had much free time to experiment with these due to other projects taking my attention, but I thought I'd drop by and at least thank you for the time spent into considering (and fixing) my code

I hope can submit something more meaningful in the near future :)

mtrudel commented 2 years ago

Oh no! To be clear I didn't try and 'fix' your code at all; it was literally just a copy and paste job. If it's not exactly what you'd written in your original snippet then I messed up.

chrooti commented 2 years ago

Okay, I checked it out and it looks to me like: this: https://github.com/mtrudel/iolist_test/blob/master/lib/iolist_split_3.ex is the exact same as your implementation: https://github.com/mtrudel/iolist_test/blob/master/lib/iolist_split.ex

so maybe a copy paste job gone wrong? :)

BTW I benched the follwing example, that allows only well-formed lists, uses tail recursion and should have zero overhead

defmodule IOListSplit3 do
  @doc """
  Returns an iolist consisting of the first `length` bytes of the given list, along with the
  remainder of the iolist beyond it as a tuple. Returns `{:error, :length_exceeded}` if there are
  not `length` bytes remaining in the iolist.
  """

  @compile {:inline, split: 3}

  def split(iodata, length, taken \\ [])

  def split([head | tail], length, taken) do
    head_length = IO.iodata_length(head)
    if head_length <= length do
      split(tail, length - head_length, [head | taken])
    else
      <<head_head::binary-size(length), head_tail::binary>> = head
      {[head_head | taken], [head_tail | tail]}
    end
  end

  def split([], _length, taken) do
    {taken, []}
  end
end

and this still doesn't get close to binary_concat, if anything its performance is very close to your complete implementation, so probably it's impossible to go faster than that.

results:

binary_concat           5.19
iodata_split_3_simple   3.98 - 1.30x slower +58.70 ms
iodata_split            3.60 - 1.44x slower +85.45 ms

victorolinasc commented 1 year ago

Just leaving a comment here: recently in spawnfest there was jason_native that tries to handle these hard to optimize use cases on the BEAM with what he calls nif sprinkle. Specifically, he mentions binary patterns here

Perhaps this could be used here and some other places? I am not proficient enough on C++ and the issue at hand here... feel free to ignore my comment :)

Schultzer commented 1 year ago

Hi @mtrudel

I've refactored your iodata_split_3 solution and got some promising results:

➜  iolist_test-master iex -S mix
Erlang/OTP 25 [erts-13.1.2] [source] [64-bit] [smp:10:10] [ds:10:10:10] [async-threads:1] [jit]

Compiling 1 file (.ex)
Interactive Elixir (1.14.1) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> IOListTest.Benchmark.run
Operating System: macOS
CPU Information: Apple M1 Max
Number of Available Cores: 10
Available memory: 64 GB
Elixir 1.14.1
Erlang 25.1.2

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
parallel: 1
inputs: none specified
Estimated total run time: 1.17 min

Benchmarking binary_concat...
Benchmarking cow_lib_split...
Benchmarking iodata_split...
Benchmarking iodata_split_3...
Benchmarking iodata_split_4...

Name                     ips        average  deviation         median         99th %
iodata_split_4       5883.13       0.170 ms   ±895.31%      0.0744 ms       0.102 ms
binary_concat          12.54       79.72 ms    ±15.82%       82.40 ms      124.12 ms
iodata_split            4.72      212.00 ms    ±40.27%      266.08 ms      277.69 ms
iodata_split_3          4.64      215.36 ms    ±40.54%      269.74 ms      292.38 ms
cow_lib_split           2.76      362.27 ms    ±48.93%      471.91 ms      496.02 ms

Comparison: 
iodata_split_4       5883.13
binary_concat          12.54 - 468.98x slower +79.55 ms
iodata_split            4.72 - 1247.24x slower +211.83 ms
iodata_split_3          4.64 - 1267.01x slower +215.19 ms
cow_lib_split           2.76 - 2131.27x slower +362.10 ms

Memory usage statistics:

Name              Memory usage
iodata_split_4        0.134 MB
binary_concat         38.18 MB - 284.44x memory usage +38.04 MB
iodata_split          75.29 MB - 560.95x memory usage +75.15 MB
iodata_split_3        75.29 MB - 560.95x memory usage +75.15 MB
cow_lib_split        114.96 MB - 856.54x memory usage +114.83 MB

**All measurements for memory usage were the same**
:ok
iex(2)>

I tested the equaltiy of my solution against iodata_split_3 to make sure I didn't intruduce any bugs.

Let me know if you want me to push a PR as I'm not really sure if you intented this for Bandit only or the IO module given your example:

# Given data like...
data = ["a", ["b", [[], "c",......,"z"]

# Instead of this...
<first_three_bytes::binary-size(3), rest::binary> = IO.iodata_to_binary(data)

# It would be good to do this (or something similar)....
{first_three_bytes, rest} = IO.split_iodata(data, 3)

mtrudel commented 1 year ago

Those are some serious numbers! I'd love to see the approach taken for this - probably the best place to land initial work for this would be in a helper module within Bandit (say, Bandit.IO?). Knowing how José likes to incorporate net-new changes to core modules, having an existing implementation 'in the wild' that has had a chance to bake in / demonstrate a long-lived usefulness would make an eventual core lib merge that much easier (to be clear, I think that's the eventual goal here).

I'd suggest opening a PR on Bandit with the implementation in a helper module & some basic test coverage. We can easily wire this up in a couple of hot paths and set the benchmarker on it to see what numbers we get.

Promising!

Schultzer commented 1 year ago

Those are some serious numbers! I'd love to see the approach taken for this - probably the best place to land initial work for this would be in a helper module within Bandit (say, Bandit.IO?). Knowing how José likes to incorporate net-new changes to core modules, having an existing implementation 'in the wild' that has had a chance to bake in / demonstrate a long-lived usefulness would make an eventual core lib merge that much easier (to be clear, I think that's the eventual goal here).

I'd suggest opening a PR on Bandit with the implementation in a helper module & some basic test coverage. We can easily wire this up in a couple of hot paths and set the benchmarker on it to see what numbers we get.

Promising!

So I got some good and bad news:

The bad news is, if we want a general IO.split_iodata that can split any iodata, then it might end up being slower then binary_concat:

Interactive Elixir (1.14.1) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> IOListTest.Benchmark.run
Operating System: macOS
CPU Information: Apple M1 Max
Number of Available Cores: 10
Available memory: 64 GB
Elixir 1.14.1
Erlang 25.1.2

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
parallel: 1
inputs: none specified
Estimated total run time: 1.40 min

Benchmarking binary_concat...
Benchmarking cow_lib_split...
Benchmarking iodata_split...
Benchmarking iodata_split_3...
Benchmarking iodata_split_4...
Benchmarking iodata_split_5...

Name                     ips        average  deviation         median         99th %
iodata_split_4       2317.30        0.43 ms   ±552.67%       0.192 ms        0.28 ms
binary_concat          12.73       78.58 ms    ±16.87%       81.41 ms      111.02 ms
iodata_split_3          4.95      202.15 ms    ±36.42%      245.70 ms      252.58 ms
iodata_split            4.63      215.88 ms    ±39.22%      267.02 ms      274.40 ms
cow_lib_split           3.02      330.61 ms    ±54.10%      460.50 ms      485.64 ms
iodata_split_5          0.87     1149.50 ms    ±52.46%      812.07 ms     2064.65 ms

Comparison: 
iodata_split_4       2317.30
binary_concat          12.73 - 182.10x slower +78.15 ms
iodata_split_3          4.95 - 468.43x slower +201.71 ms
iodata_split            4.63 - 500.26x slower +215.45 ms
cow_lib_split           3.02 - 766.12x slower +330.18 ms
iodata_split_5          0.87 - 2663.74x slower +1149.07 ms

Memory usage statistics:

Name              Memory usage
iodata_split_4         0.34 MB
binary_concat         38.18 MB - 112.33x memory usage +37.84 MB
iodata_split_3        75.29 MB - 221.56x memory usage +74.95 MB
iodata_split          75.29 MB - 221.56x memory usage +74.95 MB
cow_lib_split        114.97 MB - 338.30x memory usage +114.63 MB
iodata_split_5        82.95 MB - 244.08x memory usage +82.61 MB

**All measurements for memory usage were the same**
:ok
iex(2)>

iodata_split_5 is a refactored version of iodata_split_4 which passes cowlib's test: https://github.com/ninenines/cowlib/blob/0f5c2f8922c89c58f51696cce690245cbdc5f327/src/cow_iolists.erl#L58L76, but it is significantly slower then the others, but there might be room for improvements.

The good news is, I believe if we can guarantee that the iodata only contains binaries (no charlist) then we can get a big speed up.

mtrudel commented 1 year ago

I'm pretty sure that we can guarantee (at least in Bandit's case) that an iolist is either a binary or a (possibly improper) recursively defined list of iolists.

Extremely excited to see this; this has been a bugaboo of mine for ages!

mtrudel commented 1 year ago

The typedef for iolists includes bytes:

iodata(): iolist() | binary()
iolist(): maybe_improper_list(byte() | binary() | iolist(), binary() | [])
byte: 0..255

I suppose that means that all charlists are also iolists (and can be validly contained in them), but if you're able to handle bytes as a valid value without performance degradation then we'll be fine. In any case, we control the construction of iolists internally within Bandit, so we can restrict the typing to not inlclude bytes if need be (though that may be a limiting factor in getting a future upstream PR accepted into the standard library if it's not a comprehensive solution).

ryanwinchester commented 1 year ago

@mtrudel I noticed @fhunleth doing some interesting iodata experiments that might be relevant: https://github.com/fhunleth/iodata_nif

mtrudel commented 1 year ago

@mtrudel I noticed @fhunleth doing some interesting iodata experiments that might be relevant:

https://github.com/fhunleth/iodata_nif

Interesting! Frank's as sharp as they come. I'm certainly going to look at this

fhunleth commented 1 year ago

Just saw this. Thanks for looking at the iodata_nif. I have a big caveat, though. My interest is in the iodata that's sent over SPI and I2C hardware buses, so it's not all that much data. I was really expecting iovecs to do better before the experiment especially since I thought my tests were unfairly biased towards them. I'm chalking that the results up to the small data sizes I'm using.

mtrudel / bandit

Improve use of iodata #4