lindig / polly

OCaml bindings for Linux epoll(2)
MIT License
9 stars 4 forks source link

Wait buffer #11

Closed craff closed 1 year ago

craff commented 1 year ago

Introduces EventBuffer to avoid allocating at each wait/wait_fold.

Warning: This PR is based on constant3 (PR#10) and not master, because I added one line in src/constants.ml

Questions:

lindig commented 1 year ago

The allocation using alloca on the C stack is basically cost free and cheaper than any other form of allocation either using malloc or using the OCaml heap because the latter requires garbage collection. Allocation with alloca is just a pointer operation and memory is freed as soon as possible. The fact that this buffer is allocated at every invocation of epoll_wait and does not survive the call is irrelevant. I don't believe that pre-allocating a buffer which is GC'ed later is cheaper or faster. So this would require a benchmark to convince me otherwise.

lindig commented 1 year ago

Maybe epoll_wait and epoll_wait_fold could be unified? The latter is the more general case.

edwintorok commented 1 year ago

Note that if you want the highest performance you should look at using uring (https://github.com/ocaml-multicore/ocaml-uring), which works with OCaml 4.x too (with 5.x you get a higher level and more convenient interface to it with Eio). It is a much lower level interface though.

polly is useful in situations where you do not have uring: an older kernel, when you are not root (in case 'uring' has been restricted to root only), or if uring has been compiled out of the kernel. It is also useful when you want to debug your program using strace (uring avoids making any syscalls in some modes, so it is harder to debug). Also polly doesn't allocate memory on the OCaml side, and requires no work from the GC, whereas 'uring' does perform small allocations in a few places. I would expect the performance of most functions in polly to be dominated by the syscall time (which is quite slow, especially with all the security mitigations in place).

I have some benchmark code for a small HTTP load generator built using polly, but I don't have benchmark code for polly itself. IIRC 'polly' didn't show up in my performance profiles in a significant way, so I haven't looked at optimizing it, but I don't have concrete numbers at hand. When I have some time I can try to rerun the benchmarks and see how much time 'Polly.wait_fold' and 'Polly.upd' takes up.

craff commented 1 year ago

It is still scanned by the minor GC? I was going to benchmark, but I forgot the power outlet of my laptop.

Le 6 août 2023 08:12:19 GMT-10:00, Christian Lindig @.***> a écrit :

The allocation using alloca on the C stack is basically cost free and cheaper than any other form of allocation either using malloc or using the OCaml heap because the latter requires garbage collection. Allocation with alloca is just a pointer operation and memory is freed as soon as possible. The fact that this buffer is allocated at every invocation of epoll_wait and does not survive the call is irrelevant. I don't believe that pre-allocating a buffer which is GC'ed later is cheaper or faster. So this would require a benchmark to convince me otherwise.

-- Reply to this email directly or view it on GitHub: https://github.com/lindig/polly/pull/11#issuecomment-1666936176 You are receiving this because you authored the thread.

Message ID: @.***>

lindig commented 1 year ago

Maybe my understanding is wrong (@edwintorok please correct me), but alloca reserves space on the C stack by increasing the current stack frame. It is invisible to the OCaml GC and is automatically de-allocated when the C function returns. This is why it is so cheap. The only potential problem are very large numbers of val_max because it could overflow the stack. But the structure we are allocating per instance is small: about two 64-bit words.

craff commented 1 year ago

Isn't there a bzero at stack allocation or deallocation?

Le 6 août 2023 08:12:19 GMT-10:00, Christian Lindig @.***> a écrit :

The allocation using alloca on the C stack is basically cost free and cheaper than any other form of allocation either using malloc or using the OCaml heap because the latter requires garbage collection. Allocation with alloca is just a pointer operation and memory is freed as soon as possible. The fact that this buffer is allocated at every invocation of epoll_wait and does not survive the call is irrelevant. I don't believe that pre-allocating a buffer which is GC'ed later is cheaper or faster. So this would require a benchmark to convince me otherwise.

-- Reply to this email directly or view it on GitHub: https://github.com/lindig/polly/pull/11#issuecomment-1666936176 You are receiving this because you authored the thread.

Message ID: @.***>

edwintorok commented 1 year ago

Variables allocated on the stack are uninitialized, there is no need for a bzero there. You can look at the generated assembly code, I wouldn't worry about the performance of the alloca, it is quite similar to allocating an array on the stack. Using objdump -dS on _build/default/src/main.exe shows the C source code together with the disassembly (approximatively, due to optimization the code is rearranged a bit, and depending on OS and compiler version the exact code may be different a bit). You can see that the alloca is basically just sub %rax, %rsp (where %rax is computed based on val_max):

      if (Int_val(val_max) <= 0)
  52fc18:       85 c0                   test   %eax,%eax
  52fc1a:       0f 8e ed 00 00 00       jle    52fd0d <caml_polly_wait_fold+0x1dd>
                caml_invalid_argument(__FUNCTION__);
        events =
            (struct epoll_event *)alloca(Int_val(val_max) *
  52fc20:       48 98                   cltq
  52fc22:       48 8d 04 40             lea    (%rax,%rax,2),%rax
  52fc26:       48 8d 04 85 17 00 00    lea    0x17(,%rax,4),%rax
  52fc2d:       00 
  52fc2e:       48 83 e0 f0             and    $0xfffffffffffffff0,%rax
  52fc32:       48 29 c4                sub    %rax,%rsp
                                         sizeof(struct epoll_event));

        caml_enter_blocking_section();
  52fc35:       e8 66 60 02 00          call   555ca0 <caml_enter_blocking_section>
craff commented 1 year ago

Variables allocated on the stack are uninitialized, there is no need for a bzero there. You can look at the generated assembly code, I wouldn't worry about the performance of the alloca, it is quite similar to allocating an array on the stack. Using objdump -dS on _build/default/src/main.exe shows the C source code together with the disassembly (approximatively, due to optimization the code is rearranged a bit, and depending on OS and compiler version the exact code may be different a bit). You can see that the alloca is basically just sub %rax, %rsp (where %rax is computed based on val_max):

      if (Int_val(val_max) <= 0)
  52fc18:       85 c0                   test   %eax,%eax
  52fc1a:       0f 8e ed 00 00 00       jle    52fd0d <caml_polly_wait_fold+0x1dd>
                caml_invalid_argument(__FUNCTION__);
        events =
            (struct epoll_event *)alloca(Int_val(val_max) *
  52fc20:       48 98                   cltq
  52fc22:       48 8d 04 40             lea    (%rax,%rax,2),%rax
  52fc26:       48 8d 04 85 17 00 00    lea    0x17(,%rax,4),%rax
  52fc2d:       00 
  52fc2e:       48 83 e0 f0             and    $0xfffffffffffffff0,%rax
  52fc32:       48 29 c4                sub    %rax,%rsp
                                         sizeof(struct epoll_event));

        caml_enter_blocking_section();
  52fc35:       e8 66 60 02 00          call   555ca0 <caml_enter_blocking_section>

Thanks for the information. This PR is therefore clearly useless. I close it.