Closed craff closed 1 year ago
The allocation using alloca
on the C stack is basically cost free and cheaper than any other form of allocation either using malloc or using the OCaml heap because the latter requires garbage collection. Allocation with alloca
is just a pointer operation and memory is freed as soon as possible. The fact that this buffer is allocated at every invocation of epoll_wait
and does not survive the call is irrelevant. I don't believe that pre-allocating a buffer which is GC'ed later is cheaper or faster. So this would require a benchmark to convince me otherwise.
Maybe epoll_wait
and epoll_wait_fold
could be unified? The latter is the more general case.
Note that if you want the highest performance you should look at using uring
(https://github.com/ocaml-multicore/ocaml-uring), which works with OCaml 4.x too (with 5.x you get a higher level and more convenient interface to it with Eio). It is a much lower level interface though.
polly
is useful in situations where you do not have uring
: an older kernel, when you are not root (in case 'uring' has been restricted to root only), or if uring has been compiled out of the kernel. It is also useful when you want to debug your program using strace
(uring
avoids making any syscalls in some modes, so it is harder to debug).
Also polly
doesn't allocate memory on the OCaml side, and requires no work from the GC, whereas 'uring' does perform small allocations in a few places.
I would expect the performance of most functions in polly
to be dominated by the syscall time (which is quite slow, especially with all the security mitigations in place).
I have some benchmark code for a small HTTP load generator built using polly, but I don't have benchmark code for polly itself. IIRC 'polly' didn't show up in my performance profiles in a significant way, so I haven't looked at optimizing it, but I don't have concrete numbers at hand. When I have some time I can try to rerun the benchmarks and see how much time 'Polly.wait_fold' and 'Polly.upd' takes up.
It is still scanned by the minor GC? I was going to benchmark, but I forgot the power outlet of my laptop.
Le 6 août 2023 08:12:19 GMT-10:00, Christian Lindig @.***> a écrit :
The allocation using
alloca
on the C stack is basically cost free and cheaper than any other form of allocation either using malloc or using the OCaml heap because the latter requires garbage collection. Allocation withalloca
is just a pointer operation and memory is freed as soon as possible. The fact that this buffer is allocated at every invocation ofepoll_wait
and does not survive the call is irrelevant. I don't believe that pre-allocating a buffer which is GC'ed later is cheaper or faster. So this would require a benchmark to convince me otherwise.-- Reply to this email directly or view it on GitHub: https://github.com/lindig/polly/pull/11#issuecomment-1666936176 You are receiving this because you authored the thread.
Message ID: @.***>
Maybe my understanding is wrong (@edwintorok please correct me), but alloca
reserves space on the C stack by increasing the current stack frame. It is invisible to the OCaml GC and is automatically de-allocated when the C function returns. This is why it is so cheap. The only potential problem are very large numbers of val_max
because it could overflow the stack. But the structure we are allocating per instance is small: about two 64-bit words.
Isn't there a bzero at stack allocation or deallocation?
Le 6 août 2023 08:12:19 GMT-10:00, Christian Lindig @.***> a écrit :
The allocation using
alloca
on the C stack is basically cost free and cheaper than any other form of allocation either using malloc or using the OCaml heap because the latter requires garbage collection. Allocation withalloca
is just a pointer operation and memory is freed as soon as possible. The fact that this buffer is allocated at every invocation ofepoll_wait
and does not survive the call is irrelevant. I don't believe that pre-allocating a buffer which is GC'ed later is cheaper or faster. So this would require a benchmark to convince me otherwise.-- Reply to this email directly or view it on GitHub: https://github.com/lindig/polly/pull/11#issuecomment-1666936176 You are receiving this because you authored the thread.
Message ID: @.***>
Variables allocated on the stack are uninitialized, there is no need for a bzero
there. You can look at the generated assembly code, I wouldn't worry about the performance of the alloca
, it is quite similar to allocating an array on the stack.
Using objdump -dS
on _build/default/src/main.exe
shows the C source code together with the disassembly (approximatively, due to optimization the code is rearranged a bit, and depending on OS and compiler version the exact code may be different a bit). You can see that the alloca
is basically just sub %rax, %rsp
(where %rax
is computed based on val_max
):
if (Int_val(val_max) <= 0)
52fc18: 85 c0 test %eax,%eax
52fc1a: 0f 8e ed 00 00 00 jle 52fd0d <caml_polly_wait_fold+0x1dd>
caml_invalid_argument(__FUNCTION__);
events =
(struct epoll_event *)alloca(Int_val(val_max) *
52fc20: 48 98 cltq
52fc22: 48 8d 04 40 lea (%rax,%rax,2),%rax
52fc26: 48 8d 04 85 17 00 00 lea 0x17(,%rax,4),%rax
52fc2d: 00
52fc2e: 48 83 e0 f0 and $0xfffffffffffffff0,%rax
52fc32: 48 29 c4 sub %rax,%rsp
sizeof(struct epoll_event));
caml_enter_blocking_section();
52fc35: e8 66 60 02 00 call 555ca0 <caml_enter_blocking_section>
Variables allocated on the stack are uninitialized, there is no need for a
bzero
there. You can look at the generated assembly code, I wouldn't worry about the performance of thealloca
, it is quite similar to allocating an array on the stack. Usingobjdump -dS
on_build/default/src/main.exe
shows the C source code together with the disassembly (approximatively, due to optimization the code is rearranged a bit, and depending on OS and compiler version the exact code may be different a bit). You can see that thealloca
is basically justsub %rax, %rsp
(where%rax
is computed based onval_max
):if (Int_val(val_max) <= 0) 52fc18: 85 c0 test %eax,%eax 52fc1a: 0f 8e ed 00 00 00 jle 52fd0d <caml_polly_wait_fold+0x1dd> caml_invalid_argument(__FUNCTION__); events = (struct epoll_event *)alloca(Int_val(val_max) * 52fc20: 48 98 cltq 52fc22: 48 8d 04 40 lea (%rax,%rax,2),%rax 52fc26: 48 8d 04 85 17 00 00 lea 0x17(,%rax,4),%rax 52fc2d: 00 52fc2e: 48 83 e0 f0 and $0xfffffffffffffff0,%rax 52fc32: 48 29 c4 sub %rax,%rsp sizeof(struct epoll_event)); caml_enter_blocking_section(); 52fc35: e8 66 60 02 00 call 555ca0 <caml_enter_blocking_section>
Thanks for the information. This PR is therefore clearly useless. I close it.
Introduces EventBuffer to avoid allocating at each wait/wait_fold.
Warning: This PR is based on constant3 (PR#10) and not master, because I added one line in src/constants.ml
Questions:
EventBuffer.t = int * Bytes.t
or justBytes.t
and to a division to compute val_max ?