Closed qijiax closed 1 year ago
Yes, we are aware that heavy multi-core scenarios that use a lot of futexes currently have some perf bottlenecks around accessing the shared AVL tree in the futex wake/wait code (basically high lock contention). I started working on it some time ago, but it required some time and had low priority, so the work stalled.
I hit a similar issue today.
Running the latest version of Gramine on CentOS 8.5, I ran MySQL with gramine-sgx (bound to one NUMA node with numactl) and used sysbench oltp_read_write as the workload. gramine-sgx mysqld achieved only about 30-40% of bare-metal performance.
Here is the flame graph; you can see the hot spots are the mutexes taken in malloc and free.
Thanks @lianghouxu for the flame graph!
We are aware that libos_syscall_ppoll() (the Gramine emulation for the poll family of syscalls) is sub-par. It needs a re-write. But also, it is clear that malloc() and free() in Gramine require a global slab-allocator lock, and this is where the overhead stems from.
I met the same problem in MySQL. I wonder if we can use C99 variable length array to allocate these short-lived buffers on the stack instead? E.g. https://github.com/gramineproject/gramine/blob/72f65247af61c55c54033c3c6934ed21b607987b/libos/src/sys/libos_poll.c#L51 can be rewritten as:
```c
long pals_buf[DIV_ROUND_UP(nfds * sizeof(PAL_HANDLE), sizeof(long))];
PAL_HANDLE* pals = (void*)&pals_buf[0];
```
@lejunzhu nfds can be arbitrarily large.
What we could do is to have logic like this:
```c
PAL_HANDLE* pals;
struct fds_mapping_t* fds_mapping;
pal_wait_flags_t* pal_events;
bool allocated_on_stack;

if (nfds < SOME_MAX_LIMIT) {
    /* new logic: small requests live on the stack */
    pals        = alloca(nfds * sizeof(PAL_HANDLE));
    fds_mapping = alloca(nfds * sizeof(struct fds_mapping_t));
    pal_events  = alloca(nfds * sizeof(*pal_events) * 2);
    allocated_on_stack = true;
} else {
    /* old logic: fall back to the heap */
    pals        = malloc(nfds * sizeof(PAL_HANDLE));
    fds_mapping = malloc(nfds * sizeof(struct fds_mapping_t));
    pal_events  = malloc(nfds * sizeof(*pal_events) * 2);
    allocated_on_stack = false;
}

/* ... do stuff ... */

if (!allocated_on_stack) {
    free(pals);
    free(fds_mapping);
    free(pal_events);
}
```
The SOME_MAX_LIMIT macro should be chosen such that we don't consume more than e.g. 64KB-128KB of stack (I wonder how many nfds it corresponds to; hopefully more than 10).
But also, if @lejunzhu wants to implement this, could you just re-structure that whole _libos_syscall_poll() function? It looks pretty ugly and complicated, and it can surely be refactored without changing the logic. This will be at least one step in the right direction for Gramine's poll emulation.
Could we give this to Borys instead? He did rewrites to other similar functions, so I think it makes more sense if he implemented this change.
@mkow I'd be happy with that, but isn't Borys already tied up with other tasks?
@lejunzhu nfds can be arbitrarily large.
You are right, I forgot this.
The SOME_MAX_LIMIT macro should be chosen such that we don't consume more than e.g. 64KB-128KB of stack (I wonder how many nfds it corresponds to; hopefully more than 10).
These structures are very small, so I would say the number is way larger than 10.
sizeof(PAL_HANDLE): 8
sizeof(struct fds_mapping_t): 16
sizeof(pal_wait_flags_t): 4
sizeof(struct pollfd): 8
So, I think even using 1K stack space for each malloc will help.
This issue is not very urgent. If a poll() rewrite is already underway, I can wait for that. Otherwise, I can try to patch only the malloc() part along the way (libos_syscall_select -> _libos_syscall_poll -> _PalStreamsWaitEvents). But I'm afraid I don't know this part well enough for any substantial change.
So, I think even using 1K stack space for each malloc will help.
Yes, this will be good! From my experience, non-network-bound applications typically use about 5-20 FDs in poll, no more than that.
This issue is not very urgent. If a poll() rewrite is already underway, I can wait for that.
No, the poll() rewrite has not been started yet. In fact, this task currently has a very low priority. I don't know if @boryspoplawski has anything else to say here.
1K sounds way too much for a stack buffer
1K sounds way too much for a stack buffer
Do you mean that a lower threshold, e.g. 320 bytes per buffer, is acceptable, or would this problem be better addressed by a poll rewrite?
For 20 FDs, we'd need four 160-byte buffers and one 320-byte buffer on the stack.
I've created PR #1051 to reduce the lock contention in poll. @dimakuv and @boryspoplawski, would you please take a look at it?
@lianghouxu, would you please test this PR as well, to see if it improves MySQL performance in your environment?
@lejunzhu I tested PR #1051, and the performance looks good. Before applying the patch, with 120 connections, MySQL in gramine-sgx achieved about 30% of bare metal. After applying the patch, with 120 connections, it achieved about 50% of bare metal.
Here is the flame graph; the malloc and free hotspots are gone.
From the flame graph, the new bottleneck seems to be get_fd_handle(). It is also reported in #853.
I'm closing this issue now as it became a duplicate of #853
Description of the problem
We used HammerDB to benchmark the performance of MariaDB in Gramine. We tested on a 2-socket platform with 56 physical cores (112 logical cores) per socket. The number of HammerDB virtual users (VU) was set to 16. The Gramine-MariaDB service was bound to one socket. With no limitation on cores, the HammerDB result is 270k NOPM, with CPU utilization at 1600%. If we limit the cores to 8 physical cores / 16 logical cores (8C16T), the result is only 70k NOPM, which is much worse; CPU utilization still approaches 1600%. As a comparison, MariaDB without Gramine achieves 360k NOPM with no core limitation and 230k NOPM with 8C16T.
I recorded perf data; it shows that libos_syscall_futex consumed a lot of CPU in the 8C16T scenario.
Steps to reproduce
The configuration is the same as in https://github.com/gramineproject/gramine/issues/853
Gramine is on this branch: https://github.com/gramineproject/gramine/tree/borys/test_rm_pal_iov
The MariaDB service is bound to socket 0:
HammerDB config: Warehouse: 200, VU: 16, with SSL connection
Expected results
8C16T should achieve about 75% of the performance with no core limitation.