threading model in dw_node

tomcucinotta commented 11 months ago

Currently, the --per-client-thread command-line option does not do what it seems from its name: it spawns a predefined number of threads (it was equal to MAX_BUFFERS, but now it's specified independently as MAX_THREADS, since commit 6fcca0b9). These are used in round-robin fashion as the main thread keeps accept()ing new client connections.

This option should be renamed as smth. like --threads , where one can specify the desired parallelism level , up to a maximum of MAX_THREADS.

Second, the current implementation forces a context-switch at every accept(), increasing unneedlessly the first request latency. A better approach would be to have multiple threads:

either accept()ing directly on the same socket;
or binding their own socket to the same IP:port, using SO_REUSEPORT.

The second one should be better, according to man 7 socket: SO_REUSEPORT [...] For TCP sockets, this option allows accept(2) load distribution in a multi-threaded server to be improved by using a distinct listener socket for each thread. This provides improved load distribution as compared to traditional techniques such using a single accept(2)ing thread that distributes connections, or having multiple threads that compete to accept(2) from the same socket. We could try to measure these differences, though....

Further note from man accept: There may not always be a connection waiting after a SIGIO is delivered or select(2), poll(2), or epoll(7) return a readability event because the connection might have been removed by an asynchronous network error or another thread before accept() is called. If this happens, then the call will block waiting for the next connection to arrive. To ensure that accept() never blocks, the passed socket sockfd needs to have the O_NONBLOCK flag set (see socket(7)).

tomcucinotta commented 11 months ago

Further useful pointers about the "thundering herd" in multi-threaded accept()/select()/epoll() based servers:

EPOLLEXCLUSIVE in man epoll_ctl (relevant for multiple epollfd waiting on the same bound socket fd, ie, the first solution in bullet list above);
https://lwn.net/Articles/637435/
https://lwn.net/Articles/632590/
https://uwsgi-docs.readthedocs.io/en/latest/articles/SerializingAccept.html
"accept() scalability on Linux," Univ. Michigan: http://www.citi.umich.edu/projects/linux-scalability/reports/accept.html

deRemo commented 10 months ago

Another interesting article on SO_REUSEPORT: https://lwn.net/Articles/542629/; according to the article, we can simply adopt the second option without much worrying about a comparison

tomcucinotta commented 10 months ago

another problem with the current code base, is that buf_alloc() is called always from the same thread, but the allocated memory buffers are later used from the worker threads. In the new solution, we should call buf_alloc() from the various worker threads, so to have some advantage in allocated memory locality on NUMA servers (ideally, with worker threads pinned to physical cores, yet another cmd-line option to add). (moving this part to a separate issue #25)

deRemo commented 10 months ago

(Note: commit f029e79 doesn't exist anymore due to typo in comment)

deRemo commented 10 months ago

another problem with the current code base, is that buf_alloc() is called always from the same thread, but the allocated memory buffers are later used from the worker threads. In the new solution, we should call buf_alloc() from the various worker threads, so to have some advantage in allocated memory locality on NUMA servers (ideally, with worker threads pinned to physical cores, yet another cmd-line option to add). (moving this part to a separate issue #25)

This problem should be addressed by default in 4d271aa, since I repurposed epoll_main_loop() itself as the thread body

tomcucinotta commented 10 months ago

nothing left here after merging PR #26, closing.

tomcucinotta / distwalk

threading model in dw_node #24