tomcucinotta / distwalk

Distributed processing emulation tool
GNU General Public License v3.0
1 stars 4 forks source link

threading model in dw_node #24

Closed tomcucinotta closed 10 months ago

tomcucinotta commented 11 months ago

Currently, the --per-client-thread command-line option does not do what it seems from its name: it spawns a predefined number of threads (it was equal to MAX_BUFFERS, but now it's specified independently as MAX_THREADS, since commit 6fcca0b9). These are used in round-robin fashion as the main thread keeps accept()ing new client connections.

This option should be renamed as smth. like --threads , where one can specify the desired parallelism level , up to a maximum of MAX_THREADS.

Second, the current implementation forces a context-switch at every accept(), increasing unneedlessly the first request latency. A better approach would be to have multiple threads:

The second one should be better, according to man 7 socket: SO_REUSEPORT [...] For TCP sockets, this option allows accept(2) load distribution in a multi-threaded server to be improved by using a distinct listener socket for each thread. This provides improved load distribution as compared to traditional techniques such using a single accept(2)ing thread that distributes connections, or having multiple threads that compete to accept(2) from the same socket. We could try to measure these differences, though....

Further note from man accept: There may not always be a connection waiting after a SIGIO is delivered or select(2), poll(2), or epoll(7) return a readability event because the connection might have been removed by an asynchronous network error or another thread before accept() is called. If this happens, then the call will block waiting for the next connection to arrive. To ensure that accept() never blocks, the passed socket sockfd needs to have the O_NONBLOCK flag set (see socket(7)).

tomcucinotta commented 11 months ago

Further useful pointers about the "thundering herd" in multi-threaded accept()/select()/epoll() based servers:

deRemo commented 10 months ago

Another interesting article on SO_REUSEPORT: https://lwn.net/Articles/542629/; according to the article, we can simply adopt the second option without much worrying about a comparison

tomcucinotta commented 10 months ago

another problem with the current code base, is that buf_alloc() is called always from the same thread, but the allocated memory buffers are later used from the worker threads. In the new solution, we should call buf_alloc() from the various worker threads, so to have some advantage in allocated memory locality on NUMA servers (ideally, with worker threads pinned to physical cores, yet another cmd-line option to add). (moving this part to a separate issue #25)

deRemo commented 10 months ago

(Note: commit f029e79 doesn't exist anymore due to typo in comment)

deRemo commented 10 months ago

another problem with the current code base, is that buf_alloc() is called always from the same thread, but the allocated memory buffers are later used from the worker threads. In the new solution, we should call buf_alloc() from the various worker threads, so to have some advantage in allocated memory locality on NUMA servers (ideally, with worker threads pinned to physical cores, yet another cmd-line option to add). (moving this part to a separate issue #25)

This problem should be addressed by default in 4d271aa, since I repurposed epoll_main_loop() itself as the thread body

tomcucinotta commented 10 months ago

nothing left here after merging PR #26, closing.