Closed: ab93 closed this issue 10 months ago
@kohlisid please triage
Through debugging I was able to narrow the issue down to the network level: the clients were not establishing connections with all of the servers in the backend. Which server receives a given connection depends on the kernel's SO_REUSEPORT implementation, which hashes each incoming connection to one of the listening sockets.
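For context, this is roughly what the shared-port setup looks like. This is a sketch with plain sockets and a helper name of my own, not the project's actual server code: each socket opts into SO_REUSEPORT before binding, and on Linux the kernel then picks which listener gets each new connection.

```python
import socket

def reuseport_listener(port=0):
    """Create a TCP listener that can share its port with other sockets.

    SO_REUSEPORT must be set before bind(); port=0 asks the kernel for
    an ephemeral port (the issue uses a fixed port, 55551).
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen()
    return s

# The second bind to the same address succeeds only because both
# sockets opted into SO_REUSEPORT; normally it would raise EADDRINUSE.
a = reuseport_listener()
b = reuseport_listener(a.getsockname()[1])
```

With one such socket per worker process, all workers show up as LISTEN on the same port, as in the netstat output below.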
```
tcp6 0 0 :::55551 :::* LISTEN 14/python
tcp6 0 0 :::55551 :::* LISTEN 16/python
tcp6 0 0 :::55551 :::* LISTEN 15/python
tcp6 0 0 :::55551 :::* LISTEN 17/python
tcp6 0 0 127.0.0.1:55551 127.0.0.1:50666 ESTABLISHED 14/python
tcp6 0 0 127.0.0.1:55551 127.0.0.1:50664 ESTABLISHED 16/python
tcp6 0 0 127.0.0.1:55551 127.0.0.1:50688 ESTABLISHED 17/python
tcp6 0 0 127.0.0.1:55551 127.0.0.1:50682 ESTABLISHED 17/python
```
As can be seen here, even though there are 4 servers listening, connections are established with only 3 of them, leaving one process idle and reducing performance. To fix this, I tweaked the connection handling and tried two approaches. The first: increase the client connection pool to establish more connections to the servers, which raises the chance of reaching all of them. Here I am trying 8 connections for 4 servers:
```
tcp6 0 0 127.0.0.1:55551 127.0.0.1:42458 ESTABLISHED 14/python
tcp6 30 0 127.0.0.1:55551 127.0.0.1:42430 ESTABLISHED 17/python
tcp6 0 0 127.0.0.1:55551 127.0.0.1:42436 ESTABLISHED 14/python
tcp6 0 0 127.0.0.1:55551 127.0.0.1:42422 ESTABLISHED 14/python
tcp6 0 0 127.0.0.1:55551 127.0.0.1:42442 ESTABLISHED 16/python
tcp6 0 17 127.0.0.1:55551 127.0.0.1:42396 ESTABLISHED 15/python
tcp6 0 0 127.0.0.1:55551 127.0.0.1:42394 ESTABLISHED 17/python
tcp6 0 0 127.0.0.1:55551 127.0.0.1:42398 ESTABLISHED 15/python
```
Even though all servers are used here, the number of connections established with each server might vary.
The second approach: bind each server process to its own port and have the client connect to every port explicitly.

```
tcp6 0 0 :::55551 :::* LISTEN 14/python
tcp6 0 0 :::55554 :::* LISTEN 17/python
tcp6 0 0 :::55552 :::* LISTEN 15/python
tcp6 0 0 :::55553 :::* LISTEN 16/python
tcp6 0 0 127.0.0.1:55553 127.0.0.1:58092 ESTABLISHED 16/python
tcp6 0 0 127.0.0.1:55552 127.0.0.1:39648 ESTABLISHED 15/python
tcp6 0 0 127.0.0.1:55554 127.0.0.1:38862 ESTABLISHED 17/python
tcp6 0 0 127.0.0.1:55551 127.0.0.1:55714 ESTABLISHED 14/python
```
This distributes the connections evenly, as expected.
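A minimal sketch of this per-port scheme (ephemeral ports here instead of the fixed 55551-55554, and plain sockets instead of the project's gRPC stack): since the client knows every port, it can spread connections itself instead of relying on the kernel's hash, so the distribution is deterministic.

```python
import socket

# One listener per worker, each on its own port, as in the last netstat
# dump; SO_REUSEPORT is no longer needed because nothing shares a port.
listeners = []
for _ in range(4):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))  # distinct ephemeral port per listener
    s.listen()
    listeners.append(s)
ports = [s.getsockname()[1] for s in listeners]

# The client distributes connections itself, one per known port,
# so every server is guaranteed to receive exactly the same number.
clients = [socket.create_connection(("127.0.0.1", p)) for p in ports]

served = 0
for lst in listeners:
    lst.settimeout(0.5)
    conn, _ = lst.accept()  # exactly one connection queued per listener
    conn.close()
    served += 1
print(served)  # -> 4: every worker received a connection
```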
**Describe the bug**
When running the multiprocess mapper, sometimes the full load is given to only one of the forked processes while the others sit completely idle at 0% CPU utilization. The behavior is intermittent: sometimes the load distribution is as even as expected. The problem is more visible when the number of processes is greater than 2.
**To Reproduce**
Steps to reproduce the behavior:
**Expected behavior**
The spawned processes should show roughly equal CPU utilization.
**Environment (please complete the following information):**
**Message from the maintainers:**
Impacted by this bug? Give it a 👍. We often sort issues this way to know what to prioritize.