numaproj / numaflow-python

Numaflow Python SDK
Apache License 2.0

Multiprocess server does not load-balance correctly #111

Closed ab93 closed 10 months ago

ab93 commented 1 year ago

Describe the bug

When running the multiprocess mapper, the full load is sometimes given to only one of the forked processes, while the others sit completely idle at 0% CPU utilization. The behavior is intermittent: at other times the load is distributed evenly, as expected. The problem is more visible when the number of processes is greater than 2.

To Reproduce

Steps to reproduce the behavior:

  1. Run the multiproc mapper example
  2. Investigate the CPU utilization of each of the forked processes
  3. Repeat multiple times to see inconsistent behavior

Expected behavior

The spawned processes should see roughly equal CPU utilization.


vigith commented 1 year ago

@kohlisid please triage

kohlisid commented 10 months ago

Through debugging, I was able to narrow the issue down to the network level: the clients were not establishing connections with all of the backend servers. Which server a connection lands on depends on the kernel-level implementation of REUSE_PORT (the `SO_REUSEPORT` socket option), which picks the server for each connection.
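The kernel-level assignment described above can be observed in miniature. The following is a minimal sketch, not the SDK's code: it assumes Linux, where the `SO_REUSEPORT` socket option lets several sockets listen on one port, and it keeps all listeners in a single process for simplicity instead of forking. The helper names `reuseport_listeners` and `count_accepts` are hypothetical.

```python
import selectors
import socket
from collections import Counter


def reuseport_listeners(n, host="127.0.0.1"):
    """Open n listening sockets that all share one port via SO_REUSEPORT."""
    listeners = []
    port = 0  # 0 lets the OS pick a free port for the first socket
    for _ in range(n):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind((host, port))
        s.listen()
        port = s.getsockname()[1]  # later sockets reuse the same port
        listeners.append(s)
    return port, listeners


def count_accepts(n_listeners=4, n_clients=4):
    """Connect n_clients times and count which listener got each connection."""
    port, listeners = reuseport_listeners(n_listeners)
    clients = [socket.create_connection(("127.0.0.1", port)) for _ in range(n_clients)]

    sel = selectors.DefaultSelector()
    for idx, s in enumerate(listeners):
        s.setblocking(False)
        sel.register(s, selectors.EVENT_READ, data=idx)

    counts = Counter()
    accepted = []
    while sum(counts.values()) < n_clients:
        events = sel.select(timeout=2.0)
        if not events:
            break  # nothing left to accept
        for key, _ in events:
            try:
                conn, _addr = key.fileobj.accept()
            except BlockingIOError:
                continue
            accepted.append(conn)
            counts[key.data] += 1

    for sock in accepted + clients + listeners:
        sock.close()
    sel.close()
    return counts


if __name__ == "__main__":
    # The kernel decides the split; some listeners may receive no connection.
    print(dict(count_accepts(4, 4)))
```

Running this several times typically gives different splits, and nothing in the client controls which listener is chosen.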

tcp6       0      0 :::55551                :::*                    LISTEN      14/python           
tcp6       0      0 :::55551                :::*                    LISTEN      16/python           
tcp6       0      0 :::55551                :::*                    LISTEN      15/python           
tcp6       0      0 :::55551                :::*                    LISTEN      17/python                            
tcp6       0      0 127.0.0.1:55551         127.0.0.1:50666         ESTABLISHED 14/python                            
tcp6       0      0 127.0.0.1:55551         127.0.0.1:50664         ESTABLISHED 16/python           
tcp6       0      0 127.0.0.1:55551         127.0.0.1:50688         ESTABLISHED 17/python           
tcp6       0      0 127.0.0.1:55551         127.0.0.1:50682         ESTABLISHED 17/python

As can be seen here, even though 4 servers are listening, connections are established with only 3 of them, leaving one server idle and reducing performance. To fix this, I tweaked the connection setup and tried two approaches:

  1. Increase the client connection pool to establish more connections to the servers, which increases the chance of connecting to all of them. Here, trying 8 connections for 4 servers:

tcp6       0      0 127.0.0.1:55551         127.0.0.1:42458         ESTABLISHED 14/python           
tcp6      30      0 127.0.0.1:55551         127.0.0.1:42430         ESTABLISHED 17/python           
tcp6       0      0 127.0.0.1:55551         127.0.0.1:42436         ESTABLISHED 14/python           
tcp6       0      0 127.0.0.1:55551         127.0.0.1:42422         ESTABLISHED 14/python           
tcp6       0      0 127.0.0.1:55551         127.0.0.1:42442         ESTABLISHED 16/python           
tcp6       0     17 127.0.0.1:55551         127.0.0.1:42396         ESTABLISHED 15/python           
tcp6       0      0 127.0.0.1:55551         127.0.0.1:42394         ESTABLISHED 17/python           
tcp6       0      0 127.0.0.1:55551         127.0.0.1:42398         ESTABLISHED 15/python  

Even though all servers are used here, the number of connections established with each server varies: processes 14, 15, 16, and 17 hold 3, 2, 1, and 2 connections respectively.
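Why a larger pool helps but gives no guarantee can be estimated with a small calculation. The sketch below assumes, as a simplification, that each connection lands on one of the servers independently and uniformly at random; the real kernel assignment is not guaranteed to behave this way, so the numbers are only indicative. `prob_all_servers_used` is a hypothetical helper name.

```python
from math import comb


def prob_all_servers_used(n_servers: int, n_connections: int) -> float:
    """Probability that every server receives at least one connection.

    Assumes each connection picks a server independently and uniformly
    at random, and applies inclusion-exclusion over the set of servers
    that receive no connection.
    """
    return sum(
        (-1) ** i * comb(n_servers, i) * ((n_servers - i) / n_servers) ** n_connections
        for i in range(n_servers + 1)
    )


if __name__ == "__main__":
    for k in (4, 8, 16):
        print(f"{k} connections, 4 servers: {prob_all_servers_used(4, k):.3f}")
```

Under that assumption, 4 connections reach all 4 servers only about 9% of the time and 8 connections about 62% of the time, so incomplete or uneven coverage remains likely even with a larger pool.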

  2. Use 1:1 connections with the servers, one port per server:
    tcp6       0      0 :::55551                :::*                    LISTEN      14/python                     
    tcp6       0      0 :::55554                :::*                    LISTEN      17/python                    
    tcp6       0      0 :::55552                :::*                    LISTEN      15/python           
    tcp6       0      0 :::55553                :::*                    LISTEN      16/python                            
    tcp6       0      0 127.0.0.1:55553         127.0.0.1:58092         ESTABLISHED 16/python           
    tcp6       0      0 127.0.0.1:55552         127.0.0.1:39648         ESTABLISHED 15/python           
    tcp6       0      0 127.0.0.1:55554         127.0.0.1:38862         ESTABLISHED 17/python           
    tcp6       0      0 127.0.0.1:55551         127.0.0.1:55714         ESTABLISHED 14/python   

    This appears to distribute the connections equally, as expected.
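The one-port-per-server layout can be sketched with plain sockets. This is an illustration only, not the SDK's implementation: it runs everything in one process, lets the OS pick free ports instead of the fixed 55551 to 55554 shown above, and uses the hypothetical helper names `start_listeners` and `connect_one_per_port`.

```python
import socket


def start_listeners(n, host="127.0.0.1"):
    """Start n listening sockets, each bound to its own port."""
    listeners = []
    for _ in range(n):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind((host, 0))  # port 0: the OS picks a free port
        s.listen()
        s.settimeout(2.0)  # avoid blocking forever in accept()
        listeners.append(s)
    return listeners


def connect_one_per_port(listeners):
    """Open exactly one client connection to each listener's port."""
    accepted_per_port = {}
    clients = []
    for s in listeners:
        host, port = s.getsockname()
        clients.append(socket.create_connection((host, port)))
        conn, _addr = s.accept()  # only this listener can receive it
        conn.close()
        accepted_per_port[port] = accepted_per_port.get(port, 0) + 1
    for c in clients:
        c.close()
    return accepted_per_port


if __name__ == "__main__":
    servers = start_listeners(4)
    print(connect_one_per_port(servers))
    for s in servers:
        s.close()
```

Because each port has a single listener, the client decides which server receives each connection, and the kernel has no choice to make between servers.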