python / cpython

The Python programming language
https://www.python.org

POSIX multiprocessing spawn performance becomes 10x slower from a certain pickle size #96953

Open jhoekx opened 2 years ago

jhoekx commented 2 years ago

Bug report

We are using multiprocessing with the spawn start method. On my 32-thread PC, starting all worker processes for my project used to take 2 seconds. At a certain point it suddenly jumped to 20 seconds.

The slowdown appears as soon as more than 64 KB needs to be sent to a child process over the pipe.

Consider this minimal reproduction case:

#!/usr/bin/python3

import multiprocessing
import random
import sys
import time

class Container:

    def __init__(self, size):
        self.data = random.randbytes(size)

class ChildProcess(multiprocessing.Process):

    def __init__(self, name: str, container):
        super().__init__(name=name)
        self.container = container

    def run(self) -> None:
        print("Running")

def run():
    # 885 bytes is the pickle overhead of the process object on Python 3.9,
    # so the total dump is exactly 64 KiB when the argument is 0.
    fixed_overhead_3_9 = 885
    difference = int(sys.argv[1])
    container = Container((64*1024) - fixed_overhead_3_9 + difference)
    children = [ChildProcess(f"child-{i}", container) for i in range(0, 2)]

    start_time = time.perf_counter()
    for child in children:
        child.start()
    end_time = time.perf_counter()

    print(f"Running took {int((end_time - start_time) * 1000)}ms")

    for child in children:
        child.join()

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")
    run()

I added some "instrumentation" in multiprocessing/popen_spawn_posix.py to print the buffer size:

        try:
            reduction.dump(prep_data, fp)
            reduction.dump(process_obj, fp)
        finally:
            set_spawning_popen(None)
        print(len(fp.getbuffer()))
        parent_r = child_w = child_r = parent_w = None

Running the example results in:

$ python3.9 mp_pipe_limits.py 0
65536
65536
Running took 9ms
Running
Running
$ python3.9 mp_pipe_limits.py 1
65537
65537
Running
Running took 96ms
Running
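The ~90 ms jump is presumably the parent's pipe write blocking until the child interpreter has started and begun reading: as long as the pickled data fits in the pipe buffer, child.start() returns immediately. A minimal sketch, independent of multiprocessing and assuming the Linux default capacity of 64 KiB, that shows the same one-byte cliff on a plain os.pipe():

import os
import threading
import time

def timed_write(payload: bytes) -> float:
    r, w = os.pipe()

    def slow_reader():
        time.sleep(0.1)  # stand-in for the child interpreter starting up
        while os.read(r, 65536):
            pass
        os.close(r)

    reader = threading.Thread(target=slow_reader)
    reader.start()

    start = time.perf_counter()
    os.write(w, payload)  # blocks as soon as the pipe buffer is full
    elapsed = time.perf_counter() - start

    os.close(w)
    reader.join()
    return elapsed

print(f"64 KiB    : {timed_write(b'x' * (64 * 1024)) * 1000:.0f}ms")
print(f"64 KiB + 1: {timed_write(b'x' * (64 * 1024 + 1)) * 1000:.0f}ms")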

Changing the pipe size with fcntl in multiprocessing/popen_spawn_posix.py restores performance:

        parent_r = child_w = child_r = parent_w = None
        try:
            parent_r, child_w = os.pipe()
            child_r, parent_w = os.pipe()
            fcntl.fcntl(parent_w, 1031, 100000)
            cmd = spawn.get_command_line(tracker_fd=tracker_fd,
                                         pipe_handle=child_r)

Here 1031 is F_SETPIPE_SZ, a constant that the fcntl module does not yet expose in Python 3.9 (it was added in 3.10).
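For completeness, the default capacity can be read back with F_GETPIPE_SZ (hard-coded below for the same reason); on a default Linux configuration this should print 65536, matching the threshold above:

import fcntl
import os

F_GETPIPE_SZ = 1032  # fcntl.F_GETPIPE_SZ on Python 3.10+, Linux-specific

r, w = os.pipe()
print(fcntl.fcntl(w, F_GETPIPE_SZ))  # default pipe capacity, typically 65536
os.close(r)
os.close(w)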

Rerunning the reproduction case after this change:

$ python3.9 mp_pipe_limits.py 1
65537
65537
Running took 9ms
Running
Running

Of course, changing the pipe size only delays the onset of the problem; the real solution (if there is one) will probably look different. Blindly setting a larger pipe size may also not be safe, since the maximum size an unprivileged process can request is capped by /proc/sys/fs/pipe-max-size.
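A sketch of what a more careful version of the change above might look like, clamping the requested size to the limit the kernel advertises (grow_pipe and the fallback value are illustrative, not existing CPython code):

import fcntl

F_SETPIPE_SZ = 1031  # fcntl.F_SETPIPE_SZ on Python 3.10+, Linux-specific

def grow_pipe(fd: int, wanted: int) -> int:
    """Best-effort attempt to grow a pipe; returns the size actually set."""
    try:
        with open("/proc/sys/fs/pipe-max-size") as f:
            limit = int(f.read())
    except (OSError, ValueError):
        limit = 64 * 1024  # assume the usual default capacity
    return fcntl.fcntl(fd, F_SETPIPE_SZ, min(wanted, limit))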

The example above is a best case, since it has very little pickle overhead. We hit this limit without any data caches involved; it is just the Python objects that exist after application initialization, and they are slower to pickle. Even then, startup is still roughly 10x slower, so the cost is not just the fixed ~80 ms seen in the example.

We use spawn instead of fork on Linux so that problems with objects that cannot be pickled show up there as well, instead of only on other OSs (Windows).

Your environment

gpshead commented 1 year ago

From an outside, black-box point of view, this issue smells like the round trips between processes, needed once more than one read or write call is required on the pipe, are taking a surprising amount of time.

While they could have different causes, it wouldn't surprise me if #98493 is found to be related.