python / cpython

The Python programming language
https://www.python.org

Possible race condition on multiprocessing.Manager().dict() on macOS #87934

Closed · 7e477b84-0ada-4db8-9e26-70bff58e8287 closed this issue 10 months ago

7e477b84-0ada-4db8-9e26-70bff58e8287 commented 3 years ago
BPO 43768
Nosy @ronaldoussoren, @pitrou, @ned-deily, @applio

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['OS-mac', 'type-bug', '3.9']
title = 'Possible race condition on multiprocessing.Manager().dict() on macOS'
updated_at =
user = 'https://bugs.python.org/jerryc05'
```
bugs.python.org fields:
```python
activity =
actor = 'ned.deily'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['macOS']
creation =
creator = 'jerryc05'
dependencies = []
files = []
hgrepos = []
issue_num = 43768
keywords = []
message_count = 1.0
messages = ['390468']
nosy_count = 5.0
nosy_names = ['ronaldoussoren', 'pitrou', 'ned.deily', 'davin', 'jerryc05']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue43768'
versions = ['Python 3.9']
```

7e477b84-0ada-4db8-9e26-70bff58e8287 commented 3 years ago

I am not sure whether this is a bug or expected behavior.

Long story short: I tried to print the contents of a multiprocessing.Manager().dict() in the main process, but I got a strange error.

I encounter this error only when the pool is rather large (more than about 20 worker processes), and only on macOS (the same script works fine on Linux).

Specs:

- CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
- macOS: 11.2.3

A minimal reproducer is attached:

#!/usr/bin/env python3

from contextlib import suppress
import multiprocessing as mp
import time

def run():
    D[mp.current_process().name] = 'some val'
    time.sleep(0.5)

if __name__ == '__main__':
    mp.set_start_method('fork')
    D, rets = mp.Manager().dict(), []
    with mp.Pool(25) as p:
        for _ in range(33):
            rets.append(p.apply_async(run))
        while rets:
            for ret in rets[:]:
                with suppress(mp.TimeoutError):
                    ret.get(timeout=0)
                    rets.remove(ret)
                    print(len(D))

Error:

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.9/3.9.2_4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/managers.py", line 801, in _callmethod
    conn = self._tls.connection
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.9/3.9.2_4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/???", line 9, in run
    D[mp.current_process().name] = 'some val'
  File "<string>", line 2, in __setitem__
  File "/usr/local/Cellar/python@3.9/3.9.2_4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/managers.py", line 805, in _callmethod
    self._connect()
  File "/usr/local/Cellar/python@3.9/3.9.2_4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/managers.py", line 792, in _connect
    conn = self._Client(self._token.address, authkey=self._authkey)
  File "/usr/local/Cellar/python@3.9/3.9.2_4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/connection.py", line 507, in Client
    c = SocketClient(address)
  File "/usr/local/Cellar/python@3.9/3.9.2_4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/connection.py", line 635, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 61] Connection refused
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/???", line 22, in <module>
    ret.get(timeout=0)
  File "/usr/local/Cellar/python@3.9/3.9.2_4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
ConnectionRefusedError: [Errno 61] Connection refused
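
A user-level way to sidestep the error, not part of the original report and only papering over the underlying race discussed below, is to retry the failing proxy operation: each retry re-attempts the proxy's connection to the manager. Below is a sketch of the reproducer with such a retry wrapper; the name retry_set and the retry count and delay are illustrative, not values taken from this issue.

```python
#!/usr/bin/env python3
# Variant of the reproducer above with a retry wrapper around the proxy write.
# This is a workaround sketch, not a fix for the race itself.

from contextlib import suppress
import multiprocessing as mp
import time

RETRIES = 5          # illustrative values, not taken from the issue
RETRY_DELAY = 0.2


def retry_set(key, value):
    # Each proxy operation opens a socket connection to the manager process.
    # On macOS that connect can fail transiently with ConnectionRefusedError
    # (Errno 61), so retry a few times before giving up.
    for attempt in range(RETRIES):
        try:
            D[key] = value
            return
        except ConnectionRefusedError:
            if attempt == RETRIES - 1:
                raise
            time.sleep(RETRY_DELAY)


def run():
    retry_set(mp.current_process().name, 'some val')
    time.sleep(0.5)


if __name__ == '__main__':
    mp.set_start_method('fork')
    D, rets = mp.Manager().dict(), []
    with mp.Pool(25) as p:
        for _ in range(33):
            rets.append(p.apply_async(run))
        while rets:
            for ret in rets[:]:
                with suppress(mp.TimeoutError):
                    ret.get(timeout=0)
                    rets.remove(ret)
                    print(len(D))
```
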
ronaldoussoren commented 2 years ago

I tested the script on my machine (macOS 13.0.1, with Python 3.9, 3.10, and 3.11 all installed using the python.org installer), and the error occurs intermittently; likewise with a fresh build of 3.12. Disabling the local firewall does not avoid the problem.

This appears to be a timing problem: the main process is not yet listening on the socket when the child tries to connect.

Below is a crude hack that implements a retry loop and appears to fix the issue for me. I've added it as an inline patch instead of a PR because I'm far from convinced that this is a correct fix. I've barely used multiprocessing myself and know too little about its design to say where the correct place to implement a retry loop would be.

diff --git a/Lib/multiprocessing/connection.py b/Lib/multiprocessing/connection.py
index b08144f7a1..7954fefd62 100644
--- a/Lib/multiprocessing/connection.py
+++ b/Lib/multiprocessing/connection.py
@@ -625,9 +625,13 @@ def SocketClient(address):
     '''
     family = address_type(address)
     with socket.socket( getattr(socket, family) ) as s:
-        s.setblocking(True)
-        s.connect(address)
-        return Connection(s.detach())
+        for _ in range(3):
+            try:
+                s.setblocking(True)
+                s.connect(address)
+                return Connection(s.detach())
+            except socket.error:
+                time.sleep(0.5)
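
For comparison, here is a standalone sketch of the same retry idea. It is not the fix that was eventually applied; it only illustrates two details a real retry loop would probably need: a fresh socket per attempt (a socket whose connect() has failed should not be reused) and re-raising the last error instead of silently returning None once the attempts are exhausted. The function name connect_with_retry and the attempt/delay values are made up for the example.

```python
import socket
import time


def connect_with_retry(address, family=socket.AF_UNIX, attempts=3, delay=0.5):
    """Return a connected socket, retrying transient connection refusals.

    AF_UNIX is the family the manager typically uses on macOS; adjust as needed.
    """
    last_exc = None
    for _ in range(attempts):
        s = socket.socket(family)
        try:
            s.setblocking(True)
            s.connect(address)
            return s
        except OSError as exc:
            # Discard the failed socket and try again after a short pause.
            s.close()
            last_exc = exc
            time.sleep(delay)
    raise last_exc
```
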
ronaldoussoren commented 1 year ago

@applio and/or @gpshead, what would be the correct place to implement a retry loop as sketched in my previous message? Or is retrying not the right solution here?

ronaldoussoren commented 10 months ago

This is the same problem as #101225, just hit at a different backlog limit.

ronaldoussoren commented 10 months ago

The race condition doesn't happen for me with the fix for #101225. That technically just reduces the size of the window where the race condition can happen, but it should be fine given that I've increased the backlog far beyond what's needed to avoid hitting the race (famous last words...)
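
To make the backlog point concrete, here is a minimal, self-contained illustration; it is not the #101225 patch, the backlog value and addresses are made up for the example, and it uses a TCP listener for simplicity, whereas the manager on macOS listens on an AF_UNIX socket. With a larger backlog, a burst of children can complete connect() while the parent has not yet called accept(), which is what shrinks the race window described above.

```python
from multiprocessing.connection import Client, Listener
import multiprocessing as mp


def child(address, authkey):
    # Every worker connects as soon as it starts; with a small listen backlog a
    # burst of simultaneous connects can overflow the queue and be refused
    # before the parent gets around to accepting them.
    with Client(address, authkey=authkey) as conn:
        conn.send(mp.current_process().name)


if __name__ == '__main__':
    authkey = b'not-a-secret'
    # backlog=64 is an illustrative value; Listener's default backlog is 1.
    with Listener(('localhost', 0), authkey=authkey, backlog=64) as listener:
        procs = [mp.Process(target=child, args=(listener.address, authkey))
                 for _ in range(25)]
        for p in procs:
            p.start()
        for _ in procs:
            with listener.accept() as conn:
                print(conn.recv())
        for p in procs:
            p.join()
```
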