python / cpython

The Python programming language
https://www.python.org

Leaked semaphore objects in test_concurrent_futures #104090

Open furkanonder opened 1 year ago

furkanonder commented 1 year ago

Bug report

$ ./python -m test test_concurrent_futures
0:00:00 load avg: 5.30 Run tests sequentially
0:00:00 load avg: 5.30 [1/1] test_concurrent_futures
/workspace/cpython/Lib/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=150426) is multi-threaded, use of fork() may lead to deadlocks in the child.
  self.pid = os.fork()
/workspace/cpython/Lib/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Traceback (most recent call last):
  File "/workspace/cpython/Lib/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/workspace/cpython/Lib/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/cpython/Lib/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
[The same traceback repeats five more times, once per leaked semaphore.]
test_concurrent_futures passed in 2 min 17 sec

== Tests result: SUCCESS ==

1 test OK.

Total duration: 2 min 17 sec
Tests result: SUCCESS


sunmy2019 commented 1 year ago

The leakage is from FailingInitializerMixin. I don't know how to fix it, though.

sobolevn commented 1 year ago

Two failing tests:

bityob commented 1 year ago
  1. Found out using git bisect that the first failing commit is 6883007a86bdf0d7cf4560b949fd5e577dab1013, https://github.com/python/cpython/pull/12271
  2. I split the commit into multiple new commits, one per file, so I could use git bisect again to find the specific problematic file
  3. The file is `unittest/result.py` and this is the git diff:
    
    diff --git a/Lib/unittest/result.py b/Lib/unittest/result.py
    index 5ca4c23238..fa9bea47c8 100644
    --- a/Lib/unittest/result.py
    +++ b/Lib/unittest/result.py
    @@ -43,6 +43,7 @@ def __init__(self, stream=None, descriptions=None, verbosity=None):
         self.skipped = []
         self.expectedFailures = []
         self.unexpectedSuccesses = []
    +        self.collectedDurations = []
         self.shouldStop = False
         self.buffer = False
         self.tb_locals = False
    @@ -157,6 +158,12 @@ def addUnexpectedSuccess(self, test):
         """Called when a test was expected to fail, but succeed."""
         self.unexpectedSuccesses.append(test)


4. I found out with some debugging and print statements that in the failing case (`collectedDurations.append`) the `resource_tracker` is called before `sem_unlink`, which then tries to unlink the semaphores later and fails, since the resource tracker has already done it. Only 3 semaphore objects fail here, trying to unlink **after** the `addDuration` function.

case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-4lu9pdgs
case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-61ynpjfw
case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-wrenmqfa
case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-e01eptc5
case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-sssjxme9
0.27s run addDuration
Lib/multiprocessing/resource_tracker.py:236: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown
call _CLEANUP_FUNCS semaphore function for /mp-u8k6h3v7 file
call _CLEANUP_FUNCS semaphore function for /mp-ihqrt4f2 file
call _CLEANUP_FUNCS semaphore function for /mp-n4h31gcc file
suite.py(132)run() -> self._handleModuleTearDown(result) -> run sem_unlink(name) on /mp-n4h31gcc -> FileNotFoundError: [Errno 2] No such file or directory
suite.py(132)run() -> self._handleModuleTearDown(result) -> run sem_unlink(name) on /mp-u8k6h3v7 -> FileNotFoundError: [Errno 2] No such file or directory
suite.py(132)run() -> self._handleModuleTearDown(result) -> run sem_unlink(name) on /mp-ihqrt4f2 -> FileNotFoundError: [Errno 2] No such file or directory

(output reduced for clarity)
5. On the other hand, when I ignore this line, the `sem_unlink` calls for those 3 semaphore objects succeed after the `addDuration` function, *but from another trace*

case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-ch65p5qm
case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-z2m0y_8x
case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-mi_764q8
case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-ntvllres
case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-f5553813
0.27s run addDuration
suite.py(122)test(result) -> suite.py(84)return self.run(*args, **kwds) -> run sem_unlink(name) on /mp-jvota4jy
suite.py(122)test(result) -> suite.py(84)return self.run(*args, **kwds) -> run sem_unlink(name) on /mp-hxf6fzql
suite.py(122)test(result) -> suite.py(84)return self.run(*args, **kwds) -> run sem_unlink(name) on /mp-fusub384


6. I'm now trying to understand what caused the `resource_tracker` to trigger before the `sem_unlink`. For now I think that for some reason the `sem_unlink` from `TestSuite` line 122 is not reached, and therefore the resource tracker tries to release the resources. Later, the call to `sem_unlink` from `TestSuite` line 132 runs and fails.
![image](https://github.com/python/cpython/assets/20026896/91175912-ef3c-4a0d-90c3-874d8758fa1f)
7. The key is to understand why, when `self.collectedDurations` is not empty, we are blocked from cleaning up those 3 semaphore objects as usual
bityob commented 1 year ago

Ok, found more - the leaked semaphore objects are from FailingInitializerMixin (as @sunmy2019 mentioned before, but now it's clearer to me).


This line creates a new queue with 3 related semaphores, and when `collectedDurations` is not empty, `self.log_queue` is not released (still in use, I guess):

################
FailingInitializerMixin.setUp: Calling self.mp_context.Queue()
Queue __init__: Set self._rlock = ctx.Lock()
Add semlock to register and finalize self._semlock.name='/mp-lnmmngl0', self.__class__.__name__='Lock'
Queue __init__: Set self._wlock = ctx.Lock()
Add semlock to register and finalize self._semlock.name='/mp-5fhxbfsv', self.__class__.__name__='Lock'
Queue __init__: Set ctx.BoundedSemaphore(maxsize)
Add semlock to register and finalize self._semlock.name='/mp-8mzh94sd', self.__class__.__name__='BoundedSemaphore'
################
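The three registrations in the log above come straight from `multiprocessing.queues.Queue.__init__`. A minimal sketch that reproduces them (peeking at the private `_rlock`, `_wlock`, and `_sem` attributes purely for illustration; they are internals, not public API):

```python
import multiprocessing as mp

# Creating a queue allocates three named semaphores under the hood:
# a read lock, a write lock (on POSIX), and a bounded semaphore.
ctx = mp.get_context("spawn")
q = ctx.Queue()

print(type(q._rlock).__name__)   # Lock
print(type(q._wlock).__name__)   # Lock (this attribute is None on Windows)
print(type(q._sem).__name__)     # BoundedSemaphore
```

Each of these three objects registers its named semaphore with the resource tracker, which is exactly what the log shows.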

Still investigating

shailshouryya commented 1 year ago

For some additional context, I also ran git bisect and landed on the same commit identified above. I did not get as far as the post above me in identifying the exact reasons why the 6 leaked semaphore objects were not cleaned up before shutdown, but I also narrowed down the problem and found that the 6-leaked-semaphore-objects warning goes away when the following block (also mentioned above):

    def addDuration(self, test, elapsed):
        """Called when a test finished to run, regardless of its outcome."""
        # support for a TextTestRunner using an old TestResult class
        if hasattr(self, "collectedDurations"):
            self.collectedDurations.append((test, elapsed))

is completely commented out:

#    def addDuration(self, test, elapsed):
#        """Called when a test finished to run, regardless of its outcome."""
#        # support for a TextTestRunner using an old TestResult class
#        if hasattr(self, "collectedDurations"):
#            self.collectedDurations.append((test, elapsed))

Another thing I found interesting was that the leaked semaphore objects look like TemporaryFiles:

/mp-ch65p5qm
/mp-z2m0y_8x
/mp-mi_764q8
/mp-ntvllres
/mp-f5553813

However, I'm not sure about this, since it does not look like TemporaryFile is called anywhere in test_concurrent_futures.py except for https://github.com/python/cpython/blob/89867d2491c0c3ef77bc237899b2f0762f43c03c/Lib/test/test_concurrent_futures.py#L1248-L1249, but this does not look relevant to the problem here (I'm adding this observation in case it is relevant).

bityob commented 1 year ago

@shailshouryya

Not related to TemporaryFile; those are virtual files used by each semaphore object as the system's locking mechanism.

> Accessing named semaphores via the file system: On Linux, named semaphores are created in a virtual file system, normally mounted under /dev/shm

Source: https://linux.die.net/man/7/sem_overview
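To make the quote concrete, here is a small sketch (Linux-specific; the `/dev/shm/sem.*` naming is a glibc detail, and the semaphore name here is made up) that creates a raw named semaphore and shows that a second `sem_unlink` on the same name raises exactly the `FileNotFoundError` seen in the tracebacks above:

```python
import os
import _multiprocessing

# A made-up, hopefully-unique name; real multiprocessing names look like /mp-xxxxxxxx.
name = f"/mp-demo-{os.getpid()}"

# kind=1 is SEMAPHORE (see Lib/multiprocessing/synchronize.py); unlink=False keeps the name alive.
sem = _multiprocessing.SemLock(1, 1, 1, name, False)

# On Linux with glibc, the named semaphore is backed by a file under /dev/shm.
print(os.path.exists("/dev/shm/sem." + name.lstrip("/")))

second_unlink_failed = False
_multiprocessing.sem_unlink(name)        # first unlink removes the name
try:
    _multiprocessing.sem_unlink(name)    # second unlink: the error from the tracebacks
except FileNotFoundError as e:
    second_unlink_failed = True
    print("second unlink failed:", e)
```

This is the race in this issue in miniature: whoever unlinks second (finalizer or resource tracker) gets the `FileNotFoundError`.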

For example, if you stop the process in the middle, you will be able to see those virtual files under /dev/shm; they are deleted by the `_multiprocessing.sem_unlink` method.

bityob commented 1 year ago

Ok, found how to fix the issue! 🚀

Full details

The current code calls `self.collectedDurations.append((test, elapsed))` in the `addDuration` method (as described in a previous comment).

This causes the 3 semaphores of `FailingInitializerMixin.log_queue` not to be released before the resource tracker is stopped and cleans up all remaining resources itself.

Example 1 (without `.collectedDurations.append`): the ResourceTracker is stopped after we clear the 3 semaphores.

Example 2 (current code): the ResourceTracker is stopped before we clear the 3 semaphores.

This happens since we call `_run_finalizers()` after `resource_tracker._resource_tracker._stop()`.

So, optional solution No. 1 would be to change the order of the calls - and it solves the issue perfectly.
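Since the screenshots are gone from this thread, here is a toy model of why the ordering matters (hypothetical names; the real code lives in `Lib/multiprocessing/util.py` and `Lib/multiprocessing/resource_tracker.py`). If the tracker's leak sweep runs first, every finalizer's `sem_unlink` hits a name that is already gone:

```python
def shutdown(finalizers_first):
    """Toy model of interpreter shutdown; returns the number of unlink errors."""
    shm = {"/mp-aaa", "/mp-bbb", "/mp-ccc"}   # semaphore names still in /dev/shm
    tracked = set(shm)                         # the resource tracker's registry
    finalizer_names = list(shm)                # each SemLock finalizer remembers its name
    errors = 0

    def tracker_stop():                        # leak sweep at tracker shutdown
        for name in list(tracked):
            shm.discard(name)
            tracked.discard(name)

    def run_finalizers():                      # each finalizer: sem_unlink + unregister
        nonlocal errors
        for name in finalizer_names:
            if name not in shm:
                errors += 1                    # FileNotFoundError in the real code
            else:
                shm.discard(name)
                tracked.discard(name)

    if finalizers_first:
        run_finalizers(); tracker_stop()
    else:
        tracker_stop(); run_finalizers()
    return errors

print(shutdown(finalizers_first=False))  # current order: 3 unlink errors
print(shutdown(finalizers_first=True))   # proposed order: 0 errors
```

With finalizers running first, every registered semaphore is unlinked and unregistered before the tracker does its sweep, so neither side sees a missing name.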

@vstinner - I saw that you were the one who wrote this part. Can we make such a change? If yes, I'll make a PR for it.

@iritkatriel - I'd be happy to hear what you think, thanks.

A still-open question for me is why the objects are not finalized normally (by the gc, I assume) like they were before. I'm still investigating this.

bityob commented 1 year ago

Hey,

Found better solution and also what's happening here.

The reason those 3 semaphores were still alive and not deleted is that they weren't garbage collected yet, since they still had a reference.

How?

Those 3 semaphores are related to the `FailingInitializerMixin.log_queue` object (they are created as part of this queue).

This mixin is used to create the test class `ProcessPoolSpawnFailingInitializerTest`, which is stored in the test result's private `collectedDurations` list; since the result object continues to live after the test object has finished, the test object can't be collected, and therefore neither can the semaphores...
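A minimal sketch of the mechanism (with a hypothetical stand-in class, since the real chain runs through `TestResult.collectedDurations` holding the test case, which holds `log_queue`, which holds the semaphores):

```python
import gc
import weakref

class FakeTest:
    """Hypothetical stand-in for a TestCase that owns heavyweight resources."""
    def __repr__(self):
        return "FakeTest"

def record_duration(store_repr):
    collected_durations = []          # plays the role of TestResult.collectedDurations
    test = FakeTest()
    ref = weakref.ref(test)
    item = str(test) if store_repr else test
    collected_durations.append((item, 0.27))
    del test                          # the test run is over...
    gc.collect()
    return ref() is None              # ...was the test object collected?

print(record_duration(store_repr=False))  # False: the durations list keeps it alive
print(record_duration(store_repr=True))   # True: only the string is stored
```

Storing `str(test)` instead of `test` breaks the reference chain, which is exactly the fix proposed below in this thread.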

Why we need the test object there?

For displaying it in the `unittest.runner._printDurations` method.

So how to fix it?

Very simple: instead of passing the test object, we only need the test's string representation here!

        if hasattr(self, "collectedDurations"):
            self.collectedDurations.append((str(test), elapsed))

And this works perfectly! 🚀

I'll create a PR for this @iritkatriel

iritkatriel commented 1 year ago

Good find!

encukou commented 8 months ago

That PR was a helper to make failures like this easier to diagnose. The issue itself is not fixed; it still appears in refleak tests (`./python -m test test_concurrent_futures -R:`).

zuliani99 commented 7 months ago

I get the same error in my case. I'm doing distributed training using spawn multiprocessing and DDP, and at the end of running my application I also get a specific number of leaked semaphores.

This is the link of the issue.

I'm wondering if this issue is solved or not.