Open furkanonder opened 1 year ago
The leakage is from FailingInitializerMixin
. Don't know how to fix though.
Two failing tests:
test.test_concurrent_futures.ProcessPoolForkserverFailingInitializerTest.test_initializer
test.test_concurrent_futures.ProcessPoolSpawnFailingInitializerTest.test_initializer
git bisect
that the first failed commit is 6883007a86bdf0d7cf4560b949fd5e577dab1013, https://github.com/python/cpython/pull/12271git bisect
again to find the specific problematic file unittest/result.py
and this is the git diff
diff --git a/Lib/unittest/result.py b/Lib/unittest/result.py
index 5ca4c23238..fa9bea47c8 100644
--- a/Lib/unittest/result.py
+++ b/Lib/unittest/result.py
@@ -43,6 +43,7 @@ def __init__(self, stream=None, descriptions=None, verbosity=None):
self.skipped = []
self.expectedFailures = []
self.unexpectedSuccesses = []
+ self.collectedDurations = []
self.shouldStop = False
self.buffer = False
self.tb_locals = False
@@ -157,6 +158,12 @@ def addUnexpectedSuccess(self, test):
"""Called when a test was expected to fail, but succeed."""
self.unexpectedSuccesses.append(test)
def wasSuccessful(self): """Tells whether or not this result was a success."""
5. Indeed when I put a return before `self.collectedDurations.append((test, elapsed))` , all is OK
Ran 1 test in 0.398s
OK
4. I found out with some debugging and print that on the failed case (`collectedDurations.append`) the `resource_tracker` is called before the `sem_unlink` which will try to unlink them later and fail, since the resource tracker already did it. Only 3 semaphore objects are failing here, trying to unlink **after** `addDuration` function.
case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-4lu9pdgs case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-61ynpjfw case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-wrenmqfa case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-e01eptc5 case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-sssjxme9 0.27s run addDuration Lib/multiprocessing/resource_tracker.py:236: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown call _CLEANUP_FUNCS semaphore function for /mp-u8k6h3v7 file call _CLEANUP_FUNCS semaphore function for /mp-ihqrt4f2 file call _CLEANUP_FUNCS semaphore function for /mp-n4h31gcc file suite.py(132)run() -> self._handleModuleTearDown(result) -> run sem_unlink(name) on /mp-n4h31gcc FileNotFoundError: [Errno 2] No such file or directory suite.py(132)run() -> self._handleModuleTearDown(result) -> run sem_unlink(name) on /mp-u8k6h3v7 FileNotFoundError: [Errno 2] No such file or directory suite.py(132)run() -> self._handleModuleTearDown(result) -> run sem_unlink(name) on /mp-ihqrt4f2 FileNotFoundError: [Errno 2] No such file or directory
(reduced the output so it will be clearer)
5. In other hand, when I ignore this line, the `sem_ulink` for those 3 semaphore objects are called suceesfully after the `addDuration` function *but from another trace*
case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-ch65p5qm case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-z2m0y_8x case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-mi_764q8 case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-ntvllres case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-f5553813 0.27s run addDuration .suite.py(122)test(result) -> suite.py(84)return self.run(*args, kwds) -> run sem_unlink(name) on /mp-jvota4jy suite.py(122)test(result) -> suite.py(84)return self.run(*args, *kwds) -> run sem_unlink(name) on /mp-hxf6fzql suite.py(122)test(result) -> suite.py(84)return self.run(args, kwds) -> run sem_unlink(name) on /mp-fusub384
6. I'm trying now to understand what caused the `resource_tracker` to trigger befure the `sem_unlink`. For now I think that for some reason the `sem_unlink` from `TestSuite` line 122 is not reached, and thefore the resouce tracker is trying to release the resources. And later the call to `sem_unlink` from `TestSuite` line 132 is trying to run and fails
![image](https://github.com/python/cpython/assets/20026896/91175912-ef3c-4a0d-90c3-874d8758fa1f)
7. The key is to understand why when the `self.collectedDurations` is not empty, it blocks us from cleaning as usual those 3 semaphore objects
Ok, found more -
The leaked semaphore objects are from FailingInitializerMixin
(as @sunmy2019 mentioned before, but now it's clearer for me).
This line creates a new queue with 3 semaphore related and when the collectedDurations
is not empty the self.log_queue
is not released (still in used I guess)
################
FailingInitializerMixin.setUp: Calling self.mp_context.Queue()
Queue __init__: Set self._rlock = ctx.Lock()
Add semlock to register and finalize self._semlock.name='/mp-lnmmngl0', self.__class__.__name__='Lock'
Queue __init__: Set self._wlock = ctx.Lock()
Add semlock to register and finalize self._semlock.name='/mp-5fhxbfsv', self.__class__.__name__='Lock'
Queue __init__: Set ctx.BoundedSemaphore(maxsize)
Add semlock to register and finalize self._semlock.name='/mp-8mzh94sd', self.__class__.__name__='BoundedSemaphore'
################
Still investigating
For some additional context, I also ran git bisect
and landed on the same commit identified above. I did not get as far as the post above me did in terms of identifying the exact reasons why the 6 leaked semaphore objects were not cleaned up before shutdown, but I also narrowed down the problem and found the 6 leaked semaphore objects
warning goes away when the following block (also mentioned above):
def addDuration(self, test, elapsed):
"""Called when a test finished to run, regardless of its outcome."""
# support for a TextTestRunner using an old TestResult class
if hasattr(self, "collectedDurations"):
self.collectedDurations.append((test, elapsed))
is completely commented out:
# def addDuration(self, test, elapsed):
# """Called when a test finished to run, regardless of its outcome."""
# # support for a TextTestRunner using an old TestResult class
# if hasattr(self, "collectedDurations"):
# self.collectedDurations.append((test, elapsed))
Another thing that I found interesting was the leaked semaphore objects look like TemporaryFile
s:
/mp-ch65p5qm
/mp-z2m0y_8x
/mp-mi_764q8
/mp-ntvllres
/mp-f5553813
However, I'm not sure about this, since it does not look like TemporaryFile
is called anywhere in test_concurrent_futures.py
except for https://github.com/python/cpython/blob/89867d2491c0c3ef77bc237899b2f0762f43c03c/Lib/test/test_concurrent_futures.py#L1248-L1249 but this does not look relevant for the problem here (but I'm adding this observation here in case it is relevant).
@shailshouryya
Not related to TemporaryFile, those are virtual files used by each semaphore object as the system locking mechanisem
Accessing named semaphores via the file system On Linux, named semaphores are created in a virtual file system, normally mounted under /dev/shm
Source: https://linux.die.net/man/7/sem_overview
For example, if you stop the process in the middle
you will be able to see those virtual files here
which will be deleted on _multiprocessing.sem_unlink
method
Ok, found how to fix the issue! 🚀
The current code calls self.collectedDurations.append((test, elapsed))
on addDuration
method (as wrote in previous comment).
This causes the 3 sempahors of FailingInitializerMixin.log_queue
to not being released before the resource tracker
is stopped and clear by itself all left resources.
.collectedDurations.append
), the ResourceTracker
is stopped after we clear the 3 semaphors -ResourceTracker
is stopped before we clear the 3 semaphors -This happens since we call _run_finalizers()
after resource_tracker._resource_tracker._stop()
So, optional solution No.1 will be to change the order of the calls, like this -
And it solves the issue perfectly
@vstinner - I saw that you were the one who wrote this part, Can we do such change? If yes, I'll make a PR for it.
@iritkatriel - I'll happy to here what you think here, thanks.
Still open question for me, is why the objects are not finalize normally (by the gc I assume) like it was before. I'm still investigating on this
Hey,
Found better solution and also what's happening here.
The reason those 3 semaphores were still alive and not deleted, it's because there weren't garabage collected yet, since they have a reference.
How?
Those 3 semaphors are related to FailingInitializerMixin.log_queue
object (they are created as part of this queue).
This mixin is used to create the test class ProcessPoolSpawnFailingInitializerTest
which was used on the test result private object collectedDurations
, and since the result object continue to live after the test object ended, the test object can't be collected and therefore the semaphors too...
Why we need the test object there?
For displaying it on unittest.runner._printDurations
method.
So how to fix it?
Very simple, instead of passing the test object, we just need here the test repr only!
if hasattr(self, "collectedDurations"):
self.collectedDurations.append((str(test), elapsed))
And this works perfect! 🚀
I'll create a PR for this @iritkatriel
Good find!
That PR was a helper to make failures like this easier to diagnose.
The issue itself is not fixed; it still appears in refleak tests (./python -m test test_concurrent_futures -R:
).
Bug report
Your environment
Linked PRs