python / cpython

The Python programming language
https://www.python.org
Other
63.32k stars 30.31k forks source link

Potential regression in Python 3.11 (multiprocess shutdown?) #97641

Closed carltongibson closed 9 months ago

carltongibson commented 2 years ago

Running the Django test suite against the Python 3.11 pre-releases, we have hit a potential regression.

Steps to reproduce

git clone https://github.com/django/django.git
python3.11 -m venv env
.\env\Scripts\Activate.ps1
python -m pip install -e django
cd django\tests
python runtests.py --parallel

Bug report

On Python 3.8, 3.9, and 3.10 this runs without problem.

On Python 3.11 the following error in seen after the test suite completes, during shutdown:

Traceback (most recent call last):
  File "C:\Users\carlt\src\django\tests\runtests.py", line 773, in <module>
    failures = django_tests(
               ^^^^^^^^^^^^^
  File "C:\Users\carlt\src\django\tests\runtests.py", line 432, in django_tests
    failures = test_runner.run_tests(test_labels)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlt\src\django\django\test\runner.py", line 1057, in run_tests
    self.teardown_databases(old_config)
  File "C:\Users\carlt\src\django\django\test\runner.py", line 980, in teardown_databases
    _teardown_databases(
  File "C:\Users\carlt\src\django\django\test\utils.py", line 378, in teardown_databases
    connection.creation.destroy_test_db(
  File "C:\Users\carlt\src\django\django\db\backends\base\creation.py", line 307, in destroy_test_db
    self._destroy_test_db(test_database_name, verbosity)
  File "C:\Users\carlt\src\django\django\db\backends\sqlite3\creation.py", line 111, in _destroy_test_db
    os.remove(test_database_name)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'other_1.sqlite3'

This looked similar to us to open issue #95027 but we were asked to report it separately.

I've tested all the way back to a1 where (along with other issues now resolved) this error still occurs:

(django-py311) PS C:\Users\carlt\src\django\tests> python --version
Python 3.11.0rc2
...
(django-py311) PS C:\Users\carlt\src\django\tests> python --version
Python 3.11.0a1

This is somewhat frustrating as we've tried to test on all platforms since the first releases. Our test suite would only run on Windows with Python 3.11 very recently as there was a third-party dependency that was not compatible with Windows. We will try to adjust to test without dependencies as well on Windows for future versions. (Sorry about that.)

Please do let us know if we can provide further info. I imagine though the easiest thing is for you to run this yourself.

Thanks.

//cc @felixxm

Your environment

sobolevn commented 2 years ago

Based on C:\Users\carlt\src\django\django\db\backends\sqlite3\creation.py I think it might be sqlite related 🤔

corona10 commented 2 years ago

cc @erlend-aasland

erlend-aasland commented 2 years ago

I don't have access to a PC with Windows 11, so unfortunately I am unable debug this :( If someone with Windows 11 could do a bisect, that would be super helpful.

A lot of changes happened between 3.10 entered the beta phase and the first 3.11 alpha was released:

$ git log --oneline --until=2021-10-05 --since=2021-05-03 Modules/_sqlite | wc -l
56
carltongibson commented 2 years ago

If someone with Windows 11 could do a bisect, that would be super helpful.

Hi @erlend-aasland I've given this a try but I'm hitting various (A, for earlier commits) tool-chain and (B, for later commits) linker errors trying to build previous versions that (frankly) I'm not familiar enough with Windows, or building CPython for that matter, to be able to progress with in a reasonable timeframe.

I was able to reproduce the issue at 0474d06008, but I'm getting build errors by a3c11cebf1. A Windows expert would likely have more success.

(Sorry to not be of more help. Hopefully I'm a few steps along the road for the future.)

carltongibson commented 2 years ago

If someone with Windows 11 could do a bisect, that would be super helpful.

Hi @erlend-aasland I've given this a try but I'm hitting various (A, for earlier commits) tool-chain and (B, for later commits) linker errors trying to build previous versions that I'm not familiar enough with Windows, or building CPython for that matter, to be able to progress with in a reasonable timeframe.

I was able to reproduce the issue at 0474d06008, but I'm getting build errors by a3c11cebf1. A Windows expert would likely have more success.

I'm happy to keep trying, but I've reached a bit of a roadblock with building Python failing

(Sorry to not be of more help. Hopefully I'm a few steps along the road for the future.)

felixxm commented 2 years ago

Can someone add OS-Windows label? :window:

felixxm commented 2 years ago

I'm not sure if it's related :thinking: but after forcing spawn on Linux, I'm getting:

Exception ignored in: <function Pool.__del__ at 0x7f5548c9d580>
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 268, in __del__
ResourceWarning: unclosed running multiprocessing pool <multiprocessing.pool.Pool state=RUN pool_size=8>

/usr/local/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
tkanhe commented 2 years ago

Just checked on following PC specifications: Edition: Windows 11 Home Single Language Version: 21H2 OS build: 22000.1042

with Python 3.11.0rc2, Django-4.2.dev20221005120449

Output:

`PS C:\Users\TK\Desktop> cd django\testsPS C:\Users\TK\Desktop\django\tests> python runtests.py --parallel Testing against Django installed in 'C:\Users\TK\Desktop\django\django' with up to 16 processes
Found 16023 test(s). Creating test database for alias 'default'... Cloning test database for alias 'default'... Cloning test database for alias 'default'... Cloning test database for alias 'default'... Cloning test database for alias 'default'... Cloning test database for alias 'default'... Cloning test database for alias 'default'... Cloning test database for alias 'default'... Cloning test database for alias 'default'... Cloning test database for alias 'default'... Cloning test database for alias 'default'... Cloning test database for alias 'default'... Cloning test database for alias 'default'... Cloning test database for alias 'default'... Cloning test database for alias 'default'... Cloning test database for alias 'default'... Cloning test database for alias 'default'... Creating test database for alias 'other'... Cloning test database for alias 'other'... Cloning test database for alias 'other'... Cloning test database for alias 'other'... Cloning test database for alias 'other'... Cloning test database for alias 'other'... Cloning test database for alias 'other'... Cloning test database for alias 'other'... Cloning test database for alias 'other'... Cloning test database for alias 'other'... Cloning test database for alias 'other'... Cloning test database for alias 'other'... Cloning test database for alias 'other'... Cloning test database for alias 'other'... Cloning test database for alias 'other'... Cloning test database for alias 'other'... Cloning test database for alias 'other'... System check identified no issues (17 silenced). ssssssssssssssssssssssssssssssssssssssssssssss.................................................................ss.........s.....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................s..............................................s..................................................................................................................s.....................s.....sss..ss.....................................................................................................................................................................................................................................................................................................................................................................................................s.....................................................................................................................................................................................s.................................................................s................sss...s.ss.....s..ss.sss..s......ss.s.................s........s.....................ss.....................s..s....s...s.....................s..............s......s...s...ss............s.....................

Ran 16021 tests in 74.720s

OK (skipped=1828, expected failures=5) Destroying test database for alias 'default'... Destroying test database for alias 'default'... Destroying test database for alias 'default'... Destroying test database for alias 'default'... Destroying test database for alias 'default'... Destroying test database for alias 'default'... Destroying test database for alias 'default'... Destroying test database for alias 'default'... Destroying test database for alias 'default'... Destroying test database for alias 'default'... Destroying test database for alias 'default'... Destroying test database for alias 'default'... Destroying test database for alias 'default'... Destroying test database for alias 'default'... Destroying test database for alias 'default'... Destroying test database for alias 'default'... Destroying test database for alias 'default'... Destroying test database for alias 'other'... Destroying test database for alias 'other'... Destroying test database for alias 'other'... Destroying test database for alias 'other'... Destroying test database for alias 'other'... Destroying test database for alias 'other'... Destroying test database for alias 'other'... Destroying test database for alias 'other'... Destroying test database for alias 'other'... Destroying test database for alias 'other'... Destroying test database for alias 'other'... Destroying test database for alias 'other'... Destroying test database for alias 'other'... Destroying test database for alias 'other'... Destroying test database for alias 'other'... Destroying test database for alias 'other'... Destroying test database for alias 'other'...`

*truncked some of .......sss........ output.

Let me know any other info/tests needed; I will be happy to help in the future regarding testing on Windows 11.

stic commented 2 years ago

Couldn't replicate on auto (16 cores laptop). python runtests.py --parallel Bumping it up to ~40 started hitting the sharing violation on the removal of the 'other_1.sqlite3' (weirdly only the 1st i.e. other_1). python runtests.py --parallel 40

So confirmed on latest 3.11.0rc2.

I've disabled Windows Defender on the folder under test (django) as well as %TEMP% to make sure Windows Defender it's not causing issues. That's due to fact that I've got quite a few issues with pickling (i.e. cleaning after that) e.g.

  File "c:\users\<username>\AppData\Local\Programs\Python\Python311\Lib\shutil.py", line 623, in _rmtree_unsafe
    os.rmdir(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\<username>\\AppData\\Local\\Temp\\django_jpnyol3z\\django_16co2dfg\\tmpcaxkg2x2\\test_project'

Not 100% sure if that's a good troubleshooting approach, but a few observations: Pass that passes manages to call CloseFile from the first process, that is the process creating the other_1.sqlite3 and it's the same process that will try to delete the db file (clean-up). Crashing test run doesn't make such CloseFile call, last syscalls seen on system level are to unlock the file. In itself that shouldn't be a problem I guess (as it's the same PID trying to close / remove the file later on). However, there is quite a gap (up to 2sec on my box) between syscalls to unlock the file and close. Can't really figure it out yet (first time with Django test suit), but if you compare vs. a run on 3.10 there is a visible wait whilst: Creating test database for alias 'other'... Working theory - we are coming back on the same PID (process) to clean-up a file, that we are still keeping open (for close operation) elsewhere?

erlend-aasland commented 2 years ago

FTR, I'm in the process of installing a Windows 11 development environment now. Will hopefully be up and running by this evening, or at least tomorrow. I'll need to delve into how to debug CPython on Windows using Visual Studio, since I have little experience with both Windows and Visual Studio.

erlend-aasland commented 2 years ago

cc. release-manager @pablogsal

erlend-aasland commented 2 years ago

Couldn't replicate on auto (16 cores laptop). python runtests.py --parallel

Ditto (Win 11, 16 cores)

Bumping it up to ~40 started hitting the sharing violation on the removal of the 'other_1.sqlite3' (weirdly only the 1st i.e. other_1). python runtests.py --parallel 40

So confirmed on latest 3.11.0rc2.

Ditto. Bumped to 40, hit the PermissionError for other_1.sqlite3.

erlend-aasland commented 2 years ago

Thanks so much for your input Eryk. AFAICS, this is not an sqlite3 bug; it is an issue with the test runner itself.

erlend-aasland commented 2 years ago

I'm for some reason unable to install Visual Studio 2015, so I'm unable to compile 3.10.0 beta1 at the moment. Thus, I'm once again unable to bisect this :(

carltongibson commented 2 years ago

This still looks to be failing with 3.11.0, so still looks like a regression from 3.10.

felixxm commented 2 years ago

I was able to fix this by adding an explicit connection close() (see https://github.com/django/django/pull/16225). A similar issue occurred in sqlite-utils, see https://github.com/simonw/sqlite-utils/issues/503. I'm not sure why it started showing up in Python 3.11 but I don't mind closing the issue.

eryksun commented 2 years ago

Thanks so much for your input Eryk. AFAICS, this is not an sqlite3 bug; it is an issue with the test runner itself.

Sorry, that comment wasn't meant for this issue. I cannot reproduce this issue to diagnose it. For me, there's a significant window of time between when the worker processes access the "*.sqlite3" files and when the main process deletes them at shutdown. If something else in the main process sometimes has the file open without delete sharing, I haven't been able to reproduce it in 3.11. (Note that any locking/unlocking of the database file that may be observed in monitoring tools has nothing to do with a sharing violation. Byte-range locks can lead to locking violations for read and write operations, but not to a sharing violation.)

That said, for projects that are using multiprocessing for parallel testing, I think it would help if multiprocessing introduced optional support for process groups and, in Windows, job objects, in order to do a better job ensuring that the entire process tree has been given a chance to gracefully exit, and subsequently to forcefully kill any remaining processes.

eryksun commented 2 years ago

I'm not sure why it started showing up in Python 3.11

3.11 switched to using functools.lru_cache for the statement cache. See #87028. This introduces a reference cycle involving the connection object, so the connection object no longer gets immediately deallocated, not until, for example, gc.collect() is called to break the cycle.

erlend-aasland commented 2 years ago

I'm not sure why it started showing up in Python 3.11

3.11 switched to using functools.lru_cache for the statement cache. See #87028. This introduces a reference cycle involving the connection object, so the connection object no longer gets immediately deallocated, not until, for example, gc.collect() is called to break the cycle.

Good point. I remember the Windows CI was unhappy in the first draft implementations of this enhancement. I'm not sure it is a bug that the connection object might live longer. Explicit resource control is advocated in the sqlite3 docs; closing connections explicitly is good practice.

eryksun commented 2 years ago

From what I can tell, the old implementation avoids the circular reference by manually decrementing the reference count of the connection object:

https://github.com/python/cpython/blob/6a1d165c4ccaffaf0872830f763c17f6d9437adf/Modules/_sqlite/connection.c#L155-L161

I suppose you could have kept the statement cache type as a skeleton that uses an lru_cache.

erlend-aasland commented 2 years ago

From what I can tell, the old implementation avoids the circular reference by manually decrementing the reference count of the connection object

True, it used that hack. Part of my motivation for getting rid of the (duplicate) LRU cache implementation in the _sqlite extension module was getting rid of this hack.

I suppose you could have kept the statement cache type as a skeleton that uses an lru_cache.

That would have been possible, but you'd keep the GC hack. I'm not sure that is a very good idea.

erlend-aasland commented 1 year ago

I was able to fix this by adding an explicit connection close() (see django/django#16225). A similar issue occurred in sqlite-utils, see simonw/sqlite-utils#503. I'm not sure why it started showing up in Python 3.11 but I don't mind closing the issue.

Following gh-108015, sqlite3 will now emit a ResourceWarning if connection objects are closed implicitly. Explicit resource management is recommended. cc. @eryksun

erlend-aasland commented 1 year ago

gh-108015 has now landed; Python 3.13 will emit a ResourceWarning if a connection object is closed implicitly. Explicit resource management is now recommended. @felixxm has reported that Django already started to clean up their test suite (django/django#17178).

It seems to me the problems mentioned in this issue can all be handled by improved resource management. Suggesting to close this.

erlend-aasland commented 9 months ago

Closing as per https://github.com/python/cpython/issues/97641#issuecomment-1688059764.