
A race condition in `mypy_test.py` #9537

Closed · AlexWaygood closed this 1 year ago

AlexWaygood commented 1 year ago

#9408 (my PR) appears to have introduced a race condition in `mypy_test.py`.

Evidence of the race

We've had several "flukey" errors in CI over the last few days, all with the same traceback:

```pytb
Testing third-party packages...
Traceback (most recent call last):
  File "./tests/mypy_test.py", line 557, in <module>
    main()
  File "./tests/mypy_test.py", line 544, in main
    code, files_checked_this_version = test_typeshed(code, args=config, tempdir=td_path)
  File "./tests/mypy_test.py", line 525, in test_typeshed
    code, third_party_files_checked = test_third_party_stubs(code, args, tempdir)
  File "./tests/mypy_test.py", line 500, in test_third_party_stubs
    setup_virtual_environments(distributions_to_check, args, tempdir)
  File "./tests/mypy_test.py", line 444, in setup_virtual_environments
    requirements_set, venv_info = venv_info_future.result()
  File "/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "./tests/mypy_test.py", line 384, in setup_venv_for_external_requirements_set
    return requirements_set, make_venv(venv_dir)
  File "/home/runner/work/typeshed/typeshed/tests/utils.py", line 150, in make_venv
    venv.create(venv_dir, with_pip=True, clear=True)
  File "/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/venv/__init__.py", line 390, in create
    builder.create(env_dir)
  File "/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/venv/__init__.py", line 68, in create
    self._setup_pip(context)
  File "/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/venv/__init__.py", line 288, in _setup_pip
    subprocess.check_output(cmd, stderr=subprocess.STDOUT)
  File "/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/subprocess.py", line 488, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/subprocess.py", line 1551, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 26] Text file busy: '/tmp/tmp45sugieh/.venv-5354540476898/bin/python'
```

Cause of the race

The race happens in this part of the test, where multiple virtual environments are set up concurrently using a threadpool.

https://github.com/python/typeshed/blob/0a291da2f08d90604e7eba80abe1e3f64a8a10a4/tests/mypy_test.py#L438-L445
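Roughly, the shape of that code is as follows (a simplified, self-contained sketch rather than the real `mypy_test.py`; the venv-directory naming and the `venv.create(venv_dir, with_pip=True, clear=True)` call are taken from the traceback above and the diff later in this issue):

```python
from __future__ import annotations

import venv
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def make_venv_sketch(requirements_set: frozenset[str], tempdir: Path) -> Path:
    venv_dir = tempdir / f".venv-{hash(requirements_set)}"
    # This with_pip=True creation is the call that ends in ETXTBSY in the traceback.
    venv.create(venv_dir, with_pip=True, clear=True)
    return venv_dir


def setup_virtual_environments_sketch(
    requirements_sets: set[frozenset[str]], tempdir: Path
) -> dict[frozenset[str], Path]:
    venvs: dict[frozenset[str], Path] = {}
    with ThreadPoolExecutor() as executor:
        # One venv per distinct set of external requirements, all created concurrently.
        futures = {executor.submit(make_venv_sketch, reqs, tempdir): reqs for reqs in requirements_sets}
        for future in as_completed(futures):
            venvs[futures[future]] = future.result()  # re-raises any OSError from the worker thread
    return venvs
```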

However, it's unclear exactly why the race occurs. The traceback indicates that two threads appear to be attempting to access the same Python executable at the same time. But it's hard to see why that would be the case.

Reproducing the race condition

@JelleZijlstra has managed to reproduce the race locally on a Linux machine using Python 3.7. This rules out any GitHub Actions-specific hypotheses for why the race might be happening.

I have not been able to reproduce the failure in my local environment (Windows, Python 3.10). (I've tried creating venvs for 145 stubs packages in a threadpool with 20 workers -- still no race for me.)

Note that mypy_test.py is only ever run on Ubuntu in CI.

Fixing the issue

Idea (1): Just create the venvs one-at-a-time

You can't have a race condition if you don't try to do things concurrently. Locally, this makes the test much slower for me -- the time taken to create the venvs goes up from around 20 seconds to around 60 seconds. The specific part of creating a venv that is slow is the venv.EnvBuilder._setup_pip function in the stdlib -- the exact part that appears to be implicated in the race condition.

However, if we did go this route, we could potentially mitigate the performance regression by caching venvs between runs of the test (both locally and in CI).
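Purely as an illustration of what such caching might look like (the cache directory and keying scheme below are invented for the example, not anything typeshed currently does):

```python
from __future__ import annotations

import hashlib
import venv
from pathlib import Path

CACHE_ROOT = Path(".mypy_test_venv_cache")  # hypothetical cache location


def cached_venv_for(requirements_set: frozenset[str]) -> Path:
    # Use a stable content hash for the cache key: the built-in hash() of a
    # frozenset of strings changes between interpreter runs.
    key = hashlib.sha256("\n".join(sorted(requirements_set)).encode()).hexdigest()[:16]
    venv_dir = CACHE_ROOT / f".venv-{key}"
    if not (venv_dir / "pyvenv.cfg").exists():  # crude "does this venv already exist?" check
        venv.create(venv_dir, with_pip=True, clear=True)
    return venv_dir
```

In CI the cache directory would additionally need to be persisted between workflow runs (e.g. with actions/cache).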

Idea (2): Do concurrency differently

Instead of using a threadpool, we could manually create our threads. Or we could move to an asyncio-based concurrency model.

Either of these might fix the race condition, but it's hard to know if they would without knowing exactly what's causing the race condition.
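For concreteness, the "manually create our threads" variant might look something like the sketch below (illustrative only; an asyncio version would be similar, with the blocking `venv.create()` calls pushed to worker threads):

```python
from __future__ import annotations

import threading
import venv
from pathlib import Path


def create_all_venvs(requirements_sets: set[frozenset[str]], tempdir: Path) -> None:
    errors: list[BaseException] = []

    def worker(reqs: frozenset[str]) -> None:
        try:
            venv.create(tempdir / f".venv-{hash(reqs)}", with_pip=True, clear=True)
        except BaseException as exc:
            errors.append(exc)  # collect, so we can re-raise in the main thread

    threads = [threading.Thread(target=worker, args=(reqs,)) for reqs in requirements_sets]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    if errors:
        raise errors[0]
```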

Idea (3): Paper over the problem

This diff would probably "fix" the problem, but it wouldn't be a particularly principled fix:

```diff
diff --git a/tests/mypy_test.py b/tests/mypy_test.py
index 087f41366..a0a66ed31 100644
--- a/tests/mypy_test.py
+++ b/tests/mypy_test.py
@@ -381,7 +381,17 @@ _DISTRIBUTION_TO_VENV_MAPPING: dict[str, VenvInfo] = {}
 
 def setup_venv_for_external_requirements_set(requirements_set: frozenset[str], tempdir: Path) -> tuple[frozenset[str], VenvInfo]:
     venv_dir = tempdir / f".venv-{hash(requirements_set)}"
-    return requirements_set, make_venv(venv_dir)
+    while True:
+        try:
+            venv_info = make_venv(venv_dir)
+        except OSError as e:
+            if e.errno != 26:
+                raise
+            else:
+                time.sleep(0.5)
+        else:
+            break
+    return requirements_set, venv_info
```
srittau commented 1 year ago

We could try switching to ProcessPoolExecutor and see if that helps.

AlexWaygood commented 1 year ago

> We could try switching to ProcessPoolExecutor and see if that helps.

Implemented in #9556. I'll hold off on trying to cache venvs between runs for a while, so that we can test this change in isolation and see whether the problem is actually fixed. Caching venvs between runs would be good to do anyway, though (and not just for this test -- stubtest_third_party would benefit as well), so I'll work on that after we've waited a while.
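For reference, the shape of the change is roughly the following (a simplified stand-in, not the actual diff in #9556). `concurrent.futures` gives `ThreadPoolExecutor` and `ProcessPoolExecutor` the same interface, so the `submit()`/`result()` code barely changes; the main extra constraint is that the submitted callable must be picklable, i.e. defined at module level:

```python
from __future__ import annotations

import venv
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def make_one_venv(requirements_set: frozenset[str], tempdir: Path) -> Path:
    # Must be a module-level function so ProcessPoolExecutor can pickle it.
    venv_dir = tempdir / f".venv-{hash(requirements_set)}"
    venv.create(venv_dir, with_pip=True, clear=True)
    return venv_dir


def make_all_venvs(requirements_sets: set[frozenset[str]], tempdir: Path) -> list[Path]:
    # Same submit()/result() shape as before; only the executor class changes,
    # so each venv is now created in its own worker *process* rather than a thread.
    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(make_one_venv, reqs, tempdir) for reqs in requirements_sets]
        return [future.result() for future in futures]
```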

@JelleZijlstra, could you possibly also see if you can still reproduce the race after #9556?

JelleZijlstra commented 1 year ago

Running it again now.

JelleZijlstra commented 1 year ago

I ran it for a few hours and saw no failures. I think the issue is fixed!

Maybe there really is a bug in the ThreadPoolExecutor.

JelleZijlstra commented 1 year ago

> The traceback indicates that two threads appear to be attempting to access the same Python executable at the same time.

More precisely, it indicates that we're trying to execute the Python executable while someone has the file open for writing. I looked at the code for venv and it's not clear how that is possible.
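For anyone unfamiliar with errno 26: on Linux, `ETXTBSY` ("Text file busy") is what `execve()` returns when the file being executed is currently open for writing in some process. A standalone demonstration, unrelated to the venv code, just to make the failure mode concrete:

```python
# Linux-specific demonstration of ETXTBSY (errno 26): exec-ing a binary while
# a file descriptor is still open on it for writing fails with "Text file busy".
import errno
import os
import shutil
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as td:
    exe = os.path.join(td, "python-copy")
    shutil.copy(sys.executable, exe)
    os.chmod(exe, 0o755)                      # make sure the copy is executable
    fd = os.open(exe, os.O_WRONLY)            # hold the executable open for writing
    try:
        subprocess.run([exe, "-c", "pass"])   # the exec itself fails before Python starts
    except OSError as e:
        assert e.errno == errno.ETXTBSY
        print(e)                              # OSError: [Errno 26] Text file busy: ...
    finally:
        os.close(fd)
```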

AlexWaygood commented 1 year ago

Thanks v much @JelleZijlstra for helping debug this, and thanks @srittau for suggesting the fix!

I was pretty pleased with how easy it was to swap in ProcessPoolExecutor here, I've got to say. I've had cause to use ThreadPoolExecutor before, but never ProcessPoolExecutor until now.