python / cpython

The Python programming language
https://www.python.org
Other
63.22k stars 30.27k forks source link

subprocess.Popen() hangs: parent process hangs at os.read(errpipe_read) and child hangs at _posixsubprocess.fork_exec() #120998

Open EricLin89 opened 4 months ago

EricLin89 commented 4 months ago

Bug report

Bug description:

I have a python program (let's call it my_prog.py) that starts multiple threads of periodic tasks in one process. Some periodic tasks involvs writing to a file and calls "setfacl" command after the file is written.

I recently observed one of the threads hangs at subprocess.Popen() when trying to calling setfacl command. I managed to print out the stacktrace of all the threads by using py-spy dump. Here are the detailed infos:

  1. There are 2 my_prog.py processes exists, they are parent and child processes, while normally there should only be 1.
  2. I dumped out the 2 processes' stacktrace, it looks like (The python version is 3.9.9):
    
    parent process:

Thread 44167 (idle): "Thread-get-lldp-ip"

_execute_child (subprocess.py:1777)

__init__ (subprocess.py:951)

... some irrelevant stacktraces...

run (xxx.py: xxx)

_bootstrap_inner (threading.py:980)

_bootstrap (threading.py:937)

child process:

Thread 89047 (idle): "Thread-get-lldp-ip"

_execute_child (subprocess.py:1754)

__init__ (subprocess.py:951)

... some irrelevant stacktraces...

run (xxx.py: xxx)

_bootstrap_inner (threading.py:980)

_bootstrap (threading.py:937)
so the parent process is waiting to read from errpipe_read as https://github.com/python/cpython/blob/ccb0e6a3452335a4c3e2433934c3c0c5622a34cd/Lib/subprocess.py#L1777
the child process however got stuck at _posixsubprocess.fork_exec() as https://github.com/python/cpython/blob/ccb0e6a3452335a4c3e2433934c3c0c5622a34cd/Lib/subprocess.py#L1754

3. looking into _posixsubprocess.c,the fork_exec() function does some prepation work and then forks a child process and calls execv() to invoke the actual command called. So I assume the child process is forked in the above procedure but somehow stucks before execv() as the child's process name is still the parent's process name rather than the actual command called.

4. some other infos I dont know if it's related to the problem: 
    a) The stuck subprocess.Popen() call is inside a thread lock, and other threads got blocked acquiring this lock since subprocess.Popen() is stuck. I've read some posts on lock inheritance or what, but I don't know if it's related.
    b) When the subprocess is forked, there's another thread calling subprocess and is waiting on communicate(), as seen in child process's stacktrace:

Thread 0x14F2A5F37640 (idle): "ComponentManager"

select (selectors.py:416)

_communicate (subprocess.py:1979)

communicate (subprocess.py:1134)

... some irrelevant stacktraces...

_bootstrap_inner (threading.py:980)

_bootstrap (threading.py:937)

while in parent process, this "ComponentManager" thread has continued running.

This issue only appeared once, and I didn't manage to reproduce after that.

Based on the above infos, could someone give some clues on what may be the cause? Thanks.

### CPython versions tested on:

3.9

### Operating systems tested on:

Linux
shmuelzon commented 2 months ago

Out of curiosity, are you using preexec_fn in your Popen() call? Also, do you have threads in your application?

EricLin89 commented 2 months ago

Out of curiosity, are you using preexec_fn in your Popen() call? Also, do you have threads in your application?

  1. Yes, preexec_fn is used, the function passed to preexec_fn is as follows:
def _switch_user(self):
    if self.group_id:
        os.setgid(self.group_id)
    if self.user_id:
        os.setuid(self.user_id)

which allows the caller to switch to a specific user when calling the command. In this case, the user_id and group_id are not specified, so preexec_fn is passed but didn't actually do anything.

  1. Yes, the application is a multi threaded program and the hanging just happened in one of the threads that runs a periodic task.

Do you know any known issues related to use cases when using with preexec_fn and threads? @shmuelzon

shmuelzon commented 2 months ago

I ran into a similar issue myself, found this ticket without any resolution and started digging deeper.

Do you know any known issues related to use cases when using with preexec_fn and threads?

Apparently, yes. And it's stated in many places, had I bothered to RTFM. See the Popen documentation:

Warning The preexec_fn parameter is NOT SAFE to use in the presence of threads in your application. The child process could deadlock before exec is called.

There are also subtle hints in the Python code which basically say "using preexec_fn will cause a deadlock but you asked for it". This Stack Overflow answer explains pretty well why this is a problem.

EricLin89 commented 2 months ago

I ran into a similar issue myself, found this ticket without any resolution and started digging deeper.

Do you know any known issues related to use cases when using with preexec_fn and threads?

Apparently, yes. And it's stated in many places, had I bothered to RTFM. See the Popen documentation:

Warning The preexec_fn parameter is NOT SAFE to use in the presence of threads in your application. The child process could deadlock before exec is called.

There are also subtle hints in the Python code which basically say "using preexec_fn will cause a deadlock but you asked for it". This Stack Overflow answer explains pretty well why this is a problem.

Ah... Thanks. I tried to look for known deadlock issues all over the internet but failed to notice that obvious warning on the official documentation... I think I would fix the problem by removing the use of preexec_fn... Again, thank you so much. @shmuelzon

wouterdb commented 4 days ago

I have code calling into pip and it sometimes gets stuck like this . I don't use preexec_fn.

I am using asyncio.subprocess_shell , but this code path is shared between sync and async code.


self = <Popen: returncode: None args: '/tmp/pytest-of-jenkins/pytest-30443/test_exe...>
args = ['/bin/sh', '-c', '/tmp/pytest-of-jenkins/pytest-30443/test_executor_creation_and_ven0/executors/venv/dac033c09055cd63...ython -m pip install pkg2 inmanta-core==15.0.0.dev0 --index-url /tmp/pytest-of-jenkins/pytest-30443/pip_index0/simple']
executable = b'/bin/sh', preexec_fn = None, close_fds = True, pass_fds = ()
cwd = '/'
env = {'ASYNC_TEST_TIMEOUT': '60', ...}
startupinfo = None, creationflags = 0, shell = True, p2cread = -1, p2cwrite = -1
c2pread = 27, c2pwrite = 106, errread = -1, errwrite = 106
restore_signals = True, gid = None, gids = None, uid = None, umask = -1
start_new_session = False, process_group = -1

    def _execute_child(self, args, executable, preexec_fn, close_fds,
                       pass_fds, cwd, env,
                       startupinfo, creationflags, shell,
                       p2cread, p2cwrite,
                       c2pread, c2pwrite,
                       errread, errwrite,
                       restore_signals,
                       gid, gids, uid, umask,
                       start_new_session, process_group):
        """Execute program (POSIX version)"""

        if isinstance(args, (str, bytes)):
            args = [args]
        elif isinstance(args, os.PathLike):
            if shell:
                raise TypeError('path-like args is not allowed when '
                                'shell is true')
            args = [args]
        else:
            args = list(args)

        if shell:
            # On Android the default shell is at '/system/bin/sh'.
            unix_shell = ('/system/bin/sh' if
                      hasattr(sys, 'getandroidapilevel') else '/bin/sh')
            args = [unix_shell, "-c"] + args
            if executable:
                args[0] = executable

        if executable is None:
            executable = args[0]

        sys.audit("subprocess.Popen", executable, args, cwd, env)

        if (_USE_POSIX_SPAWN
                and os.path.dirname(executable)
                and preexec_fn is None
                and not close_fds
                and not pass_fds
                and cwd is None
                and (p2cread == -1 or p2cread > 2)
                and (c2pwrite == -1 or c2pwrite > 2)
                and (errwrite == -1 or errwrite > 2)
                and not start_new_session
                and process_group == -1
                and gid is None
                and gids is None
                and uid is None
                and umask < 0):
            self._posix_spawn(args, executable, env, restore_signals,
                              p2cread, p2cwrite,
                              c2pread, c2pwrite,
                              errread, errwrite)
            return

        orig_executable = executable

        # For transferring possible exec failure from child to parent.
        # Data format: "exception name:hex errno:description"
        # Pickle is not used; it is complex and involves memory allocation.
        errpipe_read, errpipe_write = os.pipe()
        # errpipe_write must not be in the standard io 0, 1, or 2 fd range.
        low_fds_to_close = []
        while errpipe_write < 3:
            low_fds_to_close.append(errpipe_write)
            errpipe_write = os.dup(errpipe_write)
        for low_fd in low_fds_to_close:
            os.close(low_fd)
        try:
            try:
                # We must avoid complex work that could involve
                # malloc or free in the child process to avoid
                # potential deadlocks, thus we do all this here.
                # and pass it to fork_exec()

                if env is not None:
                    env_list = []
                    for k, v in env.items():
                        k = os.fsencode(k)
                        if b'=' in k:
                            raise ValueError("illegal environment variable name")
                        env_list.append(k + b'=' + os.fsencode(v))
                else:
                    env_list = None  # Use execv instead of execve.
                executable = os.fsencode(executable)
                if os.path.dirname(executable):
                    executable_list = (executable,)
                else:
                    # This matches the behavior of os._execvpe().
                    executable_list = tuple(
                        os.path.join(os.fsencode(dir), executable)
                        for dir in os.get_exec_path(env))
                fds_to_keep = set(pass_fds)
                fds_to_keep.add(errpipe_write)
                self.pid = _fork_exec(
                        args, executable_list,
                        close_fds, tuple(sorted(map(int, fds_to_keep))),
                        cwd, env_list,
                        p2cread, p2cwrite, c2pread, c2pwrite,
                        errread, errwrite,
                        errpipe_read, errpipe_write,
                        restore_signals, start_new_session,
                        process_group, gid, gids, uid, umask,
                        preexec_fn, _USE_VFORK)
                self._child_created = True
            finally:
                # be sure the FD is closed no matter what
                os.close(errpipe_write)

            self._close_pipe_fds(p2cread, p2cwrite,
                                 c2pread, c2pwrite,
                                 errread, errwrite)

            # Wait for exec to fail or succeed; possibly raising an
            # exception (limited in size)
            errpipe_data = bytearray()
            while True:
>               part = os.read(errpipe_read, 50000)
E               KeyboardInterrupt