Subprocess suffers 40% process creation overhead penalty

2dbb74b2-df2b-4dba-a2ed-a03b8f933fd0 commented 13 years ago

BPO	11314
Nosy	@gpshead, @pitrou, @vstinner, @giampaolo
Files	subprocess_errpipe_buffer.patch bench_subprocess.py

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = None closed_at = created_at = labels = ['library', 'performance'] title = 'Subprocess suffers 40% process creation overhead penalty' updated_at = user = 'https://bugs.python.org/AaronSherman' ``` bugs.python.org fields: ```python activity = actor = 'pitrou' assignee = 'none' closed = True closed_date = closer = 'vstinner' components = ['Library (Lib)'] creation = creator = 'Aaron.Sherman' dependencies = [] files = ['20975', '20976'] hgrepos = [] issue_num = 11314 keywords = ['patch'] message_count = 24.0 messages = ['129319', '129320', '129321', '129333', '129350', '129377', '129379', '129397', '129450', '129633', '129878', '129879', '129880', '129881', '129883', '129884', '129885', '129887', '129888', '129889', '129890', '129893', '129919', '129920'] nosy_count = 6.0 nosy_names = ['gregory.p.smith', 'pitrou', 'vstinner', 'giampaolo.rodola', 'neologix', 'Aaron.Sherman'] pr_nums = [] priority = 'normal' resolution = 'fixed' stage = None status = 'closed' superseder = None type = 'resource usage' url = 'https://bugs.python.org/issue11314' versions = ['Python 2.6', 'Python 2.7'] ```

2dbb74b2-df2b-4dba-a2ed-a03b8f933fd0 commented 13 years ago

I wrote some code a while back which used os.popen. I recently got a warning about popen being deprecated so I tried a test with the new subprocess module. In that test, subprocess.Popen appears to have a 40% process creation overhead penalty over os.popen, which really isn't small. It seems that the difference may be made up of some heavy mmap-related work that's happening in my version of python, and that might be highly platform specific, but the mmap/mremap/munmap calls being made in my sample subprocess code aren't being made at all in the os.popen equivalent.

Now, before someone says, "process creation is trivial, so a 40% hit is acceptable because it's 40% of a trivial part of your execution time," keep in mind that many Python applications are heavily process-creation focused. In my case that means monitoring, but I could also imagine this having a substantial impact on Web services and other applications that spend almost all of their time creating child processes. For a trivial script, subprocess is fine as is, but for these demanding applications, subprocess represents a significant source of pain.

Anyway my testing, results and conclusions are written up here:

http://essays.ajs.com/2011/02/python-subprocess-vs-ospopen-overhead.html

vstinner commented 13 years ago

Python 3.2 has a _posixsubprocess: some parts of subprocess are implemented in C. Can you try it?

Python 3.2 uses also pipe2(), if available, to avoid the extra fcntl(4, F_GETFD)+fcntl(4, F_SETFD, FD_CLOEXEC).

I suppose that the pipe and mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) is used by subprocess to transfer a Python exception from the child process. Python 2.7 encodes the exception value, type and traceback using pickle, and the parent process calls os.read(1048576). Python 3.2 only encodes the exception value and type using a simple string, the parent process uses bytearray with chunks of 50,000 bytes (stop when the total size is bigger than 50,000 bytes). So I suppose that Python 3.2 allocates less memory in the parent process to read the child exception (if any).

You may also try to change Popen buffer size, but it should not change anything if you test "exit 0".

pitrou commented 13 years ago

I think your analysis is wrong. These mmap() calls are for anonymous memory, most likely they are emitted by the libc's malloc() to get some memory from the kernel. In other words they will be blazingly fast.

I would suggest you try to dig deeper. For example, how much CPU time does the parent process take (excluding its children).

Of course, I also disagree with the idea that spawning "exit 0" subprocesses is a performance critical operation ;) Therefore, it would be useful to know the absolute overhead difference (in milliseconds) between subprocess and os.popen(), to decide if there is really a problem.

2dbb74b2-df2b-4dba-a2ed-a03b8f933fd0 commented 13 years ago

"Python 3.2 has a _posixsubprocess: some parts of subprocess are implemented in C. Can you try it?"

I don't have a Python 3 installation handy, but I can see what I can do tomorrow evening to get one set up and try it out.

"disagree with the idea that spawning "exit 0" subprocesses is a performance critical operation ;)"

How else would you performance test process creation overhead? By introducing as little additional overhead as possible, it's possible for me to get fairly close to measuring just the subprocess module's overhead.

If you stop to think about it, though, this is actually a shockingly huge percent increase. In any process creation scenario I'm familiar with, its overhead should be so small that you could bump it up several orders of magnitude and still not compete with executing a shell and asking it to do anything, even just exit.

And yet, here we are. 40%

I understand that most applications won't be running massive numbers of external commands in parallel, and that's the only way this overhead will really matter (at least that I can think of). But in the two scenarios I mentioned (monitoring and Web services such as CGI, neither of which is particularly rare), this is going to make quite a lot of difference, and if you're going to deprecate os.popen, I would think that making sure your proposed replacement was at least nearly as performant would be standard procedure, no?

"I think your analysis is wrong. These mmap() calls are for anonymous memory, most likely they are emitted by the libc's malloc() to get some memory from the kernel. In other words they will be blazingly fast."

The mremap might be a bit of a performance hit, but it's true that these calls should not be substantially slowing execution... then again, they might indicate that there's substantial amounts of work being done for which memory allocation is required, and as such may simply be a symptom of the actual problem.

79528080-9d85-4d18-8a2a-8b1f07640dd7 commented 13 years ago

If you stop to think about it, though, this is actually a shockingly huge percent increase. In any process creation scenario I'm familiar with, its overhead should be so small that you could bump it up several orders of magnitude and still not compete with executing a shell and asking it to do anything, even just exit.

os.popen just calls the popen(3) library call, which just performs a fork/execve and some dup/close in between. subprocess.Popen is implemented in Python, so it doesn't come as a surprise that it's slower in your example. But I think there's a point you're missing: you're not just creating a child process, but a shell. subprocess is actually faster than popen to spawn executables (not scripts), because you don't have to spawn a shell.

For example:

$ cat /tmp/test.c 
int main(int argc, char *argv[])
{
        return 0;
}

$ cat /tmp/test_subprocess.py 
import subprocess

for i in range(10000):
    f = subprocess.Popen('/tmp/test')
    f.wait()

$ cat /tmp/test_popen.py 
import os

for i in range(10000):
    f = os.popen('/tmp/test')
    f.close()

$ time ./python /tmp/test_subprocess.py

real 0m13.933s user 0m3.083s sys 0m12.441s

$ time ./python /tmp/test_popen.py

real 0m18.235s user 0m4.293s sys 0m15.176s

Given of important the subprocess overhead seems to you, I guess that the processes you're launching are not shell scripts, and thus you're probably better off using subprocess.

pitrou commented 13 years ago

But in the two scenarios I mentioned (monitoring and Web services such as CGI, neither of which is particularly rare), this is going to make quite a lot of difference

That's why I asked for absolute numbers for the overhead difference. A percentage doesn't tell me if launching a process costs 50µs more or 500ms more.

The only question is whether process launch overhead is significant compared to total process runtime (for non-trivial work, that is, not "echo 0" ;-)).

pitrou commented 13 years ago

Okay, thanks to Charles-François' measurements, we can deduce that each subprocess launch is at most 0.3ms of user CPU time and 1.2ms of system CPU time. IMO that's not a real problem.

gpshead commented 13 years ago

fyi - while I still need to update it now that 3.2 final has been released, the http://code.google.com/p/python-subprocess32/ project has a backport of the _posixsubprocess stuff from python 3.2 for use on Python 2.x.

2dbb74b2-df2b-4dba-a2ed-a03b8f933fd0 commented 13 years ago

"That's why I asked for absolute numbers for the overhead difference."

Did you not follow the link in my first post? I got pretty detailed, there.

"os.popen just calls the popen(3) library call, which just performs a fork/execve and some dup/close in between. subprocess.Popen is implemented in Python, so it doesn't come as a surprise that it's slower in your example."

Well, of course. I don't think I was ever trying to claim that os.popen vs. subprocess without a shell was going to compare favorably. I'm not advocating os.popen, here, I'm just trying to figure out where this massive overhead is coming from. I think the answer is just, "pure Python is fundamentally slower, and that's not a surprise."

Now, if the 3.x subprocess work that was described here, gets back-ported into 2.x and is included with future releases, that will definitely serve to improve the situation, and might well render much of this moot (testing will tell).

However, I do think that doing the performance testing before deprecating the previous interface would have been a good idea...

79528080-9d85-4d18-8a2a-8b1f07640dd7 commented 13 years ago

Aaron Sherman \ajs@ajs.com\ added the comment:

"That's why I asked for absolute numbers for the overhead difference."

Did you not follow the link in my first post? I got pretty detailed, there.

By the way, strace has a '-T' flag that can be used to measure the time spent in syscalls (but for a mmap/mremap/munmap it won't be significant).

However, I do think that doing the performance testing before deprecating the previous interface would have been a good idea...

Once again: subprocess.Popen is actually faster than os.pipe to create a subprocess. Just don't pass shell=True. Unless you're spawning shell scripts, but in that case subprocess' overhead is definitely not biggest problem.

vstinner commented 13 years ago

subprocess_errpipe_buffer.patch: Patch for 2.7 which improve by 30% the dummy benchmark (run /bin/false 10,000 times). It avoids the creation of the 1 MB buffer: start with a buffer of a single byte (just to check if there is data or if the pipe was closed).

pitrou commented 13 years ago

subprocess_errpipe_buffer.patch: Patch for 2.7 which improve by 30% the dummy benchmark (run /bin/false 10,000 times). It avoids the creation of the 1 MB buffer: start with a buffer of a single byte (just to check if there is data or if the pipe was closed).

Does it improve on non-dummy benchmarks? Otherwise I don't think it's worth the trouble.

vstinner commented 13 years ago

Quick benchmark:

fork(), execv(), waitpid(): 19081.1 ms
os.popen(): 25769.8 ms
subprocess.Popen(): 40025.8 ms
subprocess.Popen() patched: 26155.2 ms

I tested Python 2.7 (debug build) on Linux (Debian Sid) with a Pentium 4 (Hyperthreading, 2 fake cores), 2 GB of memory.

Without the patch, subprocess is 55% slower than popen. With the patch, subprocess is 1% slower than popen. It is not a good news: subprocess should be faster before my benchmark tries Popen() without stdout=PIPE ;-)

pitrou> I think your analysis is wrong. These mmap() calls are for pitrou> anonymous memory, most likely they are emitted by the libc's pitrou> malloc() to get some memory from the kernel. In other words pitrou> they will be blazingly fast.

Are you sure? :-)

Note that popen() is 35% slower than fork()+execv()+waitpid(). Patched subprocess is 37% slower than fork()+execv()+waitpid().

Try attached bench_subprocess.py script to get your own results.

pitrou commented 13 years ago

Victor, your patch doesn't even apply on 3.x. That code doesn't exist anymore...

pitrou commented 13 years ago

Benchmark under 3.x (without obsolete patch):

fork + execv + waitpid: 4794.4 ms
os.popen: 19792.9 ms
subprocess.popen: 10152.1 ms

Benchmark under 2.x (without patch:)

fork + execv + waitpid: 4292.7 ms
os.popen: 12697.6 ms
subprocess.popen: 5496.3 ms

Looks like there's a regression on both os.popen and subprocess.popen.

pitrou commented 13 years ago

Here is a quick profile under 3.x:

Time: 10178.0 ms 320812 function calls (320809 primitive calls) in 10.182 seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function) 10000 5.937 0.001 5.937 0.001 {built-in method read} 10000 1.462 0.000 1.462 0.000 {built-in method waitpid} 10000 1.042 0.000 1.042 0.000 {built-in method fork_exec} 10000 0.752 0.000 8.131 0.001 subprocess.py:1119(_execute_child) 1 0.228 0.228 10.182 10.182 bench_subprocess.py:1(\<module>) 10000 0.177 0.000 8.345 0.001 subprocess.py:631(init) 20000 0.079 0.000 0.079 0.000 {built-in method close} 10000 0.061 0.000 1.584 0.000 subprocess.py:1368(wait)

So it looks like our parent Python process spends its time (read(), waitpid()) waiting for its child...

pitrou commented 13 years ago

Interestingly, reducing from the max open file descriptors from 8192 (my default) to 512 halves the runtime and solves the regression:

$ ulimit -n
512
$ ./python bench_subprocess.py 
pid: 31631
Time: 4903.8 ms

So, even though implemented in C, the file descriptor closing logic is still quite costly!

vstinner commented 13 years ago

pitrou> Victor, your patch doesn't even apply on 3.x. pitrou> That code doesn't exist anymore...

subprocess.Popen() does still read errpipe_read, but using a buffer of 50,000 bytes instead of 1 MB (the traceback is not more send to the parent process).

Benchmark on Python 3.2 (debug build, same computer than msg129880):

fork + execv + waitpid: 20052.0 ms
os.popen: 40241.7 ms
subprocess.Popen (C): 28467.2 ms
subprocess.Popen (C, close_fds=False): 22145.4 ms
subprocess.Popen (Python): 40351.5 ms

Bad:

os.popen is 41% is slower than subprocess: I suppose that it is the usage of stdout=PIPE (creation of the pipe) which make it slower. But 41% is huge just to create a pipe (without writing into it)!
subprocess(close_fds=True) (default) is 22% slower than subprocess(close_fds=False)
os.popen of Python 3 is 56% slower than os.popen of Python 2

Good:

subprocess of Python 3 is 29% faster than subprocess of Python 2.

Other results:

subprocess of Python 3 is 9% slower than patched subprocess of Python 2.
subprocess (default options) is 42% slower than fork+execv+waitpid (this is close to the Python 2 overhead).
subprocess implemented in Python is 42% slower than the C implementation of subprocess.

pitrou> Looks like there's a regression on both os.popen and subprocess.popen.

os.popen() uses subprocess in Python 3. The worst regression is "os.popen of Python 3 is 56% slower than os.popen of Python 2". I don't think that it is related to Unicode because my benchmark doesn't write or read any data.

vstinner commented 13 years ago

os.popen is 41% is slower than subprocess: I suppose that it is the usage of stdout=PIPE (creation of the pipe) which make it slower

Oh no, it's because os.popen() calls subprocess.Popen() with shell=True: the overhead is the shell. Nothing wrong here.

79528080-9d85-4d18-8a2a-8b1f07640dd7 commented 13 years ago

pitrou> I think your analysis is wrong. These mmap() calls are for pitrou> anonymous memory, most likely they are emitted by the libc's pitrou> malloc() to get some memory from the kernel. In other words pitrou> they will be blazingly fast.

Are you sure? :-)

Well, it is fast. It's true that mmap is slower than brk, since the kernel zero-fills the pages, but the overhead probably doesn't come from this, but more likely from pymalloc or malloc, and also from the call to _PyBytes_resize in posix_read when the number of bytes read is smaller than what has been requested.

79528080-9d85-4d18-8a2a-8b1f07640dd7 commented 13 years ago

So, even though implemented in C, the file descriptor closing logic is still quite costly!

Yes, see this recent issue: http://bugs.python.org/issue11284

In the reporter's case, it's much worse, because FreeBSD (at least the version he's using) has a SC_OPEN_MAX of 655000, so that passing close_fds=True bumps the Popen runtime to 3 seconds! Some Unix offer a closefrom() or similar function to close all open file descriptors (and only open ones), which should be much faster. I'm not aware of anything equivalent under Linux, but it might be worth looking into.

vstinner commented 13 years ago

Benchmark on subprocess with a less trivial example. Run 100x python -c pass: 8.63 sec without my patch, 8.53 sec with my patch => only 1% faster, so the patch is just useless on a real world example.

Finally, I think that there is just nothing to do on Python 2: the overhead between fork(), os.popen and subprocess doesn't impact real world programs. On Python 3: the most critical issue is that close_fds=True is much slower than close_fds=False, but there is already an issue for that (bpo-11284).

So let's close this issue. You can still comment it if you disagree. But if you would like to reopen it: please come with numbers of a benchmark on real programs (not on "exit 0" with a shell or /bin/false).

2dbb74b2-df2b-4dba-a2ed-a03b8f933fd0 commented 13 years ago

I think it's still safe to say that high performance applications which need to create many hundreds or thousands of children (e.g. large monitoring systems) will still need another solution that isn't subprocess. That being said, you're right that no one is going to care about the extra overhead of subprocess in a vacuum, and most applications fork one or a very small number of processes at a time and typically end up waiting for orders of magnitude more time for their output than they spend creating the process in the first place.

As I said when I opened this issue, I wasn't terribly concerned with most applications.

That being said, I can't really submit a full-blown monitoring system against this bug, so the best I could do would something that did lots of os.popens or subprocess.Popens in parallel in a contrived way and see how the performance scales as I tune "lots" to different values. Sadly, the time I have for doing that needs to be spent writing other code, so I'll leave this closed and let someone else raise the issue in the future if they run into it.

I can always build a dispatcher in C and communicate with it via IPC to get around the immediate concern of scalability.

pitrou commented 13 years ago

That being said, I can't really submit a full-blown monitoring system against this bug,

Perhaps you can take a look at Shinken, it is a Nagios-lookalike written in Python.

python / cpython

Subprocess suffers 40% process creation overhead penalty #55523