python / cpython

The Python programming language
https://www.python.org
Other
63.2k stars 30.26k forks source link

argparse.ArgumentParser is slow when parsing a very long option list #96859

Open alexeypa opened 2 years ago

alexeypa commented 2 years ago

Bug report

The following parser exhibits O(N^2) behavior when parsing a very long option list (e.g. --item=a --item=b ...):

parser = argparse.ArgumentParser(description="test")
parser.add_argument('--item', dest='accumulate', action='append')

Here is a simple repro that measures the time parse_args() take depending on the number of options:

#!/usr/bin/env python3
import argparse
from datetime import datetime

def run_test(num_iterations):
    """Parse a list with 'append' action and measure the time it takes."""
    parser = argparse.ArgumentParser(description="test")
    parser.add_argument('--item', dest='accumulate', action='append')

    start = datetime.now()

    args = parser.parse_args(['--item=x' for _ in range(num_iterations)])

    end = datetime.now()
    diff = end - start

    print(f"{num_iterations:8} iterations: {diff} s")

def main():
    for i in range(8, 18):
        run_test(pow(2, i))

if __name__ == "__main__":
    main()

The output on my machine:

> ./repro.py
     256 iterations: 0:00:00.002699 s
     512 iterations: 0:00:00.008444 s
    1024 iterations: 0:00:00.028508 s
    2048 iterations: 0:00:00.104813 s
    4096 iterations: 0:00:00.389232 s
    8192 iterations: 0:00:01.552962 s
   16384 iterations: 0:00:06.132139 s
   32768 iterations: 0:00:25.233547 s
   65536 iterations: 0:01:51.921170 s
  131072 iterations: 0:07:42.047914 s

Your environment

Linked PRs

alexeypa commented 2 years ago

Here is a simple repro that measures the time parse_args() take depending on the number of options

For comparison, here is how https://github.com/alexeypa/cpython/commit/548376ef8f8e0073a847c871aee040cbb097cc52 behaves with the same repro:

> ./repro.py
     256 iterations: 0:00:00.001061 s
     512 iterations: 0:00:00.002236 s
    1024 iterations: 0:00:00.005026 s
    2048 iterations: 0:00:00.013403 s
    4096 iterations: 0:00:00.040210 s
    8192 iterations: 0:00:00.133397 s
   16384 iterations: 0:00:00.479814 s
   32768 iterations: 0:00:01.837926 s
   65536 iterations: 0:00:07.891365 s
  131072 iterations: 0:00:34.732420 s

There is still non-linear growth caused by _AppendAction.__call__() cloning the whole list of parsed values (see _copy_items()) on every append but it is much faster already. The _copy_items() behavior can be adjusted on a separate commit.

Let me go through a submission checklist before creating a PR for this fix.

alexeypa commented 2 years ago

There is still non-linear growth caused by _AppendAction.call() cloning the whole list of parsed values (see _copy_items()) on every append but it is much faster already. The _copy_items() behavior can be adjusted on a separate commit.

I updated the actions code to copy lists only once on https://github.com/alexeypa/cpython/commit/856ca357c2523ce7486b4df002438d546aaf3f7c. This improved the situation significantly:

> ./repro.py                                 
     256 iterations: 0:00:00.001032 s
     512 iterations: 0:00:00.001732 s
    1024 iterations: 0:00:00.003265 s
    2048 iterations: 0:00:00.006883 s
    4096 iterations: 0:00:00.012096 s
    8192 iterations: 0:00:00.024929 s
   16384 iterations: 0:00:00.049330 s
   32768 iterations: 0:00:00.101852 s
   65536 iterations: 0:00:00.199928 s
  131072 iterations: 0:00:00.428487 s

I think it makes sense to lend this fix on a separate PR, since it addresses a different problem then https://github.com/python/cpython/pull/96904.

alexeypa commented 1 year ago

Hey, trying to bring some attention to https://github.com/python/cpython/pull/96904. The PR is still waiting for a core review. Is anyone available to take a look?

serhiy-storchaka commented 4 weeks ago

Sorry for the lack of response, @alexeypa. The argparse module does not have active maintainer and have a long list of old bugs, so it is unsurprisingly that a minor performance issue slipped past attention of core developers.

Current results of your script on my computer are close to your results in https://github.com/python/cpython/issues/96859#issuecomment-1248980992:

     256 iterations: 0:00:00.004519 s
     512 iterations: 0:00:00.007478 s
    1024 iterations: 0:00:00.017219 s
    2048 iterations: 0:00:00.035217 s
    4096 iterations: 0:00:00.076842 s
    8192 iterations: 0:00:00.194647 s
   16384 iterations: 0:00:00.566328 s
   32768 iterations: 0:00:01.830162 s
   65536 iterations: 0:00:06.369726 s
  131072 iterations: 0:00:24.533077 s

The performance was improved in 4a630980328b67f0aba6a04c3950aa9dbd532895 (gh-116159).

But your latter results (https://github.com/python/cpython/issues/96859#issuecomment-1250538230) are amazing. If you still interesting in this, could you please create a new PR? I can't promise that it will be accepted (it may either be incompatible with the overall design of argparse or have other fatal flaws), but it will be considered.

If you have no interest, I'll close this issue.

alexeypa commented 3 weeks ago

If you still interesting in this, could you please create a new PR?

Well, it is easy enough to try: https://github.com/python/cpython/pull/124740. This PR covers only the cloning behavior described on https://github.com/python/cpython/issues/96859#issuecomment-1250538230.

serhiy-storchaka commented 3 weeks ago

Oh, so the only issue that left is copying the target list in append and extend actions? Then I think simpler changes will help. See #124909.

alexeypa commented 3 weeks ago

Oh, so the only issue that left is copying the target list in append and extend actions? Then I think simpler changes will help. See #124909.

Well, it was the bulk of it. 548376ef was also removing max(option_string_indices) and making the core loop a bit more straightforward for loop over a list instead of a while loop with multiple branches updating start_index. It looks like argparse is hyper-sensitive to potential regressions so this change was ultimately rejected.

I posted a comment on your PR.