terrencepreilly / darglint

A python documentation linter which checks that the docstring description matches the definition.

Linting on functions with many arguments is unexpectedly slow #101

Closed kvalv closed 4 years ago

kvalv commented 4 years ago

Hi, I use darglint in a pre-commit hook. I noticed that the pre-commit run takes a long time when the staged files contain functions with many arguments.

It seems that darglint spends a surprisingly long time when there are many arguments. In my own benchmarks, a function with 3 arguments takes 0.1 seconds to process, whereas one with 14 arguments takes 4 seconds.

Here's an (ugly) code snippet that reproduces the behaviour.

import pathlib
import time
import subprocess

def compile_function(n, filename):
    """Write a module containing one function with n documented arguments."""
    params = "".join([f"a{i}, " for i in range(n)])
    args = "\n        ".join([f"a{i}: some_type" for i in range(n)])

    # The generated function has a Google-style docstring listing every
    # argument, and a trivial body.
    s = f"""
def f({params}):
    '''
    Args:
        {args}

    Returns:
        an int
    '''
    return 123"""

    pathlib.Path(filename).write_text(s)

#### Benchmark: lint a function with 1..24 arguments and time each darglint run.
filename = "/tmp/def.py"
for i in range(1, 25):
    compile_function(i, filename)
    t0 = time.time()
    subprocess.check_output(f"darglint {filename}", shell=True)
    t1 = time.time()

    print(f"Num arguments: {i:<2}     Time taken: {t1 - t0:.1f}")

And then running python <the-snippet>, I get the following output:

Num arguments: 1      Time taken: 0.1
Num arguments: 2      Time taken: 0.1
Num arguments: 3      Time taken: 0.1
Num arguments: 4      Time taken: 0.2
Num arguments: 5      Time taken: 0.3
Num arguments: 6      Time taken: 0.4
Num arguments: 7      Time taken: 0.7
Num arguments: 8      Time taken: 0.9
Num arguments: 9      Time taken: 1.2
Num arguments: 10     Time taken: 1.6
Num arguments: 11     Time taken: 2.0
Num arguments: 12     Time taken: 2.6
Num arguments: 13     Time taken: 3.2
Num arguments: 14     Time taken: 4.0
Num arguments: 15     Time taken: 4.8
Num arguments: 16     Time taken: 5.8
Num arguments: 17     Time taken: 6.9
Num arguments: 18     Time taken: 8.1
Num arguments: 19     Time taken: 9.5
Num arguments: 20     Time taken: 10.8
Num arguments: 21     Time taken: 12.6
Num arguments: 22     Time taken: 14.3
Num arguments: 23     Time taken: 16.1
Num arguments: 24     Time taken: 18.5

The generated function itself is super simple; the docstring just describes each argument as a some_type, and the function returns the number 123. I would expect this to run very fast (~0.1 seconds) for any number of arguments. However, with 20 arguments the processing time is a whopping 10 seconds.

Is this behaviour expected? For reference, I use version 1.4.1 of darglint.

terrencepreilly commented 4 years ago

Yes, this is expected. In order to capture different formats and correctly output error messages, darglint's grammar is ambiguous. Previously, darglint used an LR2 parser, which was fast, but would simply die when encountering unexpected formatting. It now uses the CYK algorithm, which is O(n^3 * |G|), where |G| is the size of the grammar. I've been able to make most cases run in an acceptable amount of time by reducing the size of the input (see the wiki.) However, arguments are always going to be pretty slow, because they can't be condensed further. Normally, that's not an issue, since having more than 7 arguments is a pretty strong code smell.
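For intuition, here is a minimal, textbook CYK recognizer (a sketch only, not darglint's grammar or code; the binary_rules/unary_rules dictionary format is invented for the example). The three nested loops over the input are where the cubic factor comes from, and since the token count grows roughly linearly with the argument count, the timings above track a cubic curve reasonably well: (24/14)^3 is about 5.0, versus the measured 18.5 s / 4.0 s of about 4.6.

def cyk_recognizes(tokens, binary_rules, unary_rules, start="S"):
    """Return True if tokens can be derived from start.

    binary_rules: {lhs: [(B, C), ...]} for rules lhs -> B C
    unary_rules:  {lhs: [terminal, ...]} for rules lhs -> terminal
    The grammar must be in Chomsky normal form.
    """
    n = len(tokens)
    if n == 0:
        return False
    # table[length - 1][i] holds the nonterminals deriving tokens[i:i + length].
    table = [[set() for _ in range(n)] for _ in range(n)]

    # Spans of length 1 come from the lexical rules.
    for i, tok in enumerate(tokens):
        for lhs, terminals in unary_rules.items():
            if tok in terminals:
                table[0][i].add(lhs)

    # Longer spans: every span length, every start position, every split point.
    # These three nested loops over the input give the O(n^3 * |G|) runtime.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[split - 1][i]
                right = table[length - split - 1][i + split]
                for lhs, pairs in binary_rules.items():
                    if any(b in left and c in right for b, c in pairs):
                        table[length - 1][i].add(lhs)

    return start in table[n - 1][0]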

The runtime could probably be reduced by using a different algorithm and just accepting the occasional blow-up. But CYK is easy to understand, and makes identifying certain errors simple. Another approach would be to thread more aggressively. But that could lead to increased runtimes in most cases. At some point, I'll probably just rewrite the CYK parsing step in Cython or C or Rust to make it faster. I've just avoided it because it requires compilation on the user side, and nobody has complained of problems with speed.

I'm going to close this, as this speed regression is known and expected, even if it is a problem. Feel free to leave more comments, or suggest alternatives.

kvalv commented 4 years ago

I took a second look at it some days ago, and the runtime seems to grow with the number of tokens (as you say), which would be the n. Each new line generates two tokens ('\n' and '<text>', the actual text). So if I have something like this:

def f(a, b):
    """blah blah

    Args:
        a: int
            short description
        b: float
            a very
            very
            very
            very
            very loooong
            very looooooong
            description
    """

The function only has 2 arguments, but because the b argument is described a lot, the number of tokens is still quite high.

An easy fix for this would be to merge the tokens that form <text> <newline> <text> into a single token, and then to repeatedly apply this merging to those tokens that can be easily merged. In that case, the b argument would be two tokens (I think): a long text and a newline.
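For what it's worth, a rough sketch of that merging pass, using a made-up (token_type, value) tuple representation rather than darglint's actual token class:

def merge_text_runs(tokens):
    # Collapse <text> <newline> <text> runs into a single text token.
    # (The token representation here is hypothetical.)
    merged = []
    for token_type, value in tokens:
        if (
            token_type == "text"
            and len(merged) >= 2
            and merged[-1][0] == "newline"
            and merged[-2][0] == "text"
        ):
            # Fold the pending newline and this text into the previous text token.
            _, prev_value = merged[-2]
            del merged[-2:]
            merged.append(("text", prev_value + "\n" + value))
        else:
            merged.append((token_type, value))
    return merged

A single pass handles arbitrarily long runs, since each new text token is folded into the text token before the newline. The merging would of course have to be limited to places where the newline carries no structural meaning (e.g. within one argument's description), so that blank lines and indentation between sections stay visible to the parser.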

I think darglint ran extremely slowly in my case because I had high-level functions with many optional arguments (~10), each described extensively. I think it's a valid use case, and the tensorflow library also has some functions with huge docstrings. darglint won't be able to parse those functions in a reasonable amount of time.

I would be interested in taking a look at that, if you think it's the appropriate solution and if it makes sense. :)

terrencepreilly commented 4 years ago

You could probably do that, though I think it would be difficult to make it work reliably. You'll have to condense these tokens after the sections are split off (somewhere after this line's execution, but probably in the google parser file).

If you're already using darglint, then your docstrings are probably mostly well formatted. You can take advantage of that to speed up execution time. That is, you could write an LR parser for argument sections, and add that to the parser combinator here. (You could probably even go back in darglint's history and pull the LR parser that was used previously.) That way, if the LR parser fails, darglint will move on to using the CYK parser.
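Something like the following, as an illustrative sketch (the names are placeholders, not darglint's actual parser-combinator API):

class ParserError(Exception):
    # Raised by a parser that cannot handle its input; darglint's real
    # failure signal may differ.
    pass


def with_fallback(fast_parser, general_parser):
    # fast_parser: a quick, restrictive parser (e.g. an LR parser for
    # well-formatted argument sections).
    # general_parser: the slower but tolerant parser (e.g. the CYK parser).
    def parse(tokens):
        try:
            return fast_parser(tokens)
        except ParserError:
            return general_parser(tokens)
    return parse

Well-formatted docstrings then take the fast path, and only the malformed ones pay for the CYK pass.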

kvalv commented 4 years ago

Interesting.

I could try your second suggestion, to bring back the previous LR parser and see if that solves my issues. I'll try to make a PR if I succeed. :)

s-weigand commented 3 years ago

Maybe a --fast-path flag with which users accept the "occasional blow-up" in favor of a performance increase could be a valid compromise? 🤔 Another benefit of this approach would be detecting mixed docstring styles or malformed docstrings.

matteosantama commented 3 years ago

A --fast-path option would be much appreciated from my end!