terrencepreilly / darglint

A python documentation linter which checks that the docstring description matches the definition.

Performance issues in `google` and `numpy` style parsers #186

Open webknjaz opened 2 years ago

webknjaz commented 2 years ago

I was integrating the WPS linter bundle of curated flake8 linters, which happens to depend on darglint. With the initial integration attempt, I noticed something unexpected: flake8 took about 11 minutes on my laptop to complete (longer in slower envs). This isn't exactly responsive, is it? But I'd also never seen such slow invocations with similar setups in the past. So what was going on, I wondered? I first narrowed it down to a "slowly linted" folder and then to specific files. The slowest one took almost 10 minutes to lint. Yes, one file! It was https://github.com/ansible/ansible-navigator/blob/295b152/src/ansible_navigator/runner/base.py. It's not particularly big (just 215 LOC).

Since WPS brings a lot of linters, I decided not to check each one; instead, I used py-spy to profile the flake8 invocation on this single file. This tool produces SVG flame charts. Looking at the produced visualization, I noticed that darglint was more prominent than anything else. I quickly confirmed my suspicion by doing pip uninstall darglint and seeing that flake8 then completed the linting process almost instantly. Later, I also confirmed that invoking darglint directly had the same performance penalty, so the problem is not in the integration with flake8 but within darglint itself.

With that in mind, I kept digging in hopes of finding a solution to my problem. I ran py-spy on a raw darglint invocation, checked the chart again, and to my surprise, most of the CPU time was consumed by google.py. I didn't expect this because I usually use the Sphinx style for structured docstrings; apparently, I had forgotten to add a config for darglint, and it uses the Google style by default. It didn't take long to check that switching to sphinx fixed the problem. That was the solution for me.

To file this bug, I also checked whether numpy is affected. It is, though slightly less slow. Here are the timings for linting the file I linked above:

$ for style in sphinx numpy google; do echo Checking $style docstrings...; time darglint -s $style src/ansible_navigator/runner/base.py &>/dev/null; done
Checking sphinx docstrings...
darglint -s $style  &>   0.13s user 0.01s system 99% cpu 0.136 total
Checking numpy docstrings...
darglint -s $style  &>   444.56s user 0.77s system 99% cpu 7:29.18 total
Checking google docstrings...
darglint -s $style  &>   566.80s user 1.03s system 98% cpu 9:34.21 total

As you can see, google is ~4360x slower than sphinx, and numpy is ~3420x slower. I haven't looked too deep, but it seems the bottleneck is in darglint/parse/cyk.py. I've also generated flame charts for each of the styles, using the same linted file, with these commands:

$ for style in sphinx numpy google; do echo Checking $style docstrings...; py-spy record -o py-spy-darglint-$style-style-flame.svg -t -g -i -n -s -- darglint -s $style src/ansible_navigator/runner/base.py &>/dev/null; done
Checking sphinx docstrings...
Checking numpy docstrings...
Checking google docstrings...

Download py-spy-darglint-flames.tar.gz — SVG flame charts

twoertwein commented 2 years ago

I haven't looked too deep but it sounds like the bottleneck is in darglint/parse/cyk.py.

If cyk.py (and ideally all its imports) is sufficiently type-annotated, it might be worth using mypyc to compile this file to a Python C extension.
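
A minimal sketch of what that could look like, assuming a conventional setup.py build (the module list here is purely illustrative):

```python
# setup.py -- hypothetical sketch: compile the parser hot spot with mypyc.
# mypyc ships with mypy (pip install mypy) and requires the listed modules
# to be fully type-annotated.
from setuptools import setup
from mypyc.build import mypycify

setup(
    name="darglint",
    ext_modules=mypycify([
        "darglint/parse/cyk.py",  # illustrative target module
    ]),
)
```

Running pip install . would then build cyk.py as a C extension that gets imported in place of the pure-Python module.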

paw-lu commented 2 years ago

I'm seeing drastic performance penalties when colons (:) are used in Google-style docstrings: https://github.com/terrencepreilly/darglint/issues/198

@webknjaz, I see that in your example you also have extra colons in your docstring.

webknjaz commented 2 years ago

Wow, good catch! I would've never guessed... Good thing that I didn't intend to use non-native sphinx syntax — it was an unintended configuration :)

cjolowicz commented 2 years ago

Using py-spy, I could confirm cyk.py as the bottleneck since at least v1.2.3 (2020-04-18). But the performance issue has probably existed ever since darglint switched to CYK in 1.0.0-alpha (2019-09-28):

  • Changed the recursive descent parser to a CYK parser. This was a significant change. It makes darglint much more flexible and extensible. However, it also introduces a significant performance regression. For that reason, this will be released first as an alpha, while I figure out how to handle the performance regression (or I determine whether or not it's even all that important of a problem.)

darglint implements the CYK algorithm to parse Google and numpy docstrings (edit: and also Sphinx docstrings) from a CNF grammar (generated from a BNF source). numpy docstring support was only implemented after the switch to CYK. So this cannot be solved by simply reverting to recursive descent.

AFAIU there are other parsing algorithms for CNF grammars with better average time complexity. Another approach would be to try to pinpoint and solve this in the grammar(s). The grammar is quite large, though.

cjolowicz commented 2 years ago

It seems to me that colons slow down parsing not because of some problematic rule in the grammar. Rather, they result in a larger number of tokens being fed to the parser (instead of just a few tokens with a lot of words inside). CYK runtime is cubic in the number of input tokens, so this has a pronounced effect on performance.
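
For reference, here is a minimal textbook CYK recognizer (a generic sketch, not darglint's implementation); the triple loop over span lengths, start positions, and split points is what makes the runtime cubic in the token count:

```python
# Minimal CYK recognizer for a grammar in Chomsky normal form (CNF).
# A generic textbook illustration, not darglint's actual parser.

def cyk_recognize(tokens, unary_rules, binary_rules, start="S"):
    """unary_rules: [(lhs, terminal)]; binary_rules: [(lhs, (B, C))]."""
    n = len(tokens)
    # table[i][l] = nonterminals deriving the span of length l+1 at position i
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][0] = {lhs for lhs, rhs in unary_rules if rhs == tok}
    for length in range(2, n + 1):              # O(n) span lengths
        for i in range(n - length + 1):         # O(n) start positions
            for split in range(1, length):      # O(n) split points
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for lhs, (b, c) in binary_rules:  # O(|G|) grammar rules
                    if b in left and c in right:
                        table[i][length - 1].add(lhs)
    return start in table[0][n - 1]

# Toy CNF grammar: S -> A B, A -> "a", B -> "b"
print(cyk_recognize(["a", "b"],
                    unary_rules=[("A", "a"), ("B", "b")],
                    binary_rules=[("S", ("A", "B"))]))  # True
```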

Sphinx docstrings are parsed quite differently from Google docstrings. When a Google docstring is parsed, the entire docstring body is fed to CYK. In a Sphinx docstring, each :<keyword> <name>: <text> entry is fed to CYK separately, resulting in much smaller input per parse. Given the cubic complexity of the algorithm, it's clear why Sphinx docstrings parse so much faster.
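
A back-of-the-envelope model makes the difference concrete. This is a rough sketch that ignores the grammar-size factor and assumes equally sized entries:

```python
# Cubic parsing cost: one parse of n tokens vs. k parses of n/k tokens each.
n, k = 300, 30                 # illustrative numbers, not measurements
whole = n ** 3                 # Google style: whole body in one CYK run
per_entry = k * (n // k) ** 3  # Sphinx style: one small CYK run per entry
print(whole // per_entry)      # 900, i.e. a k**2 speedup
```

Splitting the input into k pieces cuts the cost by a factor of k², which is in line with the orders-of-magnitude differences measured above.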

Given these findings, I believe the key to fixing the performance issue for Google docstrings is breaking them apart into sections and section entries in a first pass, and using CYK only on the individual section entries. I haven't looked at numpy docstrings, but their case should be similar.

@terrencepreilly Would you consider this a viable approach?

terrencepreilly commented 2 years ago

breaking them apart into sections and section entries in a first pass, and using CYK only on the individual section entries

In fact, this is already what it does. See, e.g. https://github.com/terrencepreilly/darglint/blob/master/darglint/parse/combinator.py.

In practice, the speed of darglint isn't typically much of an issue for most projects. If a project has a lot of functions with >10 parameters, then there are probably more fruitful areas where the maintainers could put effort than maintaining docstrings. I can see it being frustrating for users who use rst in docstrings. But then, there are alternatives to darglint which should be fine for most use cases. (Someone recently mentioned one from flake8 which does a great job.)

I started on a solution which would use an LR parser. Any docstring which failed to parse would fall back to CYK, so only invalid docstrings would take a significant amount of time. Unfortunately, there aren't great libraries for parser generators which could be easily integrated and which match darglint's requirements, so I started writing a parser generator for a push-down LR parser: https://github.com/terrencepreilly/darglint/tree/parser-generator. I haven't had time to finish it (there are some subtle bugs that need to be worked out before generating the grammars), and I likely won't, as I've moved on to other projects.
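
The fast-path/fallback idea could be sketched like this (hypothetical names throughout; lr_parse and cyk_parse are stand-ins, not actual darglint APIs):

```python
class DocstringParseError(Exception):
    pass

def lr_parse(tokens):
    # Placeholder for a generated LR parser: linear time, but it only
    # accepts well-formed docstrings.
    raise DocstringParseError

def cyk_parse(tokens):
    # Placeholder for the existing CYK parser: cubic time, but able to
    # handle ambiguous and malformed input.
    return ("docstring", tokens)

def parse_docstring(tokens):
    try:
        return lr_parse(tokens)      # fast path for valid docstrings
    except DocstringParseError:
        return cyk_parse(tokens)     # slow path only on parse failure
```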

If speed is a major impediment, check out https://github.com/terrencepreilly/darglint/issues/177, where someone brought up a flake8 plugin which does similar things to darglint. It won't make style suggestions, and can't handle as many ambiguous situations, but I imagine that's probably fine for most folks.

webknjaz commented 2 years ago

I think you mean pylint, not flake8.

cjolowicz commented 2 years ago

breaking them apart into sections and section entries in a first pass, and using CYK only on the individual section entries

In fact, this is already what it does. See, e.g. https://github.com/terrencepreilly/darglint/blob/master/darglint/parse/combinator.py.

It does break up the input into sections (meaning the "Args:", "Raises:" sections, etc.), and I missed that in my earlier post. But darglint does not break up the input into section entries (meaning the individual arguments, exceptions, and so on). This is why Sphinx docstrings are so much faster: they don't have the second level of nesting.

FWIW I experimented with this approach, and I could cut down the runtime of the example from #198 from over two minutes to under a second. Unfortunately, it comes at the price of losing detection of some style errors, such as wrong indentation. Specifically, my approach was to split each section body into its individual entries during token preprocessing and feed every entry to CYK separately.
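
As a rough illustration of that kind of split (hypothetical code; note that it keys on indentation, which is exactly where the lost error detection comes from):

```python
import re

def split_section_entries(section_body: str) -> list[str]:
    """Split an 'Args:'-style section body into per-argument entries."""
    entries: list[str] = []
    for line in section_body.splitlines():
        if re.match(r"^ {4}\S", line):    # a new entry starts at indent 4
            entries.append(line)
        elif entries:
            entries[-1] += "\n" + line    # continuation of the last entry
    return entries

body = (
    "    x (int): the first value,\n"
    "        wrapped onto a second line.\n"
    "    y (int): the second value."
)
print(len(split_section_entries(body)))  # 2 entries, each parsed alone
```

Once each entry is parsed in isolation, indentation mistakes that only show up across entries are invisible to the grammar, hence the lost style checks.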

In the end, I came to the same conclusion. Using another algorithm than CYK where possible would probably give us the same, or greater, performance improvements, without introducing complexity to the token preprocessing stage, and without the loss in accuracy. But I ran out of time 😉

Did you have a look at lark? It uses the Earley algorithm by default, but you can switch the algorithm to LALR(1) if the grammar supports it. There's also a CYK implementation.
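
For illustration, a minimal sketch of switching algorithms in lark (toy grammar, nothing like darglint's real one; assumes pip install lark):

```python
from lark import Lark

# Toy field-list grammar; FIELD_NAME and TEXT are deliberately disjoint.
grammar = r"""
    start: field+
    field: FIELD_NAME TEXT
    FIELD_NAME: /:[a-z]+:/
    TEXT: /[^:\n]+/
    %ignore /\n/
"""

# parser="earley" is the default; parser="cyk" is also available.
parser = Lark(grammar, parser="lalr")
tree = parser.parse(":returns: the sum\n:raises: ValueError")
print(tree.pretty())
```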