Open webknjaz opened 2 years ago
I haven't looked too deep but it sounds like the bottleneck is in darglint/parse/cyk.py.
If cyk.py (and ideally all its imports) are sufficiently type-annotated, it might be worth using mypyc to compile this file to a Python C extension.
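A minimal sketch of what that could look like, assuming cyk.py and its imports are fully annotated (the setup.py below is hypothetical, not an existing darglint file):

```python
# Hypothetical setup.py sketch: compiling darglint's CYK parser with mypyc.
# Assumes darglint/parse/cyk.py and everything it imports carry complete
# type annotations; mypyc rejects code it cannot type-check.
from setuptools import setup
from mypyc.build import mypycify

setup(
    name="darglint",
    # mypycify() compiles the listed modules to C extensions via mypyc.
    ext_modules=mypycify(["darglint/parse/cyk.py"]),
)
```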
I'm having drastic performance penalties if colons (:) are used along with Google-style docstrings. https://github.com/terrencepreilly/darglint/issues/198
@webknjaz, I see that in your example you also have extra colons in your docstring:
Wow, good catch! I would've never guessed... Good thing that I didn't intend to use non-native sphinx syntax — it was an unintended configuration :)
I could confirm cyk.py as the bottleneck since at least v1.2.3 (2020-04-18), using py-spy. But the performance issue has probably existed ever since darglint switched to CYK in 1.0.0-alpha (2019-09-28):
- Changed the recursive descent parser to a CYK parser. This was a significant change. It makes darglint much more flexible and extensible. However, it also introduces a significant performance regression. For that reason, this will be released first as an alpha, while I figure out how to handle the performance regression (or I determine whether or not it's even all that important of a problem.)
darglint implements the CYK algorithm to parse Google and numpy docstrings (edit: and also Sphinx docstrings) from a CNF grammar (generated from a BNF source). numpy docstring support was only implemented after the switch to CYK. So this cannot be solved by simply reverting to recursive descent.
AFAIU there are other parsing algorithms for CNF grammars with better average time complexity. Another approach would be to try to pinpoint and solve this in the grammar(s). The grammar is quite large, though.
It seems to me that colons slow down parsing not because of some problematic rule in the grammar. Rather, they result in a larger number of tokens being fed to the parser (instead of just a few tokens with a lot of words inside). CYK runtime is cubic in the number of input tokens, so this has a pronounced effect on performance.
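To make the cubic cost concrete, here is a minimal CYK recognizer sketch (a textbook version, not darglint's actual cyk.py): the three nested loops over span length, start position, and split point are where the n³ comes from.

```python
# Textbook CYK recognizer for a grammar in Chomsky normal form (CNF).
# unary:  {terminal: set of nonterminals}   e.g. A -> 'a'
# binary: {(B, C): set of nonterminals}     e.g. S -> B C
def cyk_recognize(tokens, unary, binary, start):
    n = len(tokens)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals deriving tokens[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][0] = set(unary.get(tok, ()))
    for span in range(2, n + 1):              # O(n) span lengths
        for i in range(n - span + 1):         # O(n) start positions
            for split in range(1, span):      # O(n) split points => O(n^3)
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for b in left:
                    for c in right:
                        table[i][span - 1] |= binary.get((b, c), set())
    return start in table[0][n - 1]

# Tiny demo grammar: S -> A B, A -> 'a', B -> 'b'
unary = {"a": {"A"}, "b": {"B"}}
binary = {("A", "B"): {"S"}}
print(cyk_recognize(["a", "b"], unary, binary, "S"))  # True
```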
Sphinx docstrings are parsed quite differently from Google docstrings. When a Google docstring is parsed, the entire docstring body is fed to CYK. In a Sphinx docstring, each :<keyword> <name>: <text> entry is fed to CYK separately, resulting in much smaller input per parse. Given the cubic complexity of the algorithm, it's clear why Sphinx docstrings parse so much faster.
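To put rough numbers on the difference (back-of-the-envelope arithmetic, not a measurement): a single 100-token Google section body costs on the order of 100³ = 1,000,000 CYK cell combinations, while the same content split into ten 10-token Sphinx entries costs ten parses of about 10³ = 1,000 each, roughly 10,000 in total, a hundredfold reduction.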
Given these findings, I believe the key to fixing the performance issue for Google docstrings is breaking them apart into sections and section entries in a first pass, and using CYK only on the individual section entries. I haven't looked at numpy docstrings, but their case should be similar.
@terrencepreilly Would you consider this a viable approach?
breaking them apart into sections and section entries in a first pass, and using CYK only on the individual section entries
In fact, this is already what it does. See, e.g. https://github.com/terrencepreilly/darglint/blob/master/darglint/parse/combinator.py.
In fact, the speed of darglint isn't typically much of an issue for most projects. If a project has a lot of functions with >10 parameters, then there are probably more fruitful areas where the maintainers could put effort than maintaining docstrings. I can see it being frustrating for users who use rst in docstrings. But then, there are alternatives to darglint which should be fine for most use cases. (Someone recently mentioned one from flake8 which does a great job.)
I started on a solution which would use an LR parser. Any docstrings which fail to parse would fall back to CYK, so only invalid docstrings would take a significant amount of time. Unfortunately, there aren't great libraries for parser generators which could be easily integrated and which match darglint's requirements. So I started writing a parser generator for a push-down LR parser: https://github.com/terrencepreilly/darglint/tree/parser-generator. I haven't had time to finish it (there are some subtle bugs that need to be worked out before generating the grammars), and I likely won't, as I've moved on to other projects.
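The fallback idea, roughly (the names below are hypothetical stand-ins, not code from the parser-generator branch):

```python
class ParseError(Exception):
    """Raised by the (hypothetical) LR parser on invalid input."""

def parse_docstring(tokens, lr_parser, cyk_parser):
    """Parse with the fast LR parser; fall back to CYK only on failure.

    `lr_parser` and `cyk_parser` are hypothetical objects with a
    .parse(tokens) method, standing in for the generated LR parser and
    darglint's existing CYK parser.
    """
    try:
        # Valid docstrings (the common case) take the linear-time path.
        return lr_parser.parse(tokens)
    except ParseError:
        # Only docstrings the LR parser rejects pay the cubic CYK cost,
        # where CYK's tolerance of ambiguity helps produce diagnostics.
        return cyk_parser.parse(tokens)
```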
If speed is a major impediment, check out https://github.com/terrencepreilly/darglint/issues/177, where someone brought up a flake8 plugin which does similar things to darglint. It won't make style suggestions, and can't handle as many ambiguous situations, but I imagine that's probably fine for most folks.
I think you mean pylint, not flake8.
breaking them apart into sections and section entries in a first pass, and using CYK only on the individual section entries
In fact, this is already what it does. See, e.g. https://github.com/terrencepreilly/darglint/blob/master/darglint/parse/combinator.py.
It does break up the input into sections (meaning the "Args:", "Raises:" sections, etc.), and I missed that in my earlier post. But darglint does not break up the input into section entries (meaning the individual arguments, or exceptions, and so on). This is why Sphinx docstrings are so much faster: they don't have the second level of nesting.
FWIW I experimented with this approach, and I could cut down the runtime of the example from #198 from over two minutes to under a second. Unfortunately, it comes at the price of losing detection of some style errors, such as wrong indentation. Specifically, my approach was this:
split each section body into separate entries at every NEWLINE INDENT INDENT token sequence, and run CYK on each entry individually. In the end, I came to the same conclusion. Using another algorithm than CYK where possible would probably give us the same, or greater, performance improvements, without introducing complexity to the token preprocessing stage, and without the loss in accuracy. But I ran out of time 😉
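For reference, a simplified version of that preprocessing step might look like this (the string token names are illustrative; darglint's real token stream uses its own token types):

```python
from typing import Iterable, List

# Split a section body's token stream into entries at every
# NEWLINE INDENT INDENT run, so CYK only ever sees one entry at a time.
def split_entries(tokens: List[str]) -> Iterable[List[str]]:
    entry: List[str] = []
    i = 0
    while i < len(tokens):
        if tokens[i : i + 3] == ["NEWLINE", "INDENT", "INDENT"]:
            if entry:
                yield entry
            entry = []
            i += 3  # consume the separator; indentation info is lost here,
                    # which is why some style errors go undetected
        else:
            entry.append(tokens[i])
            i += 1
    if entry:
        yield entry
```

Each yielded entry would then be parsed by CYK separately, so the cubic cost applies per entry rather than to the whole section at once.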
Did you have a look at lark? It uses the Earley algorithm by default, but you can switch the algorithm to LALR(1) if the grammar supports it. There's also a CYK implementation.
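For illustration, a toy lark grammar for a single Sphinx-style field using the LALR(1) backend might look like this (a sketch of lark usage, not darglint's grammar; LALR's contextual lexer is what keeps the overlapping terminals apart):

```python
from lark import Lark

# A toy grammar for one Sphinx-style field, e.g. ":param user_id: text".
GRAMMAR = r"""
    field: ":" KEYWORD NAME ":" TEXT
    KEYWORD: "param" | "raises"
    NAME: /[A-Za-z_][A-Za-z_0-9]*/
    TEXT: /[^\n]+/
    %import common.WS
    %ignore WS
"""

# parser="lalr" selects the LALR(1) backend; lark also offers "earley"
# (the default) and "cyk".
parser = Lark(GRAMMAR, start="field", parser="lalr")
print(parser.parse(":param user_id: The id of the user.").pretty())
```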
I was integrating the WPS linter bundle of curated flake8 linters that happens to depend on darglint. With the initial integration attempt, I noticed something unexpected: flake8 took about 11 minutes on my laptop to complete (longer in slower envs). This isn't exactly responsive, is it? But I've also not seen such slow invocations with similar setups in the past. So what was going on, I wondered? I first narrowed it down to a "slowly linted" folder and then to specific files. The slowest one took almost 10 minutes to lint. Yes, one file! It was https://github.com/ansible/ansible-navigator/blob/295b152/src/ansible_navigator/runner/base.py. It's not particularly big (just 215 LOC).

Since WPS brings a lot of linters, I decided not to check each one; instead, I used py-spy to profile the flake8 invocation on this single file. This tool produces SVG flame charts. After looking at the produced visualization, I noticed that darglint was more prominent than anything else. I quickly confirmed my suspicion by doing pip uninstall darglint and seeing that after that flake8 completed the linting process almost instantly. Later, I also confirmed that invoking darglint directly had the same performance penalty, so the problem is not in the integration with flake8 but within darglint itself.

With that in mind, I kept digging in hopes of finding a solution to my problem. I used py-spy on a raw darglint invocation, checked the chart again, and to my surprise, most of the CPU time was consumed by google.py. I didn't expect this because I usually use the Sphinx style for structured docstrings; apparently, I had forgotten to add a config for darglint, and it uses the google style by default. It didn't take long for me to check that switching to sphinx fixed the problem. This is the solution for me.

To file this bug, I've also checked whether numpy is affected too. And it is, but it's slightly less slow. Here are the timings of linting the file I linked above:

As you can see, google is 4360x slower than sphinx, and numpy is 3420x slower than sphinx. I haven't looked too deep but it sounds like the bottleneck is in darglint/parse/cyk.py. I've also generated the flame charts for each of the styles, using the same linted file with these commands:

Download py-spy-darglint-flames.tar.gz (SVG flame charts)