Speed up parsing time - Githubissues

Dob4115 commented 5 years ago

Version 2.4.1.4627 OS: Microsoft Windows NT 10.0.17134.0, x64 Host Product: Microsoft Office 2016 x86 Host Version: 16.0.4822.1000 Host Executable: EXCEL.EXE

I have several large Excel addins containing library functions (one is 800KB+). A reparse on my machine takes ~45 seconds to a minute, complete with Excel and VBE freezes/blank windows, which renders Excel unusable during any reparse.

Is there any way to speed up parsing or, at least make it less invasive? I have turned off rubberduck for the moment so I can work. A couple ideas I had:

A setting that makes all reparses manual (part of the problem is that rubberduck reparses whenever I open a new file or add/delete a sheet. This means that my Excel instance is constantly freezing on me when I have the IDE open but am just working in a workbook.). Understood that Rubberduck functionality may be greyed out until I click reparse and wait.
Related to the line above, only automatically reparse if the VBE window has focus.
Somehow only reparse changed modules (most of my project has not been touched between reparsings). (I dont understand what would be involved in marking modules "changed"! Just would be kind of cool to only sit through a full reparse upon loading, bu thave incremental reparses after that.)
Show all code modules "as-soon-as-possible" in the navigation pane and put parsing into the background so I can move around the projects and work on code while parsing is completing. Yes, this means I need to reparse again before using rubberduck refactoring (for example), but at least I dont have to sit around and wait just to see my code.

bclothier commented 5 years ago

Thank you for reporting. This is always a thing we want to improve.

However, you might want to read about what we already do currently to minimize the parsing time, particularly the section titled "Doing Only What Is Necessary".

The request to load the code pane immediately is kind of already an issue: #3697.

Note that inspections, which isn't a part of the parsing process may be automatically run after a reparse. You may want to try and turn off that via the settings dialog.

We also suggest that you run the log with Trace log and review if there are any cases where the parser is falling back to slower LL mode. There are certain expressions that we cannot resolve using the faster SLL mode, in which case, we must fallback to the slower LL mode and the log will indicate that. If there are several entries about fallbacking to LL, that would be a contributing factor and requires some code changes.

Consider furthermore that because we are limited to a granularity of a module level, a module with 10000 lines is much more expensive to parse, even you only changed a single character compared to a module with only 100 lines. More particularly so if the module itself has several references. Again, this is a case where code needs to be modified.

If you have a specific module that you think it should run fast but is very slow, please share the log with us so we can analyze it.

Of course, we always need more help -- building a fast and reliable parser is a very hard problem and we welcome anyone who want to contribute toward that goal.

retailcoder commented 5 years ago

Thanks for the feedback! I'll add that Rubberduck, like many open-source software tools, is maintained by a small team of volunteers, and as stipulated by its license, is provided as-is, in the hope that it will be useful. We love that you care enough for what we do to file a bug report, that's awesome - feel free to show that care and appreciation by starring our project.

As for how you can speed up the parser/resolver:

Write smaller, more specialized modules. You want your modules to have low coupling (dependencies on other modules) and high cohesion (members of a module are related in terms of functionality).
Avoid late-bound code. Declare your variables, and if you're not getting IntelliSense on a member call, extract a local variable and give it an explicit type. For example Worksheets("Sheet1") returns a Variant: this makes the resolver work harder when trying to resolve the rest of the expression, whereas if the sheet is pulled into a Worksheet variable, the resolver will perform much better.
As stated above, avoid bang!notation and review logs to see if parsing is being slowed down by SLL failures and retries. Simplify complex expressions by extracting local variables.
Remove unused library references, if applicable.

Note that these are good coding practices in general, not just ways to make Rubberduck work better.

Cheers!

Dob4115 commented 5 years ago

Thanks for the additional information. I looked through the options several times but had not seen the ability to not perform inspections... I turned that off and it is a little faster.

You said:

Parsing is performed in the background, and
You already only reparse modified modules

I logged opening the VBE (initial parse took 1:12 according to the log) and then a couple minutes later, after not changing anything, I reparsed (this took 1:02). I was expecting more than 10 seconds difference, does this make sense? I also am interested, if parsing is in the background, why Excel (not just the IDE) freezes for most of the parsing. The freezing is new in the latest stable version (previously I used 2.4.0 stable).

I looked through the log and did see some instances of SLL failing, but not many (to my untutored eye, however!). I also saw a bunch of

MainParseSyntaxErrorException: extraneous input ',' expecting {<EOF>, ':', REM, NEWLINE, ''', WS, LINE_CONTINUATION`

(e.g., at 2019-04-08 15:01:47.4469). Would these be slowing things significantly?

I looked further at the instance I provide and the line of code at the address specified (L197C29) is:

        .Sorted = False

This does not have 29 characters... I'm a little confused about what I can change to eliminate the error. RubberduckLog.0.txt

Vogel612 commented 5 years ago

The log you provided has two parsing runs. The first parsing run takes ~62 seconds according to the log, the second takes 40.

The first run also provokes multiple (14) SLL failures and falls back to LL mode. None of these appear in the second run.

These times break down into the following specific timings (according to the log)

Run 1:

Loading and unloading referenced libraries ~1600ms
Parsing user modules ~13000ms
Resolving user declarations ~1300ms
Executing a compilation pass ~900ms
Resolving references ~13000ms

Run 2:

Loading and unloading referenced libraries ~300ms
Executing a compilation pass ~300ms
Resolving references ~2000ms

I've skipped some items with times below 10ms. Note that the second run does not parse user modules. Also all operations that it does perform are completed faster than in the preceding run.

I also want to add a word on the syntax error you found in the log:
Unless Rubberduck explicitly tells you there is a syntax error while the VBE compiles the code just fine, we won't be able to do very much about it. Rubberduck uses the "SLL prediction mode" of the ANTLR-generated parser to speed up the parsing process. That prediction mode does not guarantee that the grammar is applied correctly, though.

This can result in "faux syntax errors". When reparsing using the slower, but 100% correct prediction mode ("LL"), most of these errors vanish. Only if the LL parsing fails, does Rubberduck tell you there is a syntax error.

retailcoder commented 5 years ago

SLL failures cost quite a lot; they involve catching and logging an exception, so .NET needs to unwind the call stack, and then the parse run gets retried with the slower prediction mode.

As for L197C29 not pointing to anything, see if the same position in the exported code file makes more sense - Rubberduck exports each module to a temp file that is then processed with a simplified grammar, to pick up module & member attributes (unless that was changed): it's possible that line 197 of the exported code file yields more information about what the problem might be.

If you can find & share the code SLL is choking on, we can probably help tweak it to help the parser process it orders of magnitude faster - I completely agree that this shouldn't be needed (tweaking your code, that is), but VBA being a painfully ambiguous language to parse, it's entirely possible some phrasings perform better than others.

I see you're running 32-bit Office on a 64-bit Windows... 32-bit limitations might be interfering too, memory-wise, causing more pressure on the hard drive - and waiting on the hard drive for memory swaps can't be helped by throwing more threads at the problem. This could be why the process is "freezing" (no repro here, 64-bit Excel 2016, Win7 Pro x64).

Dob4115 commented 5 years ago

I'm doing some investigating... do underscores in class module methods trigger an LL parse? That's the only thing I can figure. Here is one piece of code that triggers an LL reparse (from the exported codemodule... it has one tab indent):

`2019-04-08 15:01:37.7998;WARN-2.4.1.4627;Rubberduck.Parsing.VBA.Parsing.TokenStreamParserBase;SLL mode failed while parsing the CodePaneCode version of module uf_TreeviewControls at symbol , at L50C26. Retrying using LL.;

[Line 50 follows]

TVHandler.MoveNode tr_Demote

`

(Yes, I know undescores are a bad thing and I understand why... just haven't yet cleaned up my legacy sloppy coding-style!)

32-bit office on 64-bit windows is because I still use the TreeView and ListView controls.

bclothier commented 5 years ago

The underscore isn't likely - I setup a repro and verified it does not cause a fallback.

Public Enum foo
    f_bar
End Enum

Public Sub Test()
    Cells(1, 1).Value = foo.f_bar
End Sub

It might be something to do with the type of TVHandler or the method MoveNode, or the fact that tr_Demote isn't fully qualified and is ambiguous.

retailcoder commented 5 years ago

from the exported codemodule

Sorry for the confusion - in "the CodePaneCode version of module uf_TreeViewControls", CodePaneCode is coming from an enum:

https://github.com/rubberduck-vba/Rubberduck/blob/1cf6e0f9be5d16111323731804bd6feb678887ac/Rubberduck.Parsing/VBA/Parsing/CodeKind.cs#L3-L8

(unrelated: why do we have two spaces between enum and CodeKind?)

AttributesCode would be off the exported module; CodePaneCode is straight off the code panes (SnippetCode is used for ad-hoc parsing in other areas, not during normal parse tasks). In other words, L50C26 would be around line 50 in the actual editor - the SLL failure is likely unrelated to tr_Demote, its underscore, or anything in line 50 of the exported module.

Dob4115 commented 5 years ago

Aha... in that case, here is line 50 from the editor... one indent:

TVHandler.AddNewNode , tr_ChildLast

TVHandler is a custom class that I use to interact with a TreeView control. AddNewNode looks like this:

Public Function AddNewNode(Optional Relative As MSComctlLib.Node, Optional Relationship As TreeRelationship = tr_ChildLast)

retailcoder commented 5 years ago

Must be the skipped argument throwing it off, see if using named arguments helps it:

TVHandler.AddNewNode Relationship:=tr_ChildLast

Or, since the specified optional argument has the same value as the default parameter value, consider not supplying it:

TVHandler.AddNewNode

That said, AddNewNode being a Function, I'd expect it to return the created node object, to be captured at the call site:

Set node = TVHandler.AddNewNode

Hope it helps!

retailcoder commented 5 years ago

Am I the only one this is giving "Rubberduck Opportunities" inspection ideas? We could warn about skipped optional arguments thwarting SLL prediction mode (perhaps not in these words) and make a quickfix that introduces named arguments and removes the skipped arguments, making Rubberduck work better.

Dob4115 commented 5 years ago

Is there any way to tell from the logfile how much extra time the SLL failures cost? There are 14 such failures (logfile line 873 is the one I've mentioned).

Would these errors account for Excel freezing in the latest version of RubberDuck, in contrast to the last version, which did not freeze the main Excel window while parsing?

Vogel612 commented 5 years ago

@Dob4115 That's nearly impossible to tell. We'd need to know how much time a completion in SLL mode would've taken to subtract that from the LL parsing time. But we can't know that, nor do we time individual module parses.

It's just not a metric that gives us as development team a lot of information. Parsing basically takes as long as it takes, aside from a few exceptions where ANTLR is bugged or the grammar could be optimized for a specific case (which comes with costs for other cases ...)

We're pretty happy with parsing times as they are, especially considering how many quirks of VBA and the VBE we're handling correctly...

retailcoder commented 5 years ago

Is there any way to tell from the logfile how much extra time the SLL failures cost?

Takes a bit (lots!) of mucking around with Excel, but yeah... kind of.

I managed to parse enough of the logs to make a data set that can be pivoted, then calculated the difference between [Started] and [Ready] states for each module.

So with a population of 210 modules (!) and pivoting on the MIN timestamp per state/module, time to "Ready" state is between 0:47 and 1:00, with an average of 0:54 and a standard deviation of 0:04.

But "Ready" state is the whole entire process, including parse tree traversals and identifier resolution - which SLL failures don't affect at all.

Now, there's more than a single parse run going on here; the 14 failures are really off 7 modules - and now we're starting to get a clearer picture of how SLL failures are affecting the run:

pivoted timestamps for SLL-failed modules

As you can see, the effect of SLL failures and LL retries is clearly not where the majority of the processing is going. A little bit of simple diffing math, and we can break down the duration of each subsequent parser state:

parsing stages

Note that resolving declarations only begins after all modules reach "Parsed" state, and the SLL-Fail timestamp includes unwinding the call stack to log the exception's stack trace, which doesn't appear to be insignificant.

Also note... the bulk of the "parsing" time is spent resolving the declarations, i.e. traversing the parse trees and building a usable representation that our internal API can consume. That step comletely runs on background threads (that's part of why many small modules is better than few massive ones), and shouldn't be locking up the host for any reason - unless you're running with limited resources.

Attaching the workbook I used to do these pivots, if anyone wants to dig further into this:

ParsedLogs-Issue4905.xlsx

Dob4115 commented 5 years ago

Nice analysis! So let me clarify my original "problem" a bit. There are two things that, in combination, give me grief:

Codepane does not allow me to navigate modules while RD is parsing (I completely understand that anything below modules is unavailable while parsing!). This, as alluded to before, is already a ticket.
Excel freezing during RD parsing.

If these two issues were resolved, I could at least move around and start reviewing code while waiting for the parser to finish. (I'd be ok with editing while parsing, though it's obvious that the current parse would be of limited value for the newly edited code!)

As I said before, the freezing is new with this build of RD. I didn't think specs were a problem:

Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz

Base speed: 3.70 GHz
Sockets: 1
Cores: 6
Logical processors: 12
Speed 3.68 GHz

16GB of 2666MHz memory.

Also, I'm not trying to point fingers or downplay what the RD team has already achieved, particularly regarding parsing. Between the addin itself and the blog posts, my coding is getting much more maintainable! (The posts on clean code alone were revolutionary, thank you.) I understand that parsing has gotten much scrutiny and am not trying to suggest that "there has to be a faster way to do it." I'm just trying to better understand what's going on.

Thanks also for spending the time to answer my questions and perform research!

retailcoder commented 5 years ago

Specs look very much right - note that memory is being artificially constrained by running a 32-bit host: all that RAM is essentially useless, and I suspect a lot of disk activity is going on to compensate: that is what's slowing it down. As already mentioned and proven by the logs, pretty much everything that happens during a parse, happens on a background thread - there's nothing more we can do to avoid "freezing"... unless we move the parsing out-of-process.... which is wicked complex at this point.

MDoerner commented 5 years ago

I have done some research under a profiler. I see two leads we should really follow:

When actually parsing the modules, we spend a lot of time waiting for locks to save the results.
While reloading the code explorer and the code metrics window, we completely block the UI thread spamming it with file reads for the settings.

The second point is the freeze you observe. To fix that we probably just need to read the settings only once and not repeatedly. Moreover, we might want to look for more heavy operations and push them to background tasks.

Regarding the first point, we should evaluate whether there are better methods to collect the data and then write it to the cache in one swoop.

Dob4115 commented 5 years ago

@MDoerner , your second point makes sense with what I am seeing. Everything runs ok for, say, 2/3's of the parse, then the GUI locks up until the code explorer "suddenly" populates.

rubberduck-vba / Rubberduck

Speed up parsing time #4905