rubberduck-vba / Rubberduck

Every programmer needs a rubberduck. COM add-in for the VBA & VB6 IDE (VBE).
https://rubberduckvba.com
GNU General Public License v3.0

Need for benchmarking tests #4311

Open bclothier opened 6 years ago

bclothier commented 6 years ago

In light of #4307, @Vogel612 made a comment about how Rubberduck needs benchmark tests. Seeing how performance can be impacted by even the most benign-looking changes, I'm inclined to agree. Our challenge, however, is automating the benchmarks in a reliable manner.

The challenges are as follows:

1) We would need to be able to host the VBA projects or install Visual Studio 6 on AppVeyor, either of which has licensing issues

2) Even if licensing weren't a problem, Microsoft Office is not suitable for headless automation and cannot be reliably depended on to host the benchmark tests

3) Benchmarks can change as we develop. Ideally, we want times to keep dropping as we improve. To be meaningful, we need a memory of previous benchmarks so we can assess the deviation from past performance without failing a build just because it's now 1 ms slower

4) On the other hand, we shouldn't penalize ourselves when we add a new inspection, which by definition will always add to the overall time. For that reason, we need to segregate the statistics accordingly. One way to do this is to assess the parse time on its own and then each inspection individually (a rough sketch follows below).
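As an illustration of that segregation, here is a minimal sketch in C# (the language Rubberduck is written in). The class and method names are made up for this sketch, and the actual parse and inspection calls are abstracted behind delegates rather than real Rubberduck APIs:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class BenchmarkTimings
{
    // Times the parse once, then each inspection individually, so that adding a new
    // inspection adds a new row to the report instead of inflating one aggregate number.
    public static IDictionary<string, TimeSpan> Measure(
        Action runParse,
        IReadOnlyDictionary<string, Action> inspectionsByName)
    {
        var timings = new Dictionary<string, TimeSpan>();

        var stopwatch = Stopwatch.StartNew();
        runParse();
        timings["Parse"] = stopwatch.Elapsed;

        foreach (var inspection in inspectionsByName)
        {
            stopwatch.Restart();
            inspection.Value();
            timings[inspection.Key] = stopwatch.Elapsed;
        }

        return timings;
    }
}
```

How the delegates get fed (e.g. from a mock VB* project, as proposed in option 1 below) is deliberately left open here.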

In light of this, there are a few options we can consider:

1) Instead of going for the full monty, we can settle for providing a set of files that can be imported into a mock VB* project. That would go through the full parsing & inspection cycle from a cold cache. This won't measure the time spent loading the modules, but all things considered, that's not something we can optimize anyway, whereas we can always optimize the parsing/inspecting aspects. Thus, a mock VB host/project serves our purpose.

Furthermore, I think most can agree that parsing/inspecting is the most mission-critical part of the project. It's not the end of the world if a quickfix takes 1 or 2 seconds longer, but it would be if parsing took 1 second longer, as that hampers the ability to do multiple refactorings.

2) AppVeyor has a build cache which we can (ab)use to accumulate past benchmark results for each build (a rough sketch follows after this list). There might be alternatives (e.g. a database on Rubberduck's website), but this seems the simplest way to set it up without creating a dependency outside the AppVeyor environment.

3) The test could be set up to fail only if the benchmark exceeds some threshold. One possibility is to check whether it exceeds a standard deviation over the last N builds. This way, it won't penalize a build that's only a bit slower but will stop one that's considerably slower. The other possibility is to use Duga or some GitHub bot to broadcast the numbers so the contributors can make decisions.
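To make the build-cache idea from option 2 concrete, here's a rough sketch of how per-build results could be accumulated. The file name, CSV layout, and 30-build window are arbitrary choices for illustration; the AppVeyor-specific assumptions are that the folder holding the history file is listed under `cache:` in appveyor.yml so it survives between builds, and that the build id could come from something like the `APPVEYOR_BUILD_NUMBER` environment variable:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

public static class BenchmarkHistory
{
    private const int MaxBuilds = 30; // arbitrary; keep only the most recent N builds

    // Appends one "buildId,phase,milliseconds" row per measured phase, then trims
    // old builds so the cached file stays small.
    public static void Append(string historyPath, string buildId,
        IDictionary<string, TimeSpan> timings)
    {
        var lines = File.Exists(historyPath)
            ? File.ReadAllLines(historyPath).ToList()
            : new List<string>();

        foreach (var timing in timings)
        {
            lines.Add(string.Join(",", buildId, timing.Key,
                timing.Value.TotalMilliseconds.ToString(CultureInfo.InvariantCulture)));
        }

        // Keep only the rows belonging to the most recent MaxBuilds distinct builds.
        var builds = lines.Select(line => line.Split(',')[0]).Distinct().ToList();
        var keep = new HashSet<string>(builds.Skip(Math.Max(0, builds.Count - MaxBuilds)));

        File.WriteAllLines(historyPath, lines.Where(line => keep.Contains(line.Split(',')[0])));
    }
}
```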

Feasible? Can it be better? Thoughts?

mansellan commented 6 years ago

Is it worth looking into the licensing issues for VS6? If it's possible to host an actual VB6 project it might give us a better view of real-world performance.

bclothier commented 6 years ago

I have my doubts about how "real-world" it will be, because AppVeyor must be able to run headless, which means running VS6 in command-line mode. That would not correspond very closely to typical use, where a GUI is involved.

Besides, it's really not the GUI operations that we need to worry about anyway. It's how fast we can keep up with the GUI without flickering or stalling or disabling features because we haven't got the data.

PeterMTaylor commented 6 years ago

My thought is that a benchmark leaning more towards a defined stress-test environment (new for RD) might be one angle to consider, run as a separate AppVeyor session after the build process that produces the release files. The statistics from the stress test can be as verbose as we like, to help us understand the outcome of the parse session compared to the inspection analysis, etc. I think it's flexible: whatever modules/classes/forms/reports we provide, this environment can tell us how the parse benchmark does and how RD holds up overall.

PeterMTaylor commented 6 years ago

I've still never heard of a cold cache, as compared to a cache holding persistent data long enough.

bclothier commented 6 years ago

From the chat, there's a concern about whether the AppVeyor VM is consistent enough for benchmarking. It's possible that "benchmark" may not be the best word, as we have two different goals that can be achieved using the same techniques being discussed:

  1. we want early detection of any anomalous slowdown in performance;
  2. we want to measure the performance to ensure that we are improving.

Both can use the same technique, but the first allows a more lenient standard (e.g. only fail if it's a standard deviation or two outside the norm). In this sense, we are only providing a relative measurement, not an absolute one, to guard against unwanted slowdowns that would be difficult to detect from a cursory glance through the PR changes.

It's the 2nd goal that requires much tighter constraints and might be impossible to realize on the shared VMs that AppVeyor is surely using. That said, I don't think this precludes us from satisfying the first goal of detecting anomalous slowdowns.
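For the first goal, a lenient relative gate over the accumulated history might look roughly like the following, assuming a list of past timings (in milliseconds) for a single phase such as the parse. The two-sigma cutoff and the minimum sample size are arbitrary choices for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SlowdownGate
{
    // Returns true when the current timing is anomalously slow compared to history.
    // Lenient by design: small regressions pass; only clear outliers fail the build.
    public static bool IsAnomalous(IReadOnlyCollection<double> historyMs, double currentMs,
        double sigmaCutoff = 2.0, int minimumSamples = 5)
    {
        if (historyMs.Count < minimumSamples)
        {
            // Not enough data yet to say anything meaningful; let the build pass.
            return false;
        }

        var mean = historyMs.Average();
        var variance = historyMs.Sum(ms => (ms - mean) * (ms - mean)) / historyMs.Count;
        var standardDeviation = Math.Sqrt(variance);

        return currentMs > mean + sigmaCutoff * standardDeviation;
    }
}
```

For the second goal, the same history could simply be reported (e.g. via Duga or a bot) rather than gated on, which sidesteps the shared-VM noise problem.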