louis-langholtz / PlayRho

An interactive physics engine & library.
zlib License

No website for CI benchmarking and trend analysis #7

Open louis-langholtz opened 7 years ago

louis-langholtz commented 7 years ago

Would be awesome to have a website that provided benchmarking trend analysis. Something like coveralls.io that showed the historical trend but for performance metrics (instead of unit test code coverage metrics).

Does such a website/service even exist?

louis-langholtz commented 7 years ago

Ideally this website would integrate with the benchmark data produced by Benchmark.

Hexlord commented 7 years ago
  1. You could extend the console reporter to store each report in a separate file (e.g. report_08_28_2017_14_18.txt) and commit-push it to some playrhobenchmark repo (a minimal sketch follows below).
  2. Then a .py script could parse the reports and build a JSON file of the performance metrics.
  3. Then a simple benchmark.html using jQuery could load the generated JSON and render some metrics; jQuery UI or Bootstrap could do the trick.

Thus no servlet stuff is required, and it's an easily integrated gh-pages solution (just like API/index.html). The script launching can be folded into the benchmark launching (exec right after), and it allows both total (summed-up) metrics and per-benchmark metrics.
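
For what it's worth, here's a minimal sketch of part 1, under the assumption that the reports come from Google Benchmark: it dumps each run into a timestamped JSON file alongside the normal console output. The benchmark body and file naming are just illustrative. (Newer Google Benchmark versions can also do this from the command line via `--benchmark_out=<file>` and `--benchmark_out_format=json`, which would avoid touching any reporter code at all.)

```cpp
// Sketch: run the registered benchmarks once, print the usual console report,
// and also dump a JSON report into a timestamped file that could be committed
// to a benchmark repo.
#include <benchmark/benchmark.h>
#include <chrono>
#include <ctime>
#include <fstream>

static void BM_FloatAlmostZero(benchmark::State& state) // illustrative body
{
    float value = 1e-9f;
    for (auto _ : state) {
        benchmark::DoNotOptimize(value * value);
    }
}
BENCHMARK(BM_FloatAlmostZero);

int main(int argc, char** argv)
{
    benchmark::Initialize(&argc, argv);

    // Build a file name like report_2017_08_28_14_18.json from the local time.
    const auto now = std::chrono::system_clock::to_time_t(std::chrono::system_clock::now());
    char filename[64];
    std::strftime(filename, sizeof(filename), "report_%Y_%m_%d_%H_%M.json", std::localtime(&now));

    std::ofstream out(filename);
    benchmark::JSONReporter fileReporter;
    fileReporter.SetOutputStream(&out);

    benchmark::ConsoleReporter displayReporter;
    benchmark::RunSpecifiedBenchmarks(&displayReporter, &fileReporter);
    return 0;
}
```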

The first part is under your control; the second I can't do myself since I don't know Python; the third I could complete on my own.

Just some thoughts; what do you think?

louis-langholtz commented 7 years ago

I like your suggestion, yes.

Here's an ordered list of where I'm at now in my thinking on this:

  1. Get issue #18 taken care of.
  2. Add building of the benchmark application to the Travis-CI build components (issue #53).

These two tasks seem relatively easy to accomplish and are in fact done now.

Beyond the "easy" tasks, then:

  1. Tie in the execution of the benchmark application to the on-success event of the unit-test execution.
  2. Add basic archiving of the benchmark output to the gh-pages branch such that the raw data can be directly downloaded/viewed from the gh-pages web interface.
  3. Add a JavaScript/TypeScript assisted visual interface to the benchmark output that provides the trend analysis. I'd hoped to avoid writing this, but I'm not aware of an already-available service that provides something like it.

Unfortunately, reading the Travis-CI Continuous Performance Testing feature-request issue, there seem to be repeated suggestions that Travis-CI VMs just aren't reliable enough for these kinds of performance measurements.

Hexlord commented 7 years ago

Sounds fair. I have made a draft of what it could look like. Turns out buildTime in data.json is not necessary though.

louis-langholtz commented 7 years ago

I'm impressed with the draft you made! (I love it!!)

And I agree that build time is a less important metric than the run times of the actual benchmark tests.

Beyond the "easy" tasks (which are done now), I'm not sure what to do about the performance variability of the Travis-CI system. Explaining times increasing would be too problematic I fear and too apt to give the project wrong first impressions if used like this (through one of the existing CI systems).

So I'm leaning toward the generation and storage of reports not being part of any CI system but rather being handled manually by contributing developers. That sounds more like what I think you were originally suggesting. This way, developers interested in running the benchmark application can take the care needed to shut down unnecessary processes/tasks/threads and get more reliable results.
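
As an aside, one way to make such manually generated reports less noisy, assuming Google Benchmark is the source of the data, is to request several repetitions and archive only the aggregates (mean, median, standard deviation). A minimal sketch:

```cpp
#include <benchmark/benchmark.h>

// Illustrative benchmark body; any existing benchmark could be configured
// the same way.
static void BM_Example(benchmark::State& state)
{
    double x = 1.0;
    for (auto _ : state) {
        benchmark::DoNotOptimize(x += 0.5);
    }
}

// Run the whole benchmark 10 times and report only the aggregate rows
// (mean, median, stddev), which are the values worth archiving for trends.
BENCHMARK(BM_Example)->Repetitions(10)->ReportAggregatesOnly(true);

BENCHMARK_MAIN();
```

The same repetition count can also be requested at run time with `--benchmark_repetitions=10`, without recompiling.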

Hexlord commented 7 years ago

I see your point. The thing is that a single PC must be used for benchmarking; otherwise the results depend heavily on each user's PC configuration and are absolutely unreliable.

A possible solution is a virtual machine with a CPU limit running the Google Benchmark suite, but that is a hard setup for anyone.

The possible workaround to the problem:

Usually, if solution B is faster than solution A, it will be so on any PC. Say a contributor wants to improve the performance of the FloatAlmostZero test. They launch a script that compiles and executes EVERY prior version of this benchmark, ending with their new version. These metrics should preserve the general trend across benchmark builds; it is only the magnitude of the metrics that gets rescaled by the PC's performance, which is fine. The overall metric also keeps its general trend, since it is calculated by summing the specific metrics and PC performance scales all builds at once.

Not quite sure whether that is possible in a continuous-integration scheme though; walking back through the git commits can probably work to obtain each version of the benchmark.

Also, an easy fix for the Travis-CI fluctuations could be excluding small tests (< 1000 ms) from the overall metrics, though I'm not sure. Maybe manual timing can fix the 2ns->3ns->2ns jitter of benchmarking float multiplication and the like (std::chrono's high-resolution clock seems to support microseconds), if that is the cause of the problem you described.
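
On the manual-timing point, Google Benchmark does have a manual-time mode. Here's a sketch in which the benchmark times a batch of multiplications itself with std::chrono and reports the elapsed seconds to the framework; the batch size and benchmark body are illustrative, and timing a batch is what keeps nanosecond-scale operations from drowning in clock resolution.

```cpp
#include <benchmark/benchmark.h>
#include <chrono>

static void BM_FloatMultiply(benchmark::State& state)
{
    float a = 1.0001f;
    const float b = 0.9999f;
    for (auto _ : state) {
        const auto start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < 1000; ++i) { // time a batch, not a single multiply
            benchmark::DoNotOptimize(a *= b);
        }
        const auto end = std::chrono::high_resolution_clock::now();
        const std::chrono::duration<double> elapsed = end - start;
        state.SetIterationTime(elapsed.count()); // hand the measured seconds to the framework
    }
    state.SetItemsProcessed(state.iterations() * 1000);
}
// UseManualTime() tells the framework to use the times set above instead of
// its own wall-clock measurement of each iteration.
BENCHMARK(BM_FloatMultiply)->UseManualTime();

BENCHMARK_MAIN();
```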

NauticalMile64 commented 7 years ago

@Hexlord That's a pretty nice draft!

Just looking for clarification: is the goal here to be able to track the performance changes caused by modifications to the PlayRho library, while attempting to filter out any performance hits or gains caused by compiler / OS / build-architecture updates?

Also, are we benchmarking the building times as well, or just the speed of the compiled code across X different test cases?

Hexlord commented 7 years ago

@NauticalMile64 Yes, kind of. I meant running the tests on, say, a Pentium 4 versus a Core i7 after the proposal of user-submitted benchmark results, but the factors you mention affect things too. Using Travis-CI is unreliable for performance-metric stability for the known reasons. The best solution is a maintained server whose benchmark results can be trusted. No easy solution so far.

louis-langholtz commented 7 years ago

It occurs to me that PlayRho doesn't really have a good enough timing test suite yet to justify putting much energy into performance trend analysis. I like to run the Benchmark application every now and again, especially when I try changes that could be performance critical, but the results I see just running it on my own desktop vary more than I'm comfortable making much out of. I mostly want to keep this issue open because I don't want to lose sight of it for the longer run.

Hexlord commented 7 years ago

Well, how about providing metrics only for releases, showing the frame-time decrease for each TestBed scene, with Box2D's frame time as the bottom line, and with a screenshot comparing the scenes' squishiness shown on each node on mouse hover? That could also help eliminate the timing issues, and it would look much more proper for presentation. I could prepare such a thing if you are interested.
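
If it helps, here's a rough sketch of how the per-scene frame time could be sampled. The stepScene callback is hypothetical; it stands in for whatever advances a TestBed scene (or the corresponding Box2D scene) by one frame, and the real entry point would need to be substituted.

```cpp
#include <chrono>
#include <functional>
#include <iostream>

// Average wall-clock time per frame, in milliseconds, over a fixed number of
// frames of some scene-stepping callback (hypothetical stand-in for TestBed).
double AverageFrameTimeMs(const std::function<void()>& stepScene, int frames = 1000)
{
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (int i = 0; i < frames; ++i) {
        stepScene(); // advance the scene by one simulated frame
    }
    const std::chrono::duration<double, std::milli> total = Clock::now() - start;
    return total.count() / frames;
}

int main()
{
    // Illustrative usage: a release page could chart this value per scene
    // against the corresponding Box2D number.
    const auto ms = AverageFrameTimeMs([] { /* step the Tumbler scene here */ });
    std::cout << "Tumbler: " << ms << " ms/frame\n";
    return 0;
}
```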

louis-langholtz commented 7 years ago

While not automated, I've created a Wiki page for Benchmark Data. Needs help though from anyone who can build and run the Benchmark application on their hardware and submit the data for it.

Hexlord commented 7 years ago

Should use rdtsc for precise timing of micro-ops.
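
For reference, a minimal sketch of what that could look like on x86, using the __rdtsc() intrinsic (from <x86intrin.h> with GCC/Clang; MSVC puts it in <intrin.h>). The fences discourage the CPU from reordering the measured work across the counter reads, and the result is in TSC ticks rather than seconds, so it would still need scaling by the TSC frequency before comparing machines.

```cpp
#include <x86intrin.h>  // __rdtsc, _mm_lfence (GCC/Clang)
#include <cstdint>
#include <iostream>

int main()
{
    volatile float a = 1.0001f; // volatile keeps the compiler from folding the math away
    volatile float b = 0.9999f;

    _mm_lfence();
    const std::uint64_t begin = __rdtsc();
    _mm_lfence();

    const float product = a * b; // the micro-op under test

    _mm_lfence();
    const std::uint64_t end = __rdtsc();
    _mm_lfence();

    std::cout << "product = " << product << ", ticks = " << (end - begin) << '\n';
    return 0;
}
```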