louis-langholtz opened 7 years ago
Ideally this website integrates with the benchmark data produced by the Benchmark application.
That way no servlet stuff is required: a gh-pages solution integrates easily (just like API/index.html), the script launch can be folded into the benchmark launch (exec right after), and it allows both total (summed-up) metrics and per-benchmark metrics.
The first part is under your control; the second I can't do since I don't know Python; the third I could complete on my own.
Just my thoughts; what do you think?
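To make the idea a bit more concrete, here's a rough sketch of the scripting half. The binary name `Benchmark`, the output file names, and the stored fields are my assumptions rather than settled choices; the `--benchmark_*` flags are Google Benchmark's standard JSON output options. The second function appends one run's parsed results to a history list that could be committed to the gh-pages branch as data.json.

```python
import json

def benchmark_command(binary="./Benchmark", out_file="run.json"):
    """Build the command line using Google Benchmark's JSON output flags."""
    return [binary,
            "--benchmark_format=json",
            f"--benchmark_out={out_file}",
            "--benchmark_out_format=json"]

def append_run(history, run, commit_sha):
    """Append one parsed benchmark run to the gh-pages history list.

    `run` is the dict loaded from Google Benchmark's JSON output; only the
    fields a trend chart would need are kept.
    """
    history.append({
        "commit": commit_sha,
        "context": run.get("context", {}),
        "benchmarks": [
            {"name": b["name"],
             "real_time": b["real_time"],
             "time_unit": b.get("time_unit", "ns")}
            for b in run.get("benchmarks", [])
        ],
    })
    return history
```

A wrapper would then run `benchmark_command()` via subprocess right after the build, load the resulting JSON, and call `append_run` before pushing data.json to gh-pages.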
I like your suggestion, yes.
Here's an ordered list of where I'm at now in my thinking on this:
These two tasks seem relatively easy to accomplish and are in fact done now.
Beyond the "easy" tasks, then: getting the data into the gh-pages branch such that the raw data can be directly downloaded/viewed from the gh-pages web interface. Unfortunately, reading the Travis-CI Continuous Performance Testing feature-request issue, there seem to be repeated suggestions that Travis-CI VMs just aren't reliable enough to do these kinds of performance measurements.
Sounds fair. I have made a draft of what it could look like. Turns out buildTime in data.json is not necessary though.
I'm impressed with the draft you made! (I love it!!)
And I agree that build time is a less important metric than the run times of the actual benchmark tests.
Beyond the "easy" tasks (which are done now), I'm not sure what to do about the performance variability of the Travis-CI system. I fear that having to explain why times increased would be too problematic, and too apt to give the wrong first impressions of the project, if done through one of the existing CI systems.
So I'm leaning toward the generation and storage of reports not being part of any CI system, but rather being dealt with manually by contributing developers. That sounds more like what I think you were originally suggesting. This way, developers interested in running the benchmark application can take the care needed to shut down unnecessary processes/tasks/threads and get more reliable results.
I see your point. The thing is, there must be a single PC used for benchmarking; otherwise the results would depend heavily on each user's PC configuration and be absolutely unreliable.
A possible solution is a virtual machine with a CPU limit running Google Benchmark, but that is a hard setup for anyone.
A possible workaround to the problem:
Usually, if solution B is faster than solution A, it will be so on any PC. Now suppose a contributor wants to improve the performance of the FloatAlmostZero test. They launch a script that compiles and executes EVERY historical version of this benchmark, ending with their new version. These metrics should keep the general trend of the benchmark's performance across builds; it is only the absolute height of the metrics that gets rescaled depending on PC performance, which is okay. The overall metric's general trend will not suffer either, since it is calculated by summing up the specific metrics, which all scale together at once.
Not quite sure whether that is possible in a continuous-integration scheme, though; checking out each past commit with git can probably work to obtain each version of the benchmark.
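The "run every historical version" idea above could be sketched like this. The build directory, target name, and filter value are my assumptions about the project layout; `--benchmark_filter` is Google Benchmark's standard flag for running a single test. The functions only build the per-commit command plan rather than executing it, so the actual run step stays up to whoever wires it in.

```python
def commands_for_commit(sha, test_name="FloatAlmostZero"):
    """Shell steps to benchmark one past version, returned as argv lists."""
    return [
        ["git", "checkout", sha],                         # obtain that version
        ["cmake", "--build", "build", "--target", "Benchmark"],
        ["./build/Benchmark",
         f"--benchmark_filter={test_name}",               # run only the test of interest
         "--benchmark_format=json",
         f"--benchmark_out={sha}.json"],                  # one result file per commit
    ]

def plan(shas, test_name="FloatAlmostZero"):
    """Flatten the per-commit steps into one ordered plan, oldest first."""
    return [cmd for sha in shas for cmd in commands_for_commit(sha, test_name)]
```

A driver script would feed `plan` the output of `git log` restricted to commits touching the benchmark source, then run each argv list with subprocess and collect the per-commit JSON files.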
Also, an easy fix for the Travis-CI fluctuations could be the exclusion of small tests (< 1000 ms) from the overall metrics; not sure though. Maybe manual timing can fix the 2ns->3ns->2ns issues when benchmarking float multiplication and the like (std::chrono's high-resolution clock seems to support microseconds), if that is the cause of the problem you described.
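The exclusion idea could be a one-liner over Google Benchmark's JSON output. A caveat on my part: this sketch assumes all `real_time` values are already in nanoseconds; real code would consult each entry's `time_unit` field, and the 1000 ms cutoff is just the number floated in this thread, not a vetted threshold.

```python
NS_PER_MS = 1_000_000

def overall_time(benchmarks, cutoff_ms=1000):
    """Sum real_time (assumed ns) over only the tests at or above the cutoff.

    `benchmarks` is the "benchmarks" list from Google Benchmark's JSON output;
    small, noise-prone tests fall below the cutoff and are excluded.
    """
    cutoff_ns = cutoff_ms * NS_PER_MS
    return sum(b["real_time"] for b in benchmarks if b["real_time"] >= cutoff_ns)
```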
@Hexlord That's a pretty nice draft!
Just looking for clarification: is the goal here to be able to track the performance changes caused by modifications to the PlayRho library, while attempting to filter any performance hits or gains caused by compiler / OS / build architecture updates?
Also, are we benchmarking the building times as well, or just the speed of the compiled code across X different test cases?
@NauticalMile64 Yes, kind of. I meant running the tests on a Pentium 4 versus a Core i7, following the proposal of user-submitted benchmark results, but your point matters too. Using Travis-CI is unreliable for performance-metric stability, for the known reasons. The best solution is a maintained server whose benchmark results can be trusted. No easy solution so far.
It occurs to me that I don't think PlayRho has a good enough timing test suite yet to put much energy into performance-trend analysis. I like to run the Benchmark every now and again, especially when I try changes that could be performance critical, but the results I see just running it on my own desktop vary more than I'm comfortable with for drawing conclusions. I mostly want to keep this issue open because I don't want to lose sight of it for the longer run.
Well, how about providing metrics only for releases, showing the frame-time decrease for each TestBed scene, with Box2D's frame time as the baseline, and with screenshots comparing scene squishiness shown when a node is hovered over? That would also possibly eliminate the timing issues, and it would look much more proper for presentation. I could prepare such a thing if you are interested.
While not automated, I've created a Wiki page for Benchmark Data. It needs help, though, from anyone who can build and run the Benchmark application on their hardware and submit the data for it.
Would be awesome to have a website that provided benchmarking trend analysis. Something like coveralls.io that showed the historical trend but for performance metrics (instead of unit test code coverage metrics).
Does such a website/service even exist?