Do more per VM invocation

GoogleCodeExporter commented 9 years ago

Suppose I'm testing 4 x 3 different parameter values against 5 different 
benchmarks (different time- methods in the same class) on 2 VMs.  
Currently, to get one measurement each, we'll run 4 x 3 x 5 x 2 = 120 VM 
invocations.  I think 10 would be enough -- VMs times benchmarks, and let 
each run handle all 4 x 3 parameter combinations for that (VM, benchmark) 
pair.
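
To make the arithmetic concrete, here is a minimal sketch in Java. The class and method names are hypothetical and not part of Caliper; it only counts VM launches under the current scheme and the proposed one.

```java
// Hypothetical helper, not part of Caliper: counts VM launches under both schemes.
final class VmInvocationCount {
    /** Current scheme: one VM launch per (parameter combination, benchmark, VM). */
    static int perScenario(int[] paramSizes, int benchmarks, int vms) {
        return combinations(paramSizes) * benchmarks * vms;
    }

    /** Proposed scheme: one VM launch per (VM, benchmark) pair. */
    static int perVmBenchmarkPair(int benchmarks, int vms) {
        return benchmarks * vms;
    }

    /** Size of the cross product of the parameter value sets. */
    static int combinations(int[] paramSizes) {
        int product = 1;
        for (int size : paramSizes) {
            product *= size;
        }
        return product;
    }

    public static void main(String[] args) {
        int[] paramSizes = {4, 3};
        System.out.println(perScenario(paramSizes, 5, 2));  // 120
        System.out.println(perVmBenchmarkPair(5, 2));       // 10
    }
}
```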

The problem with the way it is today is that HotSpot can optimize away 
whole swaths of implementation code that doesn't happen to get exercised by 
the *one* scenario we run it with. By warming up all 12 of these benchmark 
instances, it should have to compile to something more closely resembling 
real life (maybe).  And with luck, we can avoid the expense of repeating 
the warmup period 12 times over.
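
A rough sketch of what warming up every scenario inside one VM might look like, assuming each scenario is wrapped as an `IntConsumer` that takes a rep count (similar in spirit to a time- method taking reps). `SharedWarmup`, `warmUpAll`, and `measureNanos` are made-up names, not Caliper API, and the round and rep counts would need real tuning.

```java
import java.util.List;
import java.util.function.IntConsumer;

// Hypothetical sketch, not Caliper code.
final class SharedWarmup {
    /**
     * Interleaves warm-up across all scenarios so the JIT profile reflects
     * every parameter combination, not just the one that is measured first.
     */
    static void warmUpAll(List<IntConsumer> scenarios, int rounds, int repsPerRound) {
        for (int round = 0; round < rounds; round++) {
            for (IntConsumer scenario : scenarios) {
                scenario.accept(repsPerRound);
            }
        }
    }

    /** Times a single scenario for the given number of reps, in nanoseconds. */
    static long measureNanos(IntConsumer scenario, int reps) {
        long start = System.nanoTime();
        scenario.accept(reps);
        return System.nanoTime() - start;
    }
}
```

The idea would be to call `warmUpAll` once per VM and then `measureNanos` for each scenario, so the warm-up cost is paid once rather than 12 times.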

After warming up all the different scenarios and then starting to do trials 
of one of them, I'm not sure if we need to worry about HotSpot deciding to 
*re*compile based on the new favorite scenario.  If that happens, maybe it 
makes sense for us to round-robin through the scenarios as we go... 
we'll see.

I'm also not sure how concerned we need to be that the order the scenarios 
are timed in can unduly affect the results. It could be that for each 
"redundant" measurement we take, we vary up the order (e.g. we rotate it?) 
in order to wash that out.  Or maybe there's no problem with this; I dunno.
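
One way the rotation idea could be sketched, assuming trials are numbered from 0 and the scenarios are held in a list. `RotatingOrder` and `orderForTrial` are hypothetical names, not anything Caliper provides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not Caliper code.
final class RotatingOrder {
    /** Returns the base order rotated left by {@code trial} positions. */
    static <T> List<T> orderForTrial(List<T> scenarios, int trial) {
        List<T> order = new ArrayList<>(scenarios);
        if (!order.isEmpty()) {
            // A negative distance rotates left: [A, B, C] becomes [B, C, A].
            Collections.rotate(order, -(trial % order.size()));
        }
        return order;
    }
}
```

Over as many trials as there are scenarios, each scenario gets timed once in every position, which is the kind of washing out the rotation is meant to give. Whether that is enough to cancel the ordering effect is still an open question.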

Original issue reported on code.google.com by kevinb@google.com on 22 Jan 2010 at 10:53

GoogleCodeExporter commented 9 years ago

Note that this is a correctness issue -- Caliper is currently reporting totally 
bogus results.

Original comment by kevinb@google.com on 7 Jun 2010 at 5:48

Changed title: Do more per VM invocation
Added labels: Milestone-0.5

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 14 Jan 2011 at 11:09

Added labels: Milestone-1.0

GoogleCodeExporter commented 9 years ago

I can definitely tell that the order of warming up will affect HotSpot 
statistics, and if different choices are made, the results will be different. 
This is the case we have in the JUnitBenchmarks project -- the order of JUnit 
tests turned into benchmarks does affect the outcome (I only noticed this when 
I compared the results against the same test executed in Caliper in separate VMs).

Original comment by dawid.weiss@gmail.com on 4 Mar 2011 at 4:43

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 19 Mar 2011 at 2:13

Added labels: Milestone-Post-1.0

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 19 Mar 2011 at 3:06

Added labels: Type-Enhancement

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 8 Feb 2012 at 9:49

Added labels: Component-Runner

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 1 Nov 2012 at 8:32

GoogleCodeExporter commented 9 years ago

+martin is expressing some concern about this too.

Original comment by kevinb@google.com on 11 Oct 2013 at 9:40

GoogleCodeExporter commented 9 years ago

I have also seen benchmark results highly dependent on the order of warmup - 
the JIT optimizes for the profile collected during the first method's warmup.  
That said, I would in general prefer to have my methods run in the same VM, to 
make execution less artificial.  So I support this change, but y'all had better 
vary warmup order.
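
A minimal sketch of one way to vary the warmup order, assuming a seeded shuffle per VM invocation so that a given run can be reproduced. `WarmupOrder` and `shuffled` are hypothetical names, not Caliper API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch, not Caliper code.
final class WarmupOrder {
    /** Returns a shuffled copy of the scenarios; the same seed gives the same order. */
    static <T> List<T> shuffled(List<T> scenarios, long seed) {
        List<T> order = new ArrayList<>(scenarios);
        Collections.shuffle(order, new Random(seed));
        return order;
    }
}
```

Recording the seed alongside the results would make it possible to replay an order that produced a suspicious measurement.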

Original comment by marti...@google.com on 11 Oct 2013 at 9:48

GoogleCodeExporter commented 9 years ago

I remain skeptical that such a change would make benchmarks less artificial.  
They would just be artificial in a different way that is less predictable.  
We know that microbenchmarks are somewhat contrived anyway, so half-efforts 
toward making them "realistic" feel a little futile.

There's also a fair bit of work to be done to figure out how to make this work 
since the relatively simple estimation work that we do to guess at reps for 
microbenchmark warmup would need to be replaced.

That said, I see no reason not to have it as an option if we can devise a 
strategy to make it work.

I should also mention that if our primary concern is total run execution time, 
a far easier target than this warmup business is just running instruments 
that aren't sensitive to resource constraints (e.g. the allocation instrument) 
concurrently.
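
As a very rough illustration of the concurrency idea, assuming each resource-insensitive trial can be wrapped as a `Callable` (in practice that would probably mean launching a worker VM). `ConcurrentInstruments` and `runConcurrently` are hypothetical names, not Caliper API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not Caliper code.
final class ConcurrentInstruments {
    /** Runs all trials on a fixed-size pool and returns results in submission order. */
    static <T> List<T> runConcurrently(List<Callable<T>> trials, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<T> results = new ArrayList<>();
            // invokeAll blocks until every trial has completed.
            for (Future<T> future : pool.invokeAll(trials)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running trials", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("trial failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Timing-sensitive trials would still need to run one at a time; only measurements such as allocation counts, which don't depend on CPU contention, would go through something like this.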

Original comment by gak@google.com on 12 Oct 2013 at 2:37

toshsan / caliper

Do more per VM invocation #32