ktbriggs-gc / omr

Eclipse OMR™: Cross-platform components for building reliable, high-performance language runtimes
http://www.eclipse.org/omr

Regression in 68e1176b #1

Closed ktbriggs-gc closed 3 years ago

ktbriggs-gc commented 3 years ago

Last commit introduced a bug that prevents the outside copyspaces from refreshing as intended while tail-filling. This degrades GC throughput overall but especially in xml.validation benchmark runs, where frequent stack overflow exacerbates the issue. This manifests as high (>1000) stf/ttf counts in trace output, and also increased global gc activity in xml.validation. Working on a fix for this...

ktbriggs-gc commented 3 years ago

The regression was due to a problem introduced in MM_Evacuator::selectCopyspace() when I rearranged the logic for admitting objects for outside copy. Fixed in MM_Evacuator::selectCopyspace().

evacuator xml.validation
start   20:43:03
end     20:51:17
score   194.21
cpu=391%; time=495.17; time-sliced=225288; wait=81472

scavenger xml.validation
start   20:55:18
end     21:03:49
score   188.08
cpu=395%; time=511.28; time-sliced=228340; wait=92569

I will run more regression benchmarks and publish summary results when the fix is committed.

There is some performance rework coming with this fix, mainly to eliminate some unnecessary read barriers and atomic operations that were introduced some time ago as a safeguard while working out multithreading issues. The gcc compiler fence is adequate for present purposes, but some barriers might have to be reintroduced for other platforms/compilers. All volatile data are declared as such, and most are protected by a mutex. The exceptions are in the controller, which uses atomic add() to track survivor/tenure memory allocation volumes and atomic bitwise OR/AND to register/unregister evacuator threads in the bound evacuator bit map as they come online and exit.
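As a minimal sketch of the lock-free pattern described above (using std::atomic and illustrative names rather than OMR's own atomic helpers, so none of these identifiers are the actual controller API):

```cpp
#include <atomic>
#include <cstdint>

// Illustrative sketch only: names and widths are assumptions, not OMR's
// actual controller code. Each evacuator thread owns one bit in the map.
static std::atomic<uint64_t> boundEvacuatorMap{0};
static std::atomic<uint64_t> survivorVolume{0};

// Register an evacuator thread when it comes online (atomic bitwise OR).
inline void bindEvacuator(unsigned index) {
    boundEvacuatorMap.fetch_or(uint64_t(1) << index);
}

// Unregister the evacuator thread when it exits (atomic bitwise AND).
inline void unbindEvacuator(unsigned index) {
    boundEvacuatorMap.fetch_and(~(uint64_t(1) << index));
}

// Track survivor memory allocation volume with an atomic add.
inline void addSurvivorVolume(uint64_t bytes) {
    survivorVolume.fetch_add(bytes);
}
```

The fetch_or/fetch_and pair lets threads join and leave the bit map concurrently without a mutex, at the cost of the atomic RMW traffic the rework aims to minimize elsewhere.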

There is some light refactoring and some comment updating along with the changes for the fix; for the most part these are not substantial from a performance perspective and are intended to improve readability and maintainability.

ktbriggs-gc commented 3 years ago

Closing this, see commit comment for changes. Summary benchmarking results for this commit (2 runs per benchmark per gc configuration) are below. See EvacuatorOverview.pdf for an explanation of the benchmarking setup and metrics shown here. Detail spreadsheets and raw data (vgc log, SPEC log, evacuator/scavenger trace output) are also available on request.

[image: summary benchmarking results]

ktbriggs-gc commented 3 years ago

I'm reopening this because there were still some residual high stf/ttf (survivor/tenure tail filling condition) counts in the trace output. These were present before this issue was first raised but had a different origin, occurring only when inside and outside copyspaces were synchronously tail filling. This is a relatively rare event but it is not being handled well (primitives are forced into the overflow copyspace, inhibiting outside copyspace refreshment). I have a fix for this that I will commit after benchmarking.

The next commit will also change how copy is directed while scanning inside the topmost stack frame. The previous commit revealed a way to handle this; the change was a fluke, but it was responsible for the high cache line containment seen in the benchmarking results posted above. The key is to force inside copy whenever possible while scanning the first object copied into the frame, but inhibit inside copy while scanning the objects copied out of that first object, so that the frame is popped just after scanning the last copied referent. To see why this makes a difference, consider scanning an array of String in *(nil-2). Each String is pushed into *(nil-1) and scanned with no option to push up the stack. Previously the referent char[] array was being forced outside, without collocation with the referring String, for each array element. This change ensures that these associated (String, char[]) objects are almost always collocated in the topmost frame as intended, and similarly for other objects that are first to be copied and scanned inside the topmost frame.
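A toy model of that copy-direction rule (entirely illustrative; the names and structure here are assumptions, not the evacuator's actual interface):

```cpp
// Toy model of the topmost-frame copy-direction rule; names are
// illustrative, not MM_Evacuator's actual code.
struct TopFrameScan {
    // True while scanning the first object copied into the frame
    // (e.g. a String pushed into *(nil-1)); false while scanning the
    // objects that were copied out of that first object.
    bool scanningFirstObject;

    // Should this referent be copied inside the topmost frame?
    bool copyInside() const {
        // Force inside copy for referents of the first object, so e.g.
        // a String's char[] lands adjacent to the referring String;
        // inhibit inside copy for deeper referents so the frame is
        // popped just after the last copied referent is scanned.
        return scanningFirstObject;
    }
};
```

In this model a String's char[] is admitted inside the frame (collocated), while anything the char[] might in turn refer to is directed outside, keeping the frame shallow.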

ktbriggs-gc commented 3 years ago

Closing this, benchmarking results below. I've included the coefficients of correlation (ρ) between interval-ms and cache%.

[image: benchmarking results, including ρ between interval-ms and cache%]

ktbriggs-gc commented 3 years ago

Coefficients of correlation between kb/ms (gc throughput) and cache% from the benchmarking runs above:

[image: correlation coefficients between kb/ms and cache%]

Note that the gc hits these (collocated references) always and only when copying the referent. During subsequent gc cycles the application may never hit them (they dissipate as their endpoints die young) or may hit them repeatedly and often. For example, almost all of derby's activity takes place in tenure space, where the entire database is maintained in RAM. This likely accounts for the strong negative correlation between collocation and interval times seen with derby (cache interval).
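For reference, the ρ values quoted above are ordinary Pearson correlation coefficients; a minimal sketch of the computation (assuming equal-length, non-degenerate samples):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation coefficient between two equal-length samples,
// as used for the rho values relating gc throughput (kb/ms) and
// interval-ms to cache%. Sketch only; assumes n > 1 and nonzero variance.
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size();
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sx += x[i];
        sy += y[i];
        sxx += x[i] * x[i];
        syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    const double cov = sxy - sx * sy / n; // n * covariance
    const double vx = sxx - sx * sx / n;  // n * variance of x
    const double vy = syy - sy * sy / n;  // n * variance of y
    return cov / std::sqrt(vx * vy);
}
```

A ρ near -1 between interval-ms and cache%, as reported for derby, means gc intervals shorten as cache line containment rises.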