Change order of array allocation to improve reference kernel performance

The following email thread captures the options available for improving the performance of the reference HPCG kernels. The first suggestion (allocating mtxIndL, matrixValues and mtxIndG in separate loops) certainly makes sense. The second suggestion (allocating a single large array for each of them and setting the row pointers to point inside it) is certainly superior on current mainstream architectures, but it reduces code readability. It also gives up the broader value of an array-of-pointers data structure: the ability to reallocate data on a row-by-row basis.

I won't implement the second suggestion.
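The row-by-row reallocation that the array-of-pointers layout allows can be illustrated with a minimal sketch. The helper `growRow` and the driver `growRowDemo` are hypothetical names introduced here for illustration and are not part of HPCG; `local_int_t` is assumed to be `int`, as in a default build.

```cpp
#include <algorithm>
#include <cassert>

// Assumption: local_int_t is int, as in a default HPCG build.
typedef int local_int_t;

// Hypothetical helper (not part of HPCG): replace one row's storage with a
// larger array, keeping the existing entries. This is only valid when every
// row was allocated with its own new[].
void growRow(double** matrixValues, local_int_t row,
             local_int_t oldLength, local_int_t newLength) {
  assert(newLength >= oldLength);
  double* bigger = new double[newLength]();  // new entries start at 0.0
  std::copy(matrixValues[row], matrixValues[row] + oldLength, bigger);
  delete [] matrixValues[row];               // frees this row only
  matrixValues[row] = bigger;
}

// Small driver: returns true if the grown row kept its data and a
// neighboring row's storage was left untouched.
bool growRowDemo() {
  const local_int_t rows = 3, nnz = 2;
  double** matrixValues = new double*[rows];
  for (local_int_t i = 0; i < rows; ++i) {
    matrixValues[i] = new double[nnz];
    for (local_int_t j = 0; j < nnz; ++j) matrixValues[i][j] = 10.0 * i + j;
  }
  double* neighbor = matrixValues[2];
  growRow(matrixValues, 1, nnz, 5);
  bool ok = matrixValues[1][0] == 10.0 && matrixValues[1][1] == 11.0 &&
            matrixValues[1][4] == 0.0 && matrixValues[2] == neighbor;
  for (local_int_t i = 0; i < rows; ++i) delete [] matrixValues[i];
  delete [] matrixValues;
  return ok;
}
```

With a single large allocation, the `delete []` on an individual row would be invalid, because only the first row pointer owns memory.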

Mike H.,

If the modification is acceptable, it could be improved even further, to better ensure compaction/contiguity in the arrays, with something like:

    // Now allocate the arrays pointed to
    mtxIndL[0] = new local_int_t[localNumberOfRows * numberOfNonzerosPerRow];
    matrixValues[0] = new double[localNumberOfRows * numberOfNonzerosPerRow];
    mtxIndG[0] = new global_int_t[localNumberOfRows * numberOfNonzerosPerRow];
    // Point each remaining row into the block owned by row 0
    for (local_int_t i=1; i< localNumberOfRows; ++i) {
      mtxIndL[i] = mtxIndL[0] + i * numberOfNonzerosPerRow;
      matrixValues[i] = matrixValues[0] + i * numberOfNonzerosPerRow;
      mtxIndG[i] = mtxIndG[0] + i * numberOfNonzerosPerRow;
    }

Mike Davis
Cielo Applications Analyst
Cray Inc. / Sandia National Laboratories
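The single-block allocation described in the message above also changes how the memory must be released. The following is a minimal, self-contained sketch for one of the three arrays. The function names are hypothetical and not taken from HPCG, and `local_int_t` is assumed to be `int`.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: local_int_t is int, as in a default HPCG build.
typedef int local_int_t;

// Hypothetical helper: allocate one block for all rows and point each
// row pointer at its slice of that block.
double** allocateContiguousRows(local_int_t localNumberOfRows,
                                local_int_t numberOfNonzerosPerRow) {
  assert(localNumberOfRows > 0 && numberOfNonzerosPerRow > 0);
  const std::size_t rows = static_cast<std::size_t>(localNumberOfRows);
  const std::size_t nnz = static_cast<std::size_t>(numberOfNonzerosPerRow);
  double** matrixValues = new double*[rows];
  matrixValues[0] = new double[rows * nnz];
  for (std::size_t i = 1; i < rows; ++i)
    matrixValues[i] = matrixValues[0] + i * nnz;
  return matrixValues;
}

// Only row 0 owns memory; rows 1..n-1 alias into the same block.
void freeContiguousRows(double** matrixValues) {
  delete [] matrixValues[0];
  delete [] matrixValues;
}

// Returns true if every row starts exactly where the previous one ends.
bool rowsAreContiguous(double** matrixValues, local_int_t localNumberOfRows,
                       local_int_t numberOfNonzerosPerRow) {
  for (local_int_t i = 1; i < localNumberOfRows; ++i)
    if (matrixValues[i] != matrixValues[i - 1] + numberOfNonzerosPerRow)
      return false;
  return true;
}
```

Any existing cleanup code that calls `delete []` on every row would have to change to match, which is one of the readability costs of this layout.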

From: Heroux, Michael A
Sent: Tuesday, August 18, 2015 10:41 AM
To: Davis, Mike E
Cc: Rajan, Mahesh; Bookey, Zachary A; Demeshko, Irina Petrovna (-EXP); Rajamanickam, Sivasankaran (-EXP)
Subject: Re: hpcg question

Hi Mike,

This is an interesting observation. I can certainly add your version of the loops to the reference code, under the assumption that the change would be beneficial in general, which is probably a reasonable guess for any cache-based microprocessor.

Thanks for sending this to me. Although the reference version of HPCG is not intended to be performant, there is no reason to avoid general performance improvements.

I am copying a student and colleagues (Zach Bookey, Irina Demeshko and Siva Rajamanickam, respectively), who are working on a Kokkos version of the code, in case the same optimization could be helpful for them.

Thanks again.

Mike

From: "Davis, Mike E" medavis@sandia.gov
Date: Tuesday, August 18, 2015 at 11:05 AM
To: Michael A Heroux maherou@sandia.gov
Cc: Mahesh Rajan mrajan@sandia.gov, "Davis, Mike E" medavis@sandia.gov
Subject: hpcg question

Mike H., I’ve been doing some runs of HPCG and have found that I get a significant speedup out of ComputeSYMGS when I rearrange the order of allocations of arrays in GenerateProblem. My change to GenerateProblem is shown below (a separate loop for each array). My question is, is this a legitimate change to make in the code? Or might you consider making this change if it turns out to benefit everyone? Or is the current method (with vectors broken up) more representative of “the real world”? Thanks for any feedback you can provide.

    // Now allocate the arrays pointed to
    for (local_int_t i=0; i< localNumberOfRows; ++i) {
      mtxIndL[i] = new local_int_t[numberOfNonzerosPerRow];
      matrixValues[i] = new double[numberOfNonzerosPerRow];
      mtxIndG[i] = new global_int_t[numberOfNonzerosPerRow];
    }

    // Now allocate the arrays pointed to
    for (local_int_t i=0; i< localNumberOfRows; ++i) {
      mtxIndL[i] = new local_int_t[numberOfNonzerosPerRow];
    }
    for (local_int_t i=0; i< localNumberOfRows; ++i) {
      matrixValues[i] = new double[numberOfNonzerosPerRow];
    }
    for (local_int_t i=0; i< localNumberOfRows; ++i) {
      mtxIndG[i] = new global_int_t[numberOfNonzerosPerRow];
    }

Mike Davis
Cielo Applications Analyst
Cray Inc. / Sandia National Laboratories
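To make the access pattern behind the reported ComputeSYMGS speedup more concrete, here is a simplified forward Gauss-Seidel sweep over array-of-pointers row storage. It is a sketch in the spirit of the reference kernel, not the HPCG source: the function signatures, the separate `diag` array and the 2x2 test system are assumptions made for illustration. The sweep reads `matrixValues` and `mtxIndL` for each row but never touches `mtxIndG`, which may be part of why keeping each array's rows together in memory helps; the thread itself does not state the cause.

```cpp
#include <cassert>
#include <cmath>

// Assumption: local_int_t is int, as in a default HPCG build.
typedef int local_int_t;

// Simplified forward sweep: for each row i, solve for x[i] using the most
// recent values of the other unknowns.
void forwardSweep(local_int_t nrows, const local_int_t* nonzerosInRow,
                  double** matrixValues, local_int_t** mtxIndL,
                  const double* diag, const double* r, double* x) {
  for (local_int_t i = 0; i < nrows; ++i) {
    assert(diag[i] != 0.0);
    const double* currentValues = matrixValues[i];
    const local_int_t* currentColIndices = mtxIndL[i];
    double sum = r[i];
    for (local_int_t j = 0; j < nonzerosInRow[i]; ++j)
      sum -= currentValues[j] * x[currentColIndices[j]];
    sum += x[i] * diag[i];  // undo the diagonal term subtracted in the loop
    x[i] = sum / diag[i];
  }
}

// Driver on the system [[4,1],[1,3]] x = [1,2], starting from x = 0.
// Rows are allocated in separate loops, as in the modified GenerateProblem.
bool forwardSweepDemo() {
  const local_int_t n = 2, nnz = 2;
  local_int_t** mtxIndL = new local_int_t*[n];
  double** matrixValues = new double*[n];
  for (local_int_t i = 0; i < n; ++i) mtxIndL[i] = new local_int_t[nnz];
  for (local_int_t i = 0; i < n; ++i) matrixValues[i] = new double[nnz];
  const double a[2][2] = {{4.0, 1.0}, {1.0, 3.0}};
  for (local_int_t i = 0; i < n; ++i)
    for (local_int_t j = 0; j < nnz; ++j) {
      mtxIndL[i][j] = j;
      matrixValues[i][j] = a[i][j];
    }
  const local_int_t nonzerosInRow[2] = {2, 2};
  const double diag[2] = {4.0, 3.0};
  const double r[2] = {1.0, 2.0};
  double x[2] = {0.0, 0.0};
  forwardSweep(n, nonzerosInRow, matrixValues, mtxIndL, diag, r, x);
  // Row 0: x0 = 1/4. Row 1: x1 = (2 - 1*x0)/3.
  bool ok = std::fabs(x[0] - 0.25) < 1e-12 &&
            std::fabs(x[1] - 1.75 / 3.0) < 1e-12;
  for (local_int_t i = 0; i < n; ++i) {
    delete [] mtxIndL[i];
    delete [] matrixValues[i];
  }
  delete [] mtxIndL;
  delete [] matrixValues;
  return ok;
}
```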
