Closed. @pwxy closed this issue 4 years ago.
@trilinos/muelu
@pwxy I think there is some confusion here. First, the src/MueLu_CoalesceDropFactory files are not uncoupled aggregation, just graph creation. The uncoupled aggregation itself has never taken anywhere close to 1/8 of the total time, as you can see in your MLagg* timers.
We have a threaded version of the CoalesceDropFactory, but it requires MueLu to be configured with MueLu_ENABLE_Kokkos=ON and "use kokkos: true" in the parameter list. It should work for the scalar case, but I'm not sure if it works for the block case; @tawiesn should know.
@aprokop Sorry, I was being sloppy. What I was reporting as "uncoupled aggregation time" was the MueLu UncoupledAggregationFactory "Build (total)" timer rather than just "Build" (not "total"). "Build (total)" includes the extra time for prerequisites such as graph creation from MueLu_CoalesceDropFactory, while "Build" (not "total") is strictly the uncoupled aggregation time, i.e., the sum of phases 1, 2a, and 2b.
So I need MueLu_ENABLE_Kokkos=ON in addition to:
-D MueLu_ENABLE_Experimental:BOOL=ON
-D MueLu_ENABLE_Kokkos_Refactor:BOOL=ON
Thanks for letting me know about the threaded version of CoalesceDropFactory. When you say "It should work for the scalar case, but I'm not sure if it works for block case", by "block case" do you mean systems of PDEs? Drekar only uses Tpetra point CRS rather than BlockCRS, but we do have systems of PDEs.
@aprokop I tried adding " MueLu_ENABLE_Kokkos=ON" to the list of cmake options, but I got:
Manually-specified variables were not used by the project:
MueLu_ENABLE_Kokkos
what am I doing wrong?
Looking through the source, I saw that @aprokop meant:
-D MueLu_ENABLE_Kokkos_Refactor:BOOL=ON
<Parameter name="use kokkos refactor" type="bool" value="true"/>
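For reference, a minimal sketch of where that flag would sit in a MueLu parameter-list XML file; the surrounding entries are illustrative defaults, not taken from this run:

```xml
<ParameterList name="MueLu">
  <!-- illustrative settings; the flag under discussion is the last entry -->
  <Parameter name="verbosity"           type="string" value="high"/>
  <Parameter name="multigrid algorithm" type="string" value="sa"/>
  <!-- switch the setup to the Kokkos-refactored factories -->
  <Parameter name="use kokkos refactor" type="bool"   value="true"/>
</ParameterList>
```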
@pwxy Sorry, you are right, I was doing it from memory.
I assume MueLu has been reasonably tested with "use kokkos refactor"=true?
With this MueLu parameter, the entire calculation takes 10x longer. Even with only 1 thread, it is ghastly slow. Something does not seem right. If it matters, I am using the May 24, 2017 version of the Trilinos develop branch.
Output seems to take forever to get past the "Stage 1 (LocalQR)" line:
Level 1
 Build (N5MueLu24RebalanceTransferFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE)
  Build (N5MueLu18RepartitionFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE)
   Computing Ac (N5MueLu10RAPFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE)
    Build (N5MueLu24TentativePFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE)
     Build (N5MueLu34UncoupledAggregationFactory_kokkosIixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE)
      Build (N5MueLu26CoalesceDropFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE)
       Build (N5MueLu19AmalgamationFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE)
        AmalagamationFactory::Build(): found fullblocksize=4 and stridedblocksize=4 from strided maps. offset=0
******* WARNING *******
lightweight wrap is deprecated
algorithm = "classical": threshold = 0.00, blocksize = 4
******* WARNING *******
Level::Set: unable to store "Filtering" generated by factory 0x234d7ea0 on level 0, as it has not been requested and no keep flags were set for it
Detected 0 Dirichlet nodes
Algo "Phase - (Dirichlet)"
BuildAggregates (Phase - (Dirichlet))
aggregated : 0 (phase), 0/8000 [0.00%] (total)
remaining : 8000
aggregates : 0 (phase), 0 (total)
Algo "Phase 1 (main)"
BuildAggregates (Phase 1 (main))
aggregated : 4267 (phase), 4267/8000 [53.34%] (total)
remaining : 3733
aggregates : 17 (phase), 17 (total)
Algo "Phase 2a (secondary)"
BuildAggregates (Phase 2a (secondary))
aggregated : 1518 (phase), 5785/8000 [72.31%] (total)
remaining : 2215
aggregates : 8 (phase), 25 (total)
Algo "Phase 2b (expansion)"
BuildAggregates (Phase 2b (expansion))
aggregated : 2215 (phase), 8000/8000 [100.00%] (total)
remaining : 0
aggregates : 0 (phase), 25 (total)
Algo "Phase 3 (cleanup)"
BuildAggregates (Phase 3 (cleanup))
aggregated : 0 (phase), 8000/8000 [100.00%] (total)
remaining : 0
aggregates : 0 (phase), 25 (total)
"UC": N5MueLu17Aggregates_kokkosIixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE{nGlobalAggregates = 25}
Build (N5MueLu23CoarseMapFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE)
Calc AggSizes
Create Agg2RowMap
Stage 1 (LocalQR)
I assume MueLu has been reasonably tested with "use kokkos refactor"=true?
The answer is no :) This needs further testing. I'm pretty sure some of the things are not optimized, and I don't think we've done full studies running a full simulation with it. I think @jhux2 may have a better idea.
@aprokop Could we just paste in Kokkos::sort, independently of whether MueLu is doing threaded aggregation?
TentativePFactory has recently been converted. Let's wait for @tawiesn to chime in.
@pwxy Note that Stage 1 (LocalQR) is from TentativePFactory, not aggregation. I just ran the MueLu Driver locally with/without Kokkos enabled, and I see that the time for TentativePFactory increases, which is consistent with what you're seeing.
@pwxy @mhoemmen
The only std::sort that I see is in the non-threaded version of MueLu. It does the sort for a Teuchos::Array. I would guess we could use Kokkos::sort in that version too, especially if it's such a bottleneck, but that means introducing explicit Kokkos code into the old version, which we have been avoiding so far.
Looking at the actual code that calls std::sort, it seems that it simply tries to merge the indices of multiple rows into a single array. So, essentially, we do n/N sorts, where N is the block size, with each sort covering approximately N*nnz_per_row entries. Theoretically, if the row indices are sorted, this could be done linearly through merging. I'm pretty sure that if one looks carefully at the code, we could get rid of some inefficiencies in it.
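To illustrate the merging idea (a host-side sketch of the approach, not the MueLu code; the function name is made up): if the per-row column indices are already sorted, the indices of the rows belonging to one block node can be combined with a set_union pass per row instead of concatenating everything and calling std::sort.

```cpp
// Hypothetical sketch of "merge sorted rows instead of sort":
// combine the (already sorted) column indices of the rows of one block node.
#include <algorithm>
#include <iterator>
#include <vector>

std::vector<int> mergeBlockRows(const std::vector<std::vector<int>>& rows) {
  std::vector<int> merged, tmp;
  for (const auto& row : rows) {                 // each row assumed sorted
    tmp.clear();
    std::set_union(merged.begin(), merged.end(),
                   row.begin(), row.end(),
                   std::back_inserter(tmp));     // result stays sorted and deduplicated
    merged.swap(tmp);
  }
  return merged;                                 // roughly linear in total entries for small block sizes
}
```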
Issue appears to be in the call to CreateAgg2RowMapLOFunctor.
@aprokop wrote:
I would guess we could use Kokkos::sort in that version too, especially if it's such a bottleneck, but that means introducing explicit Kokkos code into the old version which we have been avoiding so far.
The sort function is self-contained enough that MueLu could wrap it, and dispatch to Kokkos if that package is enabled.
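A rough sketch of such a wrapper, under the assumption that a configure-time macro guards the Kokkos path; Kokkos::sort on a View is a real Kokkos facility, but the wrapper itself and the macro name MUELU_HAVE_KOKKOS_SORT are made up for illustration.

```cpp
// Hypothetical wrapper: sort a contiguous index array, dispatching to
// Kokkos::sort when the Kokkos path is enabled at configure time.
// MUELU_HAVE_KOKKOS_SORT is an invented macro name, used only for illustration.
#include <algorithm>
#ifdef MUELU_HAVE_KOKKOS_SORT
#include <Kokkos_Core.hpp>
#include <Kokkos_Sort.hpp>
#endif

template <class LocalOrdinal>
void sortIndices(LocalOrdinal* data, size_t n) {
#ifdef MUELU_HAVE_KOKKOS_SORT
  // Wrap the existing buffer in an unmanaged host View and sort it in place.
  Kokkos::View<LocalOrdinal*, Kokkos::HostSpace,
               Kokkos::MemoryTraits<Kokkos::Unmanaged>> v(data, n);
  Kokkos::sort(v);
#else
  std::sort(data, data + n);
#endif
}
```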
Issue appears to be in the call to CreateAgg2RowMapLOFunctor.
That was for NumPDEs=1. For NumPDEs=3, the largest timer is for LocalQR, followed by Agg2RowMap.
@jhux2 @mhoemmen @aprokop Several comments: the LocalQR needed in TentativePFactory_kokkos is known to be super slow. I implemented it by hand with a focus on correctness, not performance. We need a fast Kokkos-enabled local QR decomposition in KokkosKernels; I've already talked to Siva about that. For efficiency and performance one probably needs some kind of batched local QR routine.
In the NumPDE>1 version of CoalesceDropFactory we need a local sort. Again, since there is nothing like that in KokkosKernels, I implemented a simple and slow bubble sort. We cannot call std::sort within kernels, and we cannot use Kokkos::sort, as that implements a plain sort (not an aggregate-wise local sort). We would need something like a sort2 that sorts by two keys.
The Kokkos versions of the UncoupledAggregation routines are not parallel: the aggregation algorithms use Kokkos Views, but there are no parallel_for loops inside the algorithms.
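Just to spell out what such a two-key "sort2" would have to do, here is an illustration only (host-side, not usable inside a kernel): entries are ordered by a primary key, e.g. the node or aggregate id, with ties broken by a secondary key such as the column index.

```cpp
// Illustration of a two-key ("sort2") ordering; host-side only, not kernel-safe.
#include <algorithm>
#include <utility>
#include <vector>

// Each entry is (primary key, secondary key), e.g. (aggregate id, column index).
void sort2(std::vector<std::pair<int, int>>& entries) {
  std::sort(entries.begin(), entries.end(),
            [](const std::pair<int, int>& a, const std::pair<int, int>& b) {
              if (a.first != b.first) return a.first < b.first;  // primary key
              return a.second < b.second;                        // secondary key
            });
}
```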
@tawiesn thanks for the clarification. I'm guessing it will be a couple of months before these issues get resolved?
@tawiesn
We need a fast Kokkos-enabled local QR decomposition in KokkosKernels. I've already talked to Siva about that. For efficiency and performance one probably needs some kind of batched local QR routines.
While that is true in the long term, in the short term I don't see why running on a single thread would be that much slower than a serial implementation.
In the NumPDE>1 version of CoalesceDropFactory we need a local sort. Again, since there is nothing like that in KokkosKernels, I implemented a simple and slow bubble sort. We cannot call std::sort within kernels, and we cannot use Kokkos::sort, as that implements a plain sort (not an aggregate-wise local sort). We would need something like a sort2 that sorts by two keys.
I don't think you have to. At the moment, it looks like you merge individual rows one by one and sort simultaneously. You could do it the same way as the original (non-Kokkos) factory: first merge all dofs into a single array and then sort.
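A sketch of that "merge first, sort once" structure (assumed names, not the factory code): gather the dofs of all rows of a block node into one buffer, then do a single sort-and-unique pass rather than sorting while merging row by row.

```cpp
// Hypothetical sketch of "gather all dofs, then sort once per block node".
#include <algorithm>
#include <vector>

std::vector<int> gatherThenSort(const std::vector<std::vector<int>>& blockRows) {
  std::vector<int> cols;
  for (const auto& row : blockRows)
    cols.insert(cols.end(), row.begin(), row.end());              // merge all dofs first
  std::sort(cols.begin(), cols.end());                            // one sort per block node
  cols.erase(std::unique(cols.begin(), cols.end()), cols.end());  // drop duplicates
  return cols;
}
```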
@pwxy I have no idea what the roadmap is to fix these performance issues.
@csiefer2 I think most of the immediate performance problems are fixed (at least for the scalar case), right? Any new results?
I didn't address the sort issue at all, but fixed other stuff in Phase-1 Uncoupled.
Perhaps a rerun of the scaling test would be in order...
Interested parties could try the current kokkos branch in aprokop/Trilinos.
Uncoupled aggregation is threaded (thanks to @lucbv's hard work).
Trilinos was built with the Kokkos OpenMP node type. I looked at thread scaling on a single KNL, running with 1 MPI process and increasing the number of threads from 1 to 64 (so one thread per core).
Although the matrix-matrix multiplication (KokkosKernels SpGEMM) thread-scales well, the uncoupled aggregation does not thread-scale at all. For 1 thread, uncoupled aggregation accounts for 1/3 of the MueLu setup time, and 40% of the uncoupled aggregation time (so 1/8 of the MueLu setup time) is due to the use of std::sort, which is not threaded (either line 916 or line 999, or both, of muelu/src/Graph/MueLu_CoalesceDropFactory_def.hpp). As the number of threads increases, and because RAP thread-scales well, std::sort becomes a larger fraction of the time. By the time we get to 64 threads, matrix-matrix multiplication is only ~7% of the MueLu setup time and uncoupled aggregation is 90% of it.
times in seconds

t    MLsetup  MLrap   MLagg  MLcoal  MLagg1  MLagg2a  MLagg2b
1    259.4    174.9   78.2   58.2    8.0     3.8      6.5
2    173.0     89.9   78.1   58.1    8.0     3.8      6.5
4    130.1     47.4   78.2   58.2    8.0     3.8      6.5
8    107.6     25.5   78.2   58.2    8.0     3.8      6.5
16    96.4     14.4   78.2   58.2    8.0     3.8      6.5
32    90.8      9.0   78.2   58.2    8.0     3.8      6.5
64    88.0      6.2   78.2   58.2    8.0     3.8      6.5

legend:
t       = number of threads
MLsetup = "MueLu Hierarchy: Setup (total)"
MLrap   = "MueLu RAPFactory: Computing Ac" (not "total")
MLagg   = "UncoupledAggregationFactory: Build (total)"
MLcoal  = "MueLu CoalesceDropFactory: Build" (not "total")
MLagg1  = "MueLu AggregationPhase1Algorithm: BuildAggregates"
MLagg2a = "MueLu AggregationPhase2aAlgorithm: BuildAggregates"
MLagg2b = "MueLu AggregationPhase2bAlgorithm: BuildAggregates"
What would be the schedule for threading uncoupled aggregation?