Closed. @pwxy closed this issue 4 years ago.
@trilinos/muelu
@pwxy I think there is some confusion here. First, the src/MueLu_CoalesceDropFactory files are not uncoupled aggregation, just graph creation. The uncoupled aggregation itself has never taken anywhere close to 1/8 of the total time, as you can see in your MLagg* timers.
We have a threaded version of the CoalesceDropFactory, but it requires MueLu to be configured with MueLu_ENABLE_Kokkos=ON and "use kokkos: true" in the parameter list. It should work for the scalar case, but I'm not sure if it works for the block case; @tawiesn should know.
@aprokop Sorry, I was being sloppy. What I was reporting as "uncoupled aggregation time" was the MueLu UncoupledAggregationFactory "Build (total)" timer rather than just "Build" (not "total"). "Build (total)" includes the extra time for prerequisites such as graph creation from MueLu_CoalesceDropFactory, while "Build" (not "total") is strictly the uncoupled aggregation time, i.e., the sum of phases 1, 2a, and 2b.
So I need MueLu_ENABLE_Kokkos=ON in addition to:
-D MueLu_ENABLE_Experimental:BOOL=ON
-D MueLu_ENABLE_Kokkos_Refactor:BOOL=ON
Thanks for letting me know about the threaded version of CoalesceDropFactory. When you say "It should work for the scalar case, but I'm not sure if it works for block case", by "block case" do you mean systems of PDEs? Drekar only uses Tpetra point CRS rather than BlockCRS, but we do have systems of PDEs.
@aprokop I tried adding " MueLu_ENABLE_Kokkos=ON" to the list of cmake options, but I got:
Manually-specified variables were not used by the project:
MueLu_ENABLE_Kokkos
what am I doing wrong?
Looking through the source, I saw that @aprokop meant:
-D MueLu_ENABLE_Kokkos_Refactor:BOOL=ON
<Parameter name="use kokkos refactor" type="bool" value="true"/>
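For reference, a minimal sketch of where that flag would sit in a MueLu parameter-list XML file; the surrounding entries are illustrative defaults, not taken from this run:

```xml
<ParameterList name="MueLu">
  <!-- illustrative settings; the flag under discussion is the last entry -->
  <Parameter name="verbosity"           type="string" value="high"/>
  <Parameter name="multigrid algorithm" type="string" value="sa"/>
  <!-- switch the setup to the Kokkos-refactored factories -->
  <Parameter name="use kokkos refactor" type="bool"   value="true"/>
</ParameterList>
```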
@pwxy Sorry, you are right, I was doing it from memory.
I assume MueLu has been reasonably tested with "use kokkos refactor"=true?
With this MueLu parameter, the entire calculation takes 10x longer. Even with only 1 thread, it is ghastly slow. Something does not seem right. If it matters, I am using the May 24, 2017 version of the Trilinos develop branch.
Output seems to take forever to get past the "Stage 1 (LocalQR)" line:
Level 1
 Build (N5MueLu24RebalanceTransferFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE)
  Build (N5MueLu18RepartitionFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE)
   Computing Ac (N5MueLu10RAPFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE)
    Build (N5MueLu24TentativePFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE)
     Build (N5MueLu34UncoupledAggregationFactory_kokkosIixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE)
      Build (N5MueLu26CoalesceDropFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE)
       Build (N5MueLu19AmalgamationFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE)
        AmalagamationFactory::Build(): found fullblocksize=4 and stridedblocksize=4 from strided maps. offset=0
******* WARNING *******
lightweight wrap is deprecated
algorithm = "classical": threshold = 0.00, blocksize = 4
******* WARNING *******
Level::Set: unable to store "Filtering" generated by factory 0x234d7ea0 on level 0, as it has not been requested and no keep flags were set for it
Detected 0 Dirichlet nodes
Algo "Phase - (Dirichlet)"
BuildAggregates (Phase - (Dirichlet))
aggregated : 0 (phase), 0/8000 [0.00%] (total)
remaining : 8000
aggregates : 0 (phase), 0 (total)
Algo "Phase 1 (main)"
BuildAggregates (Phase 1 (main))
aggregated : 4267 (phase), 4267/8000 [53.34%] (total)
remaining : 3733
aggregates : 17 (phase), 17 (total)
Algo "Phase 2a (secondary)"
BuildAggregates (Phase 2a (secondary))
aggregated : 1518 (phase), 5785/8000 [72.31%] (total)
remaining : 2215
aggregates : 8 (phase), 25 (total)
Algo "Phase 2b (expansion)"
BuildAggregates (Phase 2b (expansion))
aggregated : 2215 (phase), 8000/8000 [100.00%] (total)
remaining : 0
aggregates : 0 (phase), 25 (total)
Algo "Phase 3 (cleanup)"
BuildAggregates (Phase 3 (cleanup))
aggregated : 0 (phase), 8000/8000 [100.00%] (total)
remaining : 0
aggregates : 0 (phase), 25 (total)
"UC": N5MueLu17Aggregates_kokkosIixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE{nGlobalAggregates = 25}
Build (N5MueLu23CoarseMapFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE)
Calc AggSizes
Create Agg2RowMap
Stage 1 (LocalQR)
I assume MueLu has been reasonably tested with "use kokkos refactor"=true?
The answer is no :) This needs further testing. I'm pretty sure some of the things are not optimized, and I don't think we've done full studies running a full simulation with it. I think @jhux2 may have a better idea.
@aprokop Could we just paste in Kokkos::sort, independently of whether MueLu is doing threaded aggregation?
TentativePFactory has recently been converted. Let's wait for @tawiesn to chime in.
@pwxy Note that Stage 1 (LocalQR) is from TentativePFactory, not aggregation. I just ran the MueLu Driver locally with/without Kokkos enabled, and I see that the time for TentativePFactory increases, which is consistent with what you're seeing.
@pwxy @mhoemmen
The only std::sort that I see is in the non-threaded version of MueLu. It does the sort for a Teuchos::Array. I would guess we could use Kokkos::sort in that version too, especially if it's such a bottleneck, but that means introducing explicit Kokkos code into the old version, which we have been avoiding so far.
Looking at the actual code that calls std::sort, it seems that it simply tries to merge the indices of multiple rows into a single array. So, essentially, we do n/N sorts, where N is the block size, with each sort covering approximately N*nnz_per_row entries. Theoretically, if the row indices are sorted, this could be done linearly through merging. I'm pretty sure that if one looks carefully at the code, we could get rid of some inefficiencies in it.
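To illustrate the merging idea (a host-side sketch of the approach, not the MueLu code; the function name is made up): if the per-row column indices are already sorted, the indices of the rows belonging to one block node can be combined with a set_union pass per row instead of concatenating everything and calling std::sort.

```cpp
// Hypothetical sketch of "merge sorted rows instead of sort":
// combine the (already sorted) column indices of the rows of one block node.
#include <algorithm>
#include <iterator>
#include <vector>

std::vector<int> mergeBlockRows(const std::vector<std::vector<int>>& rows) {
  std::vector<int> merged, tmp;
  for (const auto& row : rows) {                 // each row assumed sorted
    tmp.clear();
    std::set_union(merged.begin(), merged.end(),
                   row.begin(), row.end(),
                   std::back_inserter(tmp));     // result stays sorted and deduplicated
    merged.swap(tmp);
  }
  return merged;                                 // roughly linear in total entries for small block sizes
}
```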
Issue appears to be in the call to CreateAgg2RowMapLOFunctor.
@aprokop wrote:
I would guess we could use Kokkos::sort in that version too, especially if it's such a bottleneck, but that means introducing explicit Kokkos code into the old version which we have been avoiding so far.
The sort function is self-contained enough that MueLu could wrap it, and dispatch to Kokkos if that package is enabled.
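A rough sketch of such a wrapper, under the assumption that a configure-time macro guards the Kokkos path; Kokkos::sort on a View is a real Kokkos facility, but the wrapper itself and the macro name MUELU_HAVE_KOKKOS_SORT are made up for illustration.

```cpp
// Hypothetical wrapper: sort a contiguous index array, dispatching to
// Kokkos::sort when the Kokkos path is enabled at configure time.
// MUELU_HAVE_KOKKOS_SORT is an invented macro name, used only for illustration.
#include <algorithm>
#ifdef MUELU_HAVE_KOKKOS_SORT
#include <Kokkos_Core.hpp>
#include <Kokkos_Sort.hpp>
#endif

template <class LocalOrdinal>
void sortIndices(LocalOrdinal* data, size_t n) {
#ifdef MUELU_HAVE_KOKKOS_SORT
  // Wrap the existing buffer in an unmanaged host View and sort it in place.
  Kokkos::View<LocalOrdinal*, Kokkos::HostSpace,
               Kokkos::MemoryTraits<Kokkos::Unmanaged>> v(data, n);
  Kokkos::sort(v);
#else
  std::sort(data, data + n);
#endif
}
```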
Issue appears to be in the call to CreateAgg2RowMapLOFunctor.
That was for NumPDEs=1. For NumPDEs=3, the largest timer is for LocalQR, followed by Agg2RowMap.
@jhux2 @mhoemmen @aprokop Several comments: the LocalQR needed in TentativePFactory_kokkos is known to be super slow. I implemented it by hand with a focus on correctness, not performance. We need a fast Kokkos-enabled local QR decomposition in KokkosKernels; I've already talked to Siva about that. For efficiency and performance one probably needs some kind of batched local QR routine.
In the NumPDE>1 version of CoalesceDropFactory we need a local sort. Again, since there is nothing like that in KokkosKernels, I implemented a simple and slow bubble sort. We cannot call std::sort within kernels, and we cannot use Kokkos::sort, as that implements a plain sort (not an aggregate-wise local sort). We would need something like a sort2 that sorts by two keys.
The Kokkos versions of the UncoupledAggregation routines are not parallel: the aggregation algorithms use Kokkos Views, but there are no parallel_for loops inside the algorithms.
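Just to spell out what such a two-key "sort2" would have to do, here is an illustration only (host-side, not usable inside a kernel): entries are ordered by a primary key, e.g. the node or aggregate id, with ties broken by a secondary key such as the column index.

```cpp
// Illustration of a two-key ("sort2") ordering; host-side only, not kernel-safe.
#include <algorithm>
#include <utility>
#include <vector>

// Each entry is (primary key, secondary key), e.g. (aggregate id, column index).
void sort2(std::vector<std::pair<int, int>>& entries) {
  std::sort(entries.begin(), entries.end(),
            [](const std::pair<int, int>& a, const std::pair<int, int>& b) {
              if (a.first != b.first) return a.first < b.first;  // primary key
              return a.second < b.second;                        // secondary key
            });
}
```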
@tawiesn thanks for the clarification. I'm guessing it will be a couple of months before these issues get resolved?
@tawiesn
We need a fast Kokkos-enabled local QR decomposition in KokkosKernels. I've already talked to Siva about that. For efficiency and performance one probably needs some kind of batched local QR routines.
While that is true in the long term, in the short term I don't see why running on a single thread would be that much slower than a serial implementation.
In the NumPDE>1 version of CoalesceDropFactory we need a local sort. Again, since there is nothing like that in KokkosKernels, I implemented a simple and slow bubble sort. We cannot call std::sort within kernels, and we cannot use Kokkos::sort, as that implements a plain sort (not an aggregate-wise local sort). We would need something like a sort2 that sorts by two keys.
I don't think you have to. At the moment, it looks like you merge individual rows one by one and sort simultaneously. You could do it the same way as the original (non-Kokkos) factory: first merge all dofs into a single array and then sort.
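A sketch of that "merge first, sort once" structure (assumed names, not the factory code): gather the dofs of all rows of a block node into one buffer, then do a single sort-and-unique pass rather than sorting while merging row by row.

```cpp
// Hypothetical sketch of "gather all dofs, then sort once per block node".
#include <algorithm>
#include <vector>

std::vector<int> gatherThenSort(const std::vector<std::vector<int>>& blockRows) {
  std::vector<int> cols;
  for (const auto& row : blockRows)
    cols.insert(cols.end(), row.begin(), row.end());              // merge all dofs first
  std::sort(cols.begin(), cols.end());                            // one sort per block node
  cols.erase(std::unique(cols.begin(), cols.end()), cols.end());  // drop duplicates
  return cols;
}
```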
@pwxy I have no idea what the roadmap is to fix these performance issues.
@csiefer2 I think most of the immediate performance problems are fixed (at least for the scalar case), right? Any new results?
I didn't address the sort issue at all, but fixed other stuff in Phase-1 Uncoupled.
Perhaps a rerun of the scaling test would be in order...
Interested parties could try the current kokkos branch in aprokop/Trilinos.
Uncoupled aggregation is threaded (thanks to @lucbv's hard work).
Trilinos was built with the Kokkos OpenMP node type. I looked at thread scaling on a single KNL, running with 1 MPI process and increasing the number of threads from 1 to 64 (so one thread per core).
Although the matrix-matrix multiplication (KokkosKernels SpGEMM) thread-scales well, the uncoupled aggregation does not thread-scale at all. For 1 thread, uncoupled aggregation accounts for 1/3 of the MueLu setup time, and 40% of the uncoupled aggregation time (so 1/8 of the MueLu setup time) is due to the use of std::sort, which is not threaded (either line 916 or line 999, or both, of muelu/src/Graph/MueLu_CoalesceDropFactory_def.hpp). As the number of threads increases, and because RAP thread-scales well, std::sort becomes a larger fraction of the time. By the time we get to 64 threads, matrix-matrix multiplication is only ~7% of the MueLu setup time and uncoupled aggregation is 90% of it.
times in seconds

t    MLsetup  MLrap   MLagg  MLcoal  MLagg1  MLagg2a  MLagg2b
1    259.4    174.9   78.2   58.2    8.0     3.8      6.5
2    173.0     89.9   78.1   58.1    8.0     3.8      6.5
4    130.1     47.4   78.2   58.2    8.0     3.8      6.5
8    107.6     25.5   78.2   58.2    8.0     3.8      6.5
16    96.4     14.4   78.2   58.2    8.0     3.8      6.5
32    90.8      9.0   78.2   58.2    8.0     3.8      6.5
64    88.0      6.2   78.2   58.2    8.0     3.8      6.5

legend:
t       = number of threads
MLsetup = "MueLu Hierarchy: Setup (total)"
MLrap   = "MueLu RAPFactory: Computing Ac" (not "total")
MLagg   = "UncoupledAggregationFactory: Build (total)"
MLcoal  = "MueLu CoalesceDropFactory: Build" (not "total")
MLagg1  = "MueLu AggregationPhase1Algorithm: BuildAggregates"
MLagg2a = "MueLu AggregationPhase2aAlgorithm: BuildAggregates"
MLagg2b = "MueLu AggregationPhase2bAlgorithm: BuildAggregates"
What would be the schedule for threading uncoupled aggregation?