Closed spdomin closed 3 years ago
There was a subtle Zoltan2 bug that expressed itself either as a hang or a segfault in an MPI_Allreduce. It was fixed in March. If you have an old Trilinos, perhaps it is the cause of your trouble.
@kddevin, thank you for the info. I am rebuilding the code base with the latest Trilinos version and will report back ASAP.
@trilinos/muelu
Okay, my latest 1000-node KNL element-based job is also hanging:
Nalu reports the following in the log file: Trilinos 12.17-g7396a09ea6
As noted before, the pure edge-based scheme runs, however, the timings look way, way off. For example, the reported setup time was 120 seconds (luckily performed only once in this static mesh configuration), while taking 20 time steps took eight hours (the job timed out).
Any help on the hangs will be really appreciated. In the meantime, I am moving to test the new executable on an open-science run made three years ago. I have the scaling data and want to perform a sanity check on performance.
Best,
@kddevin wrote:
There was a subtle Zoltan2 bug that expressed itself either as a hang or a segfault in an MPI_Allreduce. It was fixed in March. If you have an old Trilinos, perhaps it is the cause of your trouble.
@kddevin Since MueLu's setup is finishing (I think, though I haven't seen the actual output yet), I don't think this is the problem.
We have backed off from the production DNS mesh to attempt running a 1000-node (KNL) wall-resolved LES mesh. This mesh is just under 1 B elements/nodes (low-order Hex8). Sadly, this mesh shows the same behavior reported above for the production DNS mesh.
Our wall-modeled LES simulations, which use ~150 million elements, are running fine - both at the normal node count and at the 1000-node count.
In all cases, the XML file is the same.
Any advice on how to proceed will be most welcome. I plan on turning on a trace to see whether it provides any info. However, I think that once we call myEQS->solve(), it's out of our hands.
Are there any Belos or internal Trilinos traces that could be activated in addition to the MueLu high-verbosity setting?
@spdomin Could you please set export TPETRA_DEBUG=1; export TPETRA_VERBOSE=1
in your run scripts? This should enable more verbose output in Tpetra without requiring recompilation.
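For example, these settings might be added to a batch script along the following lines (the launch line is a placeholder for illustration, not taken from this thread):

```shell
# Enable extra Tpetra diagnostics at run time; no recompilation needed.
export TPETRA_DEBUG=1
export TPETRA_VERBOSE=1

# Placeholder launch line for illustration only; substitute the actual
# MPI launcher and Nalu executable/input used on your machine.
# srun -N 1000 -n 64000 ./naluX -i input.yaml
echo "TPETRA_DEBUG=$TPETRA_DEBUG TPETRA_VERBOSE=$TPETRA_VERBOSE"
```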
Will do. I should have results by the time we meet tomorrow. Thank you for the suggestion.
@spdomin I replied to your email. I see why the hang is occurring (MueLu is generating 10 levels, the coarsest matrix has 413K rows, and the coarsest level smoother is a direct solve). MueLu is coarsening really slowly because of a large number of singletons. The additional option
<Parameter name="repartition: target rows per proc" type="int" value="10000"/>
might help. It could also be that the drop tolerance is incorrect.
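For concreteness, a sketch of where this entry might sit in the MueLu sublist of the solver XML (the surrounding list name and the "repartition: enable" line are assumptions for illustration; only the target-rows line comes from the suggestion above):

```xml
<ParameterList name="MueLu">
  <!-- assumed for illustration: repartitioning is already enabled -->
  <Parameter name="repartition: enable" type="bool" value="true"/>
  <!-- suggested addition: keep coarse-level matrices from being spread
       too thinly across MPI ranks during repartitioning -->
  <Parameter name="repartition: target rows per proc" type="int" value="10000"/>
</ParameterList>
```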
Good news on the HSW partition... My cases are running with the latest XML. We can chat offline today about other specifics.
Not ready to close; however, this is promising.
My KNL and HSW jobs are running now with the new input file. Thus far, I have ~10,000 steps on the DNS case. I also ran this case with the drop tolerance between 0.02 and 0.005 with and without distance Laplacian. I am also running the input.xml case with a lower drop (from 0.02 to 0.01) with and without distance Laplacian.
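For reference, the drop-tolerance sweep corresponds to MueLu entries along these lines (parameter names should be checked against the MueLu user's guide for the Trilinos version in use; the values shown are just one point in the sweep):

```xml
<!-- sketch: vary the tolerance over the 0.005-0.02 range tested above -->
<Parameter name="aggregation: drop tol" type="double" value="0.01"/>
<!-- sketch: "distance laplacian" selects distance-Laplacian dropping;
     "classical" corresponds to the runs without it -->
<Parameter name="aggregation: drop scheme" type="string" value="classical"/>
```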
I will report back on all of these numbers and then we can close the case.
My P2 14 billion node mesh case is failing in EquationSystem::initialize() due to a memory issue. I cranked up the node count to 6000 KNL nodes (64 ranks/node). That should have been plenty.
@spdomin I'm quite interested in the results of your drop tolerance exploration, thanks for doing that and for the update.
Data still in development. :)
However, my 16 billion node mesh (Hex27) is failing in a manner similar to the coarse mesh. This case is using the input.xml file from above. I re-launched with more diagnostics.
Data clearly shows that a drop tolerance of 0.01 without distance Laplacian is best for my baseline production DNS.
However, the same input XML fails in my production DNS.
@jhux2, any suggestions would be welcome.
@spdomin Please remind me, what is the difference between "baseline production" and "production"? Is it mesh size and core count, or something else? Thanks.
My baseline production DNS is ~2 billion Hex8 elements. This case is running now with the input.xml. The production DNS is a promotion of the Hex8 mesh to Hex27 - still ~2 billion elements, but now with ~14 billion nodes. This case continues to fail despite every modification to the input.xml (drop tolerance changes, distance Laplacian, etc.). Without the production case successfully running, my study is in jeopardy.
Stefan, you've said that the failure doesn't reproduce for smaller meshes. Have you noticed whether the threshold between working and failing occurs at the 2B level? Just wondering if something is 32-bit limited, or if it's some unrelated issue.
I have successfully re-run the Trinity Open Science case - which is a 6 billion element simulation. There, I see success - although the solver stack seems 2x slower than the last time I benchmarked this case three years ago.
@alanw0, also, I successfully promoted my baseline LES Hex8 mesh to the Hex27 mesh and this case worked fine. I think I have confidence in the Hex27 DNS mesh provided from mesh_adapt - especially since the momentum solve proceeds without any obvious issue.
@spdomin wrote:
although the solver stack seems 2x slower than the last time I benchmarked this case three years ago.
Do you have more specific data? Is it assembly, solver setup, actual solves, or all of the above?
@spdomin Is Hex27 a P2 element?
Yes, Hex27 is a P=2 quadratic element. We have run this element type in the past - also at a scale very near what I am running now.
As for data as per @mhoemmen 's question, all I have now is a set of log files from three years ago that captured a high-level performance snapshot for our TOS case. Moreover, the data was from a restart that used a RANS solution as an IC. The study was not formal and, as such, does not include things like statistics.
Do you (or the MueLu team - or the Ifpack2/Tpetra team) have current scaling and performance data for solves at a reasonable scale? If so, I would trust that data more. The last formal study that I conducted compared ML and MueLu and showed a speed-up in the edge-based solve and a slow-down in the element-based solve.
However, my greater worry is my ability to push through the large-scale production runs. My in-house Nalu memory checker did not seem to show anything excessive. Note that the memory diagnostic has data for the 3x3 momentum system.
@spdomin The MueLu team definitely has gathered performance data at scale, at various times. This is why the 2x figure surprised me. I haven't seen any other data to suggest this, so I wondered if it was a problem-specific effect.
Let me try to gather more statistics on this Trinity Open Science run once my current project is off and running.
Update: My P=2 simulation was successful on 6000 KNL nodes (64 MPI ranks/node) using the drop tolerance of 0.01 and no distance Laplacian. This is a good step.
The case was struggling with the continuity solves (taking all 50 of the max iterations).
The job eventually died due to a node failure, so I am re-launching to see if this success is reproducible.
Update:
a) My P=2 case continues to hang despite nearly every XML option I have tried.
b) A second DNS of a pipe flow is also hanging in MueLu. This mesh is much smaller (15 million elements, Hex8) and shows repeated failure on 10 nodes of Eclipse. I have written @jhux2 privately to point him to the mesh and input file. I am not sure that these are related; however, it is suspicious enough to warrant a dive on this case to see if it teaches us anything about why the production runs remain broken.
@spdomin The pipe flow appears to be hanging in repartitioning during AMG setup. If I understand correctly, in the larger simulations, the linear solve is stagnating.
Let's see if @trilinos/zoltan2 might have any debugging suggestions.
How about setting
repartition: target rows per proc = 10000
That might make it easier to solve the partitioning problem.
Yes, @jhux2 suggested this 22 days ago (sorry, I do not see an easy way to link to the earlier comment):
<Parameter name="repartition: target rows per proc" type="int" value="10000"/>
I suggested that option to make the coarsening more effective and reduce the overall number of levels. It might help here, but this smells more like a deadlock.
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open, please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits, you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.
begone autobot
This is still an open issue; my project has simply worked around the complexity by dropping this study. @jhux2 or @ccober6, I would be happy if you could please capture this requirement somewhere in the Trilinos requirements process.
With best wishes,
Bug Report
@trilinos/muelu
Description
I am running large-scale low-Mach runs on both KNL and HSW at the ~500+ node count. For KNL, I use 64 ranks/node; on HSW, 32. Threading is not active; however, SIMD is.
My ASC V&V and LSCI case is a non-isothermal flow. After a successful momentum solve using SGS/GMRES, the MueLu portion hangs just after the banner report for the initialization time (here, 62 seconds). For debug purposes, I ran the edge-based scheme at 1/2 the node count of the element-based run. The edge-based scheme does not hang (on the same mesh).
Steps to Reproduce
I was a bit surprised by the older version of Trilinos (I inherited the build script). As such, I am re-building Nalu with a newer version, although 12.13 seems like a valid released version.
I will need to take this discussion privately to my SNL account, however, I wanted to capture the ticket.
The xml file appears below.
Best,