Closed spdomin closed 3 years ago
There was a subtle Zoltan2 bug that expressed itself either as a hang or a segfault in an MPI_Allreduce. It was fixed in March. If you have an old Trilinos, perhaps it is the cause of your trouble.
@kddevin, thank you for the info. I am rebuilding the code base with the latest Trilinos version and will report back ASAP.
@trilinos/muelu
Okay, my latest 1000-node KNL element-based job is also hanging:
Nalu reports the following in the log file: Trilinos 12.17-g7396a09ea6
As noted before, the pure edge-based scheme runs, however, the timings look way, way off. For example, the reported setup time was 120 seconds (luckily performed only once in this static mesh configuration), while taking 20 time steps took eight hours (the job timed out).
Any help on the hangs will be really appreciated. In the meantime, I am moving to test the new executable on an open-science run made three years ago. I have the scaling data and want to perform a sanity check on performance.
Best,
@kddevin wrote:
There was a subtle Zoltan2 bug that expressed itself either as a hang or a segfault in an MPI_Allreduce. It was fixed in March. If you have an old Trilinos, perhaps it is the cause of your trouble.
@kddevin Since MueLu's setup is finishing (I think, though I haven't seen the actual output yet), I don't think this is the problem.
We have backed off from the production DNS mesh to attempt running a 1000-node (KNL) wall-resolved LES mesh. This mesh is just under 1 B elements/nodes (low-order Hex8). Sadly, this mesh shows the same behavior reported above for the production DNS mesh.
Our wall-modeled LES simulations, which use ~150 million elements, are running fine - both at the normal node count and at the 1000-node count.
In all cases, the XML file is the same.
Any advice on how to proceed will be most welcome. I plan on turning on a trace to see whether it provides any info. However, I think that once we call myEQS->solve(), it's out of our hands.
Are there any Belos or internal Trilinos traces that could be activated in addition to the MueLu high-verbosity setting?
@spdomin Could you please set export TPETRA_DEBUG=1; export TPETRA_VERBOSE=1
in your run scripts? This should enable more verbose output in Tpetra without requiring recompilation.
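For example, these settings might be added to a batch script along the following lines (the launch line is a placeholder for illustration, not taken from this thread):

```shell
# Enable extra Tpetra diagnostics at run time; no recompilation needed.
export TPETRA_DEBUG=1
export TPETRA_VERBOSE=1

# Placeholder launch line for illustration only; substitute the actual
# MPI launcher and Nalu executable/input used on your machine.
# srun -N 1000 -n 64000 ./naluX -i input.yaml
echo "TPETRA_DEBUG=$TPETRA_DEBUG TPETRA_VERBOSE=$TPETRA_VERBOSE"
```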
Will do. I should have results by the time we meet tomorrow. Thank you for the suggestion.
@spdomin I replied to your email. I see why the hang is occurring (MueLu is generating 10 levels, the coarsest matrix has 413K rows, and the coarsest level smoother is a direct solve). MueLu is coarsening really slowly because of a large number of singletons. The additional option
<Parameter name="repartition: target rows per proc" type="int" value="10000"/>
might help. It could also be that the drop tolerance is incorrect.
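For concreteness, a sketch of where this entry might sit in the MueLu sublist of the solver XML (the surrounding list name and the "repartition: enable" line are assumptions for illustration; only the target-rows line comes from the suggestion above):

```xml
<ParameterList name="MueLu">
  <!-- assumed for illustration: repartitioning is already enabled -->
  <Parameter name="repartition: enable" type="bool" value="true"/>
  <!-- suggested addition: keep coarse-level matrices from being spread
       too thinly across MPI ranks during repartitioning -->
  <Parameter name="repartition: target rows per proc" type="int" value="10000"/>
</ParameterList>
```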
Good news on the HSW partition... My cases are running with the latest XML. We can chat offline today about other specifics.
Not ready to close; however, this is promising.
My KNL and HSW jobs are running now with the new input file. Thus far, I have ~10,000 steps on the DNS case. I also ran this case with the drop tolerance between 0.02 and 0.005 with and without distance Laplacian. I am also running the input.xml case with a lower drop (from 0.02 to 0.01) with and without distance Laplacian.
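For reference, the drop-tolerance sweep corresponds to MueLu entries along these lines (parameter names should be checked against the MueLu user's guide for the Trilinos version in use; the values shown are just one point in the sweep):

```xml
<!-- sketch: vary the tolerance over the 0.005-0.02 range tested above -->
<Parameter name="aggregation: drop tol" type="double" value="0.01"/>
<!-- sketch: "distance laplacian" selects distance-Laplacian dropping;
     "classical" corresponds to the runs without it -->
<Parameter name="aggregation: drop scheme" type="string" value="classical"/>
```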
I will report back on all of these numbers and then we can close the case.
My P2 14 billion node mesh case is failing in EquationSystem::initialize() due to a memory issue. I cranked up the node count to 6000 KNL nodes (64 ranks/node). That should have been plenty.
@spdomin I'm quite interested in the results of your drop tolerance exploration, thanks for doing that and for the update.
Data still in development. :)
However, my 16 billion node mesh (Hex27) is failing in a manner similar to the coarse mesh. This case is using the input.xml file from above. I re-launched with more diagnostics.
Data clearly shows that a drop tolerance of 0.01 without distance Laplacian is best for my baseline production DNS.
However, the same input XML fails in my production DNS.
@jhux2, any suggestions would be welcome.
@spdomin Please remind me, what is the difference between "baseline production" and "production"? Is it mesh size and core count, or something else? Thanks.
My baseline production DNS is ~2 billion Hex8 elements. This case is running now with the input.xml. The production DNS is a promotion of the Hex8 mesh to Hex27 - still ~2 billion elements, but now with ~14 billion nodes. This case continues to fail despite every modification to the input.xml (drop tolerance changes, distance Laplacian, etc.). Without the production case successfully running, my study is in jeopardy.
Stefan, you've said that the failure doesn't reproduce for smaller meshes. Have you noticed whether the threshold between working and failing occurs at the 2B level? Just wondering if something is 32-bit limited, or if it's some unrelated issue.
I have successfully re-run the Trinity Open Science case - which is a 6 billion element simulation. There, I see success - although the solver stack seems 2x slower than the last time I benchmarked this case three years ago.
@alanw0, also, I successfully promoted my baseline LES Hex8 mesh to the Hex27 mesh and this case worked fine. I think I have confidence in the Hex27 DNS mesh provided from mesh_adapt - especially since the momentum solve proceeds without any obvious issue.
@spdomin wrote:
although the solver stack seems 2x slower than the last time I benchmarked this case three years ago.
Do you have more specific data? Is it assembly, solver setup, actual solves, or all of the above?
@spdomin Is Hex27 a P2 element?
Yes, Hex27 is a P=2 quadratic element. We have run this element type in the past - also at a scale very near what I am running now.
As for data as per @mhoemmen 's question, all I have now is a set of log files from three years ago that captured a high-level performance snapshot for our TOS case. Moreover, the data was from a restart that used a RANS solution as an IC. The study was not formal and, as such, does not include things like statistics.
Do you (or the MueLu team - or the Ifpack2/Tpetra team) have current scaling and performance data for solves at a reasonable scale? If so, I would trust that data more. The last formal study that I conducted compared ML and MueLu and showed a speed-up in the edge-based solve and a slow-down in the element-based solve.
However, my greater worry is my ability to push through the large-scale production runs. My in-house Nalu memory checker did not seem to show anything excessive. Note that the memory diagnostic has data for the 3x3 momentum system.
@spdomin The MueLu team definitely has gathered performance data at scale, at various times. This is why the 2x figure surprised me. I haven't seen any other data to suggest this, so I wondered if it was a problem-specific effect.
Let me try to gather more statistics on this Trinity Open Science run once my current project is off and running.
Update: My P=2 simulation was successful on 6000 KNL nodes (64 MPI ranks/node) using the drop tolerance of 0.01 and no distance Laplacian. This is a good step.
The case was struggling with the continuity solves (taking all 50 of the max iterations).
The job eventually died due to a node failure, so I am re-launching to see if this success is reproducible.
Update:
a) My P=2 case continues to hang despite nearly every XML option I have tried.
b) A second DNS of a pipe flow is also hanging in MueLu. This mesh is much smaller (15 million elements, Hex8) and shows repeated failure on 10 nodes of Eclipse. I have written @jhux2 privately to point him to the mesh and input file. I am not sure that these are related; however, it is suspicious enough to warrant a dive on this case to see if it teaches us anything about why the production runs remain broken.
@spdomin The pipe flow appears to be hanging in repartitioning during AMG setup. If I understand correctly, in the larger simulations, the linear solve is stagnating.
Let's see if @trilinos/zoltan2 might have any debugging suggestions.
How about setting
repartition: target rows per proc = 10000
That might make it easier to solve the partitioning problem.
Yes, @jhux2 suggested this 22 days ago (sorry, I do not see an easy way to link to the earlier comment):
<Parameter name="repartition: target rows per proc" type="int" value="10000"/>
I suggested that option to make the coarsening more effective and reduce the overall number of levels. It might help here, but this smells more like a deadlock.
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open, please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits, you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.
begone autobot
This is still an open issue; my project has simply worked around the complexity by dropping this study. @jhux2 or @ccober6, I would be happy if you could please capture this requirement somewhere in the Trilinos requirements process.
With best wishes,
Bug Report
@trilinos/muelu
Description
I am running large-scale low-Mach runs on both KNL and HSW at the ~500+ node count. For KNL, I use 64 ranks/node; on HSW, 32. Threading is not active; however, SIMD is.
My ASC V&V and LSCI case is a non-isothermal flow. After a successful momentum solve using SGS/GMRES, the MueLu portion hangs just after the banner report for the initialization time (here, 62 seconds). For debug purposes, I ran the edge-based scheme at 1/2 the node count of the element-based run. The edge-based scheme does not hang (on the same mesh).
Steps to Reproduce
I was a bit surprised by the older version of Trilinos (I inherited the build script). As such, I am re-building Nalu with a newer version, although 12.13 seems like a valid released version.
I will need to take this discussion privately to my SNL account, however, I wanted to capture the ticket.
The xml file appears below.
Best,