MTCam opened this issue 2 years ago
Do you have the full exception message? (This is just the traceback leading up to the `raise`, AFAICT.)
Do you have the full exception message? (This is just the traceback leading up to the `raise`, AFAICT.)
I could not find a `raise` in the batch job output. This appeared to be the only out-of-the-ordinary message when the job failed. Here's a gist with the entire output: https://gist.github.com/MTCam/f48991f584755b6a8530dd9345dc2de4
What's happening there is that the transform code is running into a pattern of loop nests that it's not familiar with; see this code in @kaushikcfd's branch:
The message I was looking for is

```
NotImplementedError: Cannot fit loop nest 'frozenset({'idof_ensm10'})' into known set of loop-nest patterns.
```

which seems very weird; it's a loop over only intra-element DOFs without a loop over elements surrounding it.
As a near-term workaround, you can just hack the code I linked to fall back to the more general single-grid work-balancing strategy in the compiled-function case as well.
Beyond that, I would like to understand how we wound up with a lonely DOF loop.
Thanks for the report.
For this case, it will hang for about 3 hours,
It's safe to call this an error. However, I would expect `cfd_rhs` to have the largest compilation time, and the log file says: `transform_loopy_program for 'cfd_rhs': completed (112.52s wall 1.00x CPU)`. Am I missing the kernel to blame here?
Beyond that, I would like to understand how we wound up with a lonely DOF loop.
I agree. @MTCam: Does the driver work correctly while running with lower #ranks?
Thanks for the report.
It's safe to call this an error. However, I would expect `cfd_rhs` to have the largest compilation time, and the log file says: `transform_loopy_program for 'cfd_rhs': completed (112.52s wall 1.00x CPU)`. Am I missing the kernel to blame here?
This compile exits abnormally after about 10 minutes. The actual, full compile should take more than an hour, as evidenced by the compile times for the lower rank runs: (nranks, compile time) = [(32, 733s), (64, 1448s), (128, 2455s)]
Beyond that, I would like to understand how we wound up with a lonely DOF loop.
I agree. @MTCam: Does the driver work correctly while running with lower #ranks?
Yes, the driver runs successfully for a number of different meshes up to 6M elements. The difference between this run and the previous working run is a factor of 2 more ranks and a factor of 2 more elements.
(nranks, compile time) = [(32, 733s), (64, 1448s), (128, 2455s)]
Not sure I'm convinced that should be the case. Synchronizations/filesystem issues come to mind. (@inducer any guesses why this might be happening?)
This timing looks suspiciously exponential in the number of ranks. I do not understand why that would be the case.
@MTCam:
The only well-justified cost growth in compilation would be linear in the max number of neighbors
Are you suggesting this might be because of the increased number of comm pytato nodes? If so, lines in the log file of the form `find_partition: Split 136095 nodes into 3 parts, with [41034, 8180, 88338] nodes in each partition.` would be helpful to validate that hypothesis for the 3 mesh sizes with increasing compile times.
Wow, not sure I'd seen that. 136,095 nodes? That's kinda nuts. There's no reason for there to be that many of them, IMO.
This timing looks suspiciously exponential in the number of ranks. I do not understand why that would be the case.
@MTCam:
- What exactly is being measured?
The timer starts at the first call to `logpyle.tick_before`. The time will include the compile of the RHS (and only the RHS) and the time of the first step. The step timings here are O(1s). @matthiasdiener can verify what is being timed for this.
- Do you happen to know the average/max number of neighbors per rank? The only well-justified cost growth in compilation would be linear in the max number of neighbors.
Not offhand, but I do know it isn't going crazy. I will add a snippet to give us the breakdown.
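A minimal sketch of such a breakdown snippet (the per-rank set of neighbor ranks is assumed to be available from the driver; the ring partitioning below is only a made-up placeholder for it):

```python
from mpi4py import MPI


def report_neighbor_counts(comm, neighbor_ranks):
    """Print (min, max, avg) of the number of neighboring ranks."""
    n_local = len(neighbor_ranks)
    n_min = comm.allreduce(n_local, op=MPI.MIN)
    n_max = comm.allreduce(n_local, op=MPI.MAX)
    n_avg = comm.allreduce(n_local, op=MPI.SUM) / comm.size
    if comm.rank == 0:
        print(f"nranks={comm.size}: nnbrs (min, max, avg) = ({n_min}, {n_max}, {n_avg:.1f})")


comm = MPI.COMM_WORLD
# placeholder: a ring partitioning in which every rank touches two neighbors;
# in the real driver this set would come from the mesh partition
neighbor_ranks = {(comm.rank - 1) % comm.size, (comm.rank + 1) % comm.size} - {comm.rank}
report_neighbor_counts(comm, neighbor_ranks)
```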
The only well-justified cost growth in compilation would be linear in the max number of neighbors
Are you suggesting this might be because of the increased number of comm pytato nodes? If so, lines in the log file of the form `find_partition: Split 136095 nodes into 3 parts, with [41034, 8180, 88338] nodes in each partition.` would be helpful to validate that hypothesis for the 3 mesh sizes with increasing compile times.
FWIW, the compile times are independent of mesh size. For a given RHS, the compile times increase with increasing numbers of ranks. This is not a new "feature" of distributed lazy. It has always been this way - although now it may be a little worse than before. I am not sure because we were never able to run our cases out this far before now.
This is not a new "feature" of distributed lazy. It has always been this way - although now it may be a little worse than before.
It's safe to call it a bug that ought to be fixed, since, purely on principle, we don't expect it. If you have access to the log files for the 3 mesh sizes, that would be helpful for debugging this behavior.
- Do you happen to know the average/max number of neighbors per rank? The only well-justified cost growth in compilation would be linear in the max number of neighbors.
@inducer: I am going to post a running report on the number of rank neighbors for runs. As the data comes in, I'll edit this msg. I hope it can help us suss out whether the number of partition boundaries is playing a significant role.

| nranks | nnbrs (min, max, avg) |
|---|---|
| 1 | (0, 0, 0) |
| 2 | (1, 1, 1) |
| 4 | (3, 3, 3) |
| 8 | (7, 7, 7) |
| 16 | (10, 11, 10.9) |
| 32 | (10, 15, ) |
| 64 | (11, 19, ) |
| 128 | (12, 19, ) most with 17, 18 |
If you have access to the log files for the 3 mesh sizes that would be helpful to debug this behavior.
- 1.5M elems, 32 ranks: https://gist.github.com/db1139f6db451f0cd921fab6a6d69223
- 3M elems, 64 ranks: https://gist.github.com/07237afdfd5a561f250de8f1cb0244a9
- 6M elems, 128 ranks: https://gist.github.com/b9518b0f26d4a5269d5bf4082c76f9a5

Hope this helps! Thanks a ton for looking into this!
Thanks, so here is the ranks vs number of pytato nodes:
| nranks | max. num pytato nodes across ranks |
|---|---|
| 32 | 40703 |
| 64 | 76475 |
| 128 | 106285 |
| 256 | 136095 |
So, at least it's not growing linearly, and it starts tapering off once our number of neighboring ranks starts to taper off. So @inducer's hunch was correct, but I'm very surprised that communication nodes outgrow the total number of nodes on a single rank. Either something in `grudge` is adding redundant nodes or our materialization strategy in pytato is broken.
Because of these many nodes, we materialize more arrays, and the generated loopy kernel has an increasing number of statements.

| nranks | max. num lpy insns across ranks and parts | max. time for `transform_loopy_program` |
|---|---|---|
| 32 | 1710 | 145 secs |
| 64 | 3204 | 435 secs |
| 128 | 4449 | 807 secs |
Either something in `grudge` is adding redundant nodes or our materialization strategy in pytato is broken.
I completely agree! FWIW, I think that's also consistent with where we left off in terms of the mysteriously-large sum of flux expressions in July. I strongly suspect that something there is causing us to do way more work than necessary. Indirectly, I think this is also what caused the unreasonably-large argument counts. I tried to figure out for a while what that might be, but I ran out of time before my trip. To allow us to run at all, we decided to pursue SVM, but I think the underlying issue remains.
- Do you happen to know the average/max number of neighbors per rank? The only well-justified cost growth in compilation would be linear in the max number of neighbors.
@inducer, @kaushikcfd You guys are amazing. Fixing the number of rank neighbors to 2 gives a ~fixed compile time regardless of nranks, and excellent weak scaling.
| nranks | compile time (s) | walltime/RHS (s) |
|---|---|---|
| 1 | 282 | 0.18 |
| 2 | 282 | 0.182 |
| 4 | 368.1 | 0.218 |
| 8 | 368.5 | 0.219 |
| 16 | 373.9 | 0.224 |
| 32 | 374 | 0.227 |
| 64 | 372.8 | 0.223 |
Potentially we can do this with our production/prediction cases by slicing them in the flow-normal direction.
Heh. I'm glad that works, but it's not exactly a practical constraint. :) I.e., we shouldn't have to do that. But if it serves as a stopgap for a bit, I guess I could live with that.
Looking at this with @matthiasdiener, we were able to reproduce an allegedly pathological scaling for the DAG using the `wave-op-mpi.py` example in grudge.
For the wave example in mirgecom (and the corresponding grudge example):

| nranks | wave@mirgecom | wave@grudge |
|---|---|---|
| 1 | 168 | 166 |
| 2 | 253 | 250 |
| 4 | 407 | 402 |
| 8 | 554 | |
| 16 | 648 | |
A DAG viz can be found here, where we're tracking the ongoing effort. We plan on creating an even simpler example to get more insight about why the DAG is scaling this way.
You're right, that's a healthy amount of growth even just for the wave operator.
I find that the DAG pictures have low information density, i.e. it's hard to "see the forest", because the nodes are so big and so unevenly sized. I think we can help ourselves by making it so we can see more of the connectivity in one screenful.
You're right, that's a healthy amount of growth even just for the wave operator.
I find that the DAG pictures have low information density, i.e. it's hard to "see the forest", because the nodes are so big and so unevenly sized. I think we can help ourselves by making it so we can see more of the connectivity in one screenful.
Even our super simple grad-only DG operator produced a pretty big graph with 58 nodes. So we rolled a very simple non-DG example that I think illustrates what's going on. Please have a look at the bottom of the running notes for a summary: https://notes.tiker.net/t96b1_VPRdSeWkzJfuRmQw?both
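A generic pytato sketch, separate from the example in the notes, of how applying the same flux-like function once per received array duplicates its subexpression in the DAG, while concatenating first builds it only once (`fake_flux` and the `recv_*` placeholders are made-up stand-ins):

```python
import numpy as np
import pytato as pt


def fake_flux(u):
    # stand-in for a flux computation; every call builds a fresh subexpression
    return 0.5 * (u * u + 2.0 * u)


# one "received" placeholder per neighboring rank
recvd = [pt.make_placeholder(f"recv_{i}", shape=(1000,), dtype=np.float64)
         for i in range(8)]

# applying the flux separately to each neighbor replicates the flux subgraph
separate = sum(fake_flux(u) for u in recvd)
print("per-neighbor flux:", pt.analysis.get_num_nodes(separate))

# concatenating first means the flux subgraph appears only once
glued = pt.concatenate(recvd)
print("concatenated flux:", pt.analysis.get_num_nodes(fake_flux(glued)))
```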
To be clear, some level of duplication in the DAG is expected because we do not carry a global "registry" of constant arrays (or data wrappers in pytato-speak). So every time pytato sees a constant (like an array used for indirect indexing, or an operator matrix), it views it as new and makes a separate graph initially. Close to the first step in the transform pipeline, this redundancy is removed.

To see what this will do, simply feed the DAG through `deduplicate_data_wrappers` and visualize it after that. If you still observe more nodes than you expect after that, then the issue you see is likely to be real.
To be clear, some level of duplication in the DAG is expected because we do not carry a global "registry" of constant arrays (or data wrappers in pytato-speak). So every time pytato sees a constant (like an array used for indirect indexing, or an operator matrix), it views it as new and makes a separate graph initially. Close to the first step in the transform pipeline, this redundancy is removed.

To see what this will do, simply feed the DAG through `deduplicate_data_wrappers` and visualize it after that. If you still observe more nodes than you expect after that, then the issue you see is likely to be real.
The number of graph nodes reported, and created by the examples that reproduce and demonstrate the issue, is unchanged by the `deduplicate_data_wrappers` utility. I guess we are probably reporting them after such duplications are eliminated.
I've been using this `compile_trace_callback` function:
```python
import pytato as pt


def my_ctc(prg_id, stage, prg):
    from pytato.transform import deduplicate_data_wrappers
    if stage == "pre_find_distributed_partition":
        print(f"NUM DAG NODES BEFORE DeDupe: {pt.analysis.get_num_nodes(prg)}")
        prg_dedupe = deduplicate_data_wrappers(prg)
        print(f"NUM DAG NODES AFTER DeDupe: {pt.analysis.get_num_nodes(prg_dedupe)}")
        pt.show_dot_graph(prg_dedupe)
        1/0  # deliberately abort after dumping the graph
```
I think that the example at the bottom of the notes demonstrates that the observed DAG scaling with BCs and with partition boundaries is actually expected. The "sub-dag" that represents the flux calculations is duplicated with every unique BC or partition boundary. Since communication/recv dag nodes are unique for each mpi neighbor, the flux calculation dag nodes are repeated for each mpi neighbor. We are struggling to come up with a viable plan for what to do about it.
I agree with you; we certainly splat a copy of the facial flux into the DAG once for each parallel boundary. I think there are probably other less-than-ideal things we're doing, but that one seems sufficiently straightforward to fix. The duplication of DAG nodes can be avoided by concatenating (via `actx.np.concatenate`) all the flux data into one array per "input", applying the flux function to all those arrays in one big batch, and then sawing them back apart into appropriately-sized arrays. The best bit about this is that you can try this in mirgecom right now, without infrastructure help.
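A minimal sketch of this concatenate/compute/split pattern, written against plain numpy (the `xp` argument stands in for the array namespace; in mirgecom it would presumably be `actx.np`, per the suggestion above; `batched_flux` and the toy flux function are made-up names):

```python
import numpy as np


def batched_flux(flux_fn, per_boundary_arrays, xp=np):
    """Apply flux_fn once to the concatenation of all per-boundary arrays,
    then slice the result back into per-boundary pieces."""
    sizes = [ary.shape[0] for ary in per_boundary_arrays]
    glued = xp.concatenate(per_boundary_arrays)   # one array per "input"
    glued_flux = flux_fn(glued)                   # flux expression built once
    offsets = np.cumsum([0] + sizes)
    return [glued_flux[offsets[i]:offsets[i+1]] for i in range(len(sizes))]


# toy usage: three "boundaries" with different numbers of face DOFs (one field)
pieces = [np.random.rand(n) for n in (10, 7, 13)]
fluxes = batched_flux(lambda u: 0.5*u*u, pieces)
assert [f.shape for f in fluxes] == [p.shape for p in pieces]
```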
I can see two possible gotchas, but maybe @kaushikcfd can help with those:

- (...) `ImplStored` tag on the glued array? (https://github.com/inducer/pytato/issues/362)
- The "gluing" also needs to happen to things that the "user" (you, @MTCam :) doesn't consider inputs, but pytato does, i.e. things like normals and index arrays.

(...) but that one seems sufficiently straightforward to fix.
This is great. My attempts so far have failed.
The duplication of DAG nodes can be avoided by concatenating (via `actx.np.concatenate`) all the flux data into one array per "input", applying the flux function to all those arrays in one big batch, and then sawing them back apart into appropriately-sized arrays. The best bit about this is that you can try this in mirgecom right now, without infrastructure help.
After Friday's discussion, we came up with a short list of things to try. I had thought that projecting all of the received data to a surface that was all of the remote partition boundaries union'd together would resolve this (duplicating not the flux calc, but instead just the projection), but @majosm suggested that just projecting to "all_faces" would be easier and quicker to try. I tried this, but it did not seem to resolve the issue. I think lazy was too smart and saw right through this attempt to hide where the data came from.
I hope this concatenation strategy works better. Based on the failed attempt to resolve by projecting to "all_faces", do you still expect concatenating the arrays will work around the issue?
What about "evaluating" each received array as a part of the "communication dag node"? Would that help?
I can see two possible gotchas, but maybe @kaushikcfd can help with those: (...)
- The "gluing" also needs to happen to things that the "user" (you, @MTCam :) doesn't consider inputs, but pytato does, i.e. things like normals and index arrays.
Got it. The normals I think seem straightforward, but do we explicitly handle index arrays? I assume you are talking about index arrays down in the DOFArray or other discretization data structure, beyond those that would be updated automatically by `concatenate` (if any). Can you or @kaushikcfd give some more specific hints about which index arrays we'll need to pay attention to and where they live?
I was going through MikeC's driver with a single $\nabla\cdot u$ and tried running it over 16 ranks, for which Metis gives the following partitioning (partitioning figure omitted), and we get the following RHS kernel for each rank: https://gist.github.com/kaushikcfd/13dcb39130c7e8f5195c224b751a8b90. The generated kernel seems very close to what one would write up by hand (at least once we have fixes for https://github.com/inducer/meshmode/pull/354 and https://github.com/inducer/meshmode/issues/355).
One odd characteristic of the generated code is the way the communicated values are handled. It is currently handled as:

1. Facial values are received from a neighbor. This is received as an array per field per neighbor.
2. Flux values are computed for each neighboring boundary, interior faces, and boundary faces, and stored into a separate array for each field.
3. The flux contributions from (2) are added up.

The problem is in step (2), where the flux computation for each neighboring rank is "duplicated". If one were to implement this by hand, one would concatenate all the flux values contributing to the interior faces and perform the numerical flux computation using this single concatenated array per field (instead of one for each of the $N_{\text{field}} \cdot N_{\text{bdry}}$ arrays that we currently have).
The current expression structure is not only adding too many nodes in the graph but also forcing the computation of smaller-sized arrays (per-boundary facial values), which is a poor workload for wide SIMD devices.

I'm guessing this demands some changes in `grudge.op.interior_trace_pairs`.
(MikeC's `my_add` code from the notes also highlights the problem, but in that case, it is not trivial to concatenate the intermediate arrays to use the same subgraphs. However, this is not the case for DG.)
One odd characteristic of the generated code is the way the communicated values are handled. It is currently handled as:

1. Facial values are received from a neighbor. This is received as an array per field per neighbor.
2. Flux values are computed for each neighboring boundary, interior faces, and boundary faces, and stored into a separate array for each field.
3. The flux contributions from (2) are added up.

The problem is in step (2), where the flux computation for each neighboring rank is "duplicated". If one were to implement this by hand, one would concatenate all the flux values contributing to the interior faces and perform the numerical flux computation using this single concatenated array per field (instead of one for each of the $N_{\text{field}} \cdot N_{\text{bdry}}$ arrays that we currently have).

The current expression structure is not only adding too many nodes in the graph but also forcing the computation of smaller-sized arrays (per-boundary facial values), which is a poor workload for wide SIMD devices.

I'm guessing this demands some changes in `grudge.op.interior_trace_pairs`.
One possible caveat: concatenating all of the arrays and computing the fluxes together would (I think) remove the ability to overlap communication/computation here. Not sure if this is something distributed lazy currently does/will do?
Also, if I understand correctly, this is close to what @MTCam tried by projecting all of the trace pairs to `"all_faces"` and summing before passing them to the flux computation. Weird that it didn't seem to help. Though some work is done in `interior_trace_pairs` after receiving (`opposite_face_connection`, etc.), not sure how much that contributes.
Do we have a sense of what a reasonable number of nodes per flux computation should be? >5000 per trace pair (rough guess) seems like a lot?
(I think) remove the ability to overlap communication/computation here
The restructuring I was pointing to comes after the facial values are received, so I think it shouldn't hurt it.
Do we have a sense of what a reasonable number of nodes per flux computation should be? >5000 (rough guess) seems like a lot?
I think that with every new neighbor there should be <10 extra nodes in the DAG per flux component: another `actx.np.where`, ielement indirection, idof indirection, comm node, extra data wrappers.
(I think) remove the ability to overlap communication/computation here
The restructuring I was pointing to comes after the facial values are received, so I think it shouldn't hurt it.
I share @majosm's concern here. The duplication of the flux computation is what allows the fluxes to be computed from some boundaries before communication has completed on all boundaries. Requiring all communication to complete before computing fluxes seems to work counter to an important advantage/feature of lazy.
Do we have a sense of what a reasonable number of nodes per flux computation should be? >5000 (rough guess) seems like a lot?
I think that with every new neighbor there should be <10 extra nodes in the DAG per flux component: another `actx.np.where`, ielement indirection, idof indirection, comm node, extra data wrappers.
I have not looked for the prediction driver yet, but for the example grad(u)-only operator, the number of DAG nodes for the flux computation is ~40. For a CNS fluid-only case it was ~1500. IIRC, mixture pushed it up to around 4000 per flux computation.
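For a rough sense of scale (illustrative arithmetic only, assuming ~5 conserved components for the fluid-only case and taking the <10-nodes-per-component-per-neighbor estimate above at face value):

$$N_{\text{expected}} \lesssim 10 \cdot N_{\text{comp}} \approx 10 \cdot 5 = 50 \quad \text{nodes per boundary's flux computation,}$$

versus the observed ~1500 for the fluid-only case, i.e. roughly a factor of 30 more than the estimate allows, which fits the picture that the whole flux subexpression, not just the per-boundary indirections and comm nodes, is being duplicated for each boundary.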
This is a weak scaling (periodic 3D box, p=3) case to see how far we can push running real simulations on Lassen GPUs. It has 48k elements/rank and exercises most of the components used in our prediction. After running successfully for all mesh sizes (nelem, ngpus) = [(48k, 1), (96k, 2), (192k, 4), (384k, 8), (768k, 16), (1.5M, 32), (3M, 64), (6M, 128)], it fails at (nelem, ngpus) = (12M, 256).
*Note: To reproduce, make sure to request at least 5 hours on 64 Lassen batch nodes.
Instructions for reproducing on Lassen:
- Branch: mirgecom@production
- Driver: `examples/combozzle-mpi.py`
- Setup: `setup.yaml`

The case does have an unexplained hang at array context creation time which grows with the number of ranks. For this case, it will hang for about 3 hours, and then compile for about 10 minutes before finally ending with this error:
Gist with the entire batch output file: gist.github.com/MTCam/f48991f584755b6a8530dd9345dc2de4