MTCam opened this issue 2 years ago
Do you have the full exception message? (This is just the traceback leading up to the `raise`, AFAICT.)
Do you have the full exception message? (This is just the traceback leading up to the `raise`, AFAICT.)
I could not find a `raise` in the batch job output. This appeared to be the only out-of-the-ordinary message when the job failed. Here's a gist with the entire output: https://gist.github.com/MTCam/f48991f584755b6a8530dd9345dc2de4
What's happening there is that the transform code is running into a pattern of loop nests that it's not familiar with; see this code in @kaushikcfd's branch:
The message I was looking for is

```
NotImplementedError: Cannot fit loop nest 'frozenset({'idof_ensm10'})' into known set of loop-nest patterns.
```

which seems very weird; it's a loop over only intra-element DOFs without a loop over elements surrounding it.
As a near-term workaround, you can just hack the code I linked to fall back to the more general single-grid work-balancing strategy in the compiled-function case as well.
Beyond that, I would like to understand how we wound up with a lonely DOF loop.
Thanks for the report.
For this case, it will hang for about 3 hours,
It's safe to call this an error. However, I would expect `cfd_rhs` to have the largest compilation time, and the log file says: `transform_loopy_program for 'cfd_rhs': completed (112.52s wall 1.00x CPU)`. Am I missing the kernel to blame here?
Beyond that, I would like to understand how we wound up with a lonely DOF loop.
I agree. @MTCam: Does the driver work correctly while running with lower #ranks?
Thanks for the report.
It's safe to call this an error. However, I would expect `cfd_rhs` to have the largest compilation time, and the log file says: `transform_loopy_program for 'cfd_rhs': completed (112.52s wall 1.00x CPU)`. Am I missing the kernel to blame here?
This compile exits abnormally after about 10 minutes. The actual, full compile should take more than an hour, as evidenced by the compile times for the lower rank runs: (nranks, compile time) = [(32, 733s), (64, 1448s), (128, 2455s)]
Beyond that, I would like to understand how we wound up with a lonely DOF loop.
I agree. @MTCam: Does the driver work correctly while running with lower #ranks?
Yes, the driver runs successfully for a number of different meshes up to 6M elements. The difference between this run and the previous working run is a factor of 2 more ranks and a factor of 2 more elements.
(nranks, compile time) = [(32, 733s), (64, 1448s), (128, 2455s)]
Not sure I'm convinced that should be the case. Synchronizations/filesystem issues come to mind. (@inducer any guesses why this might be happening?)
This timing looks suspiciously exponential in the number of ranks. I do not understand why that would be the case.
@MTCam:
The only well-justified cost growth in compilation would be linear in the max number of neighbors
Are you suggesting this might be because of the increased number of comm pytato nodes? If so, lines in the log file of the form `find_partition: Split 136095 nodes into 3 parts, with [41034, 8180, 88338] nodes in each partition.` would be helpful to validate that hypothesis for the 3 mesh sizes with increasing compile times.
Wow, not sure I'd seen that. 136,095 nodes? That's kinda nuts. There's no reason for there to be that many of them, IMO.
This timing looks suspiciously exponential in the number of ranks. I do not understand why that would be the case.
@MTCam:
- What exactly is being measured?
The timer starts at the first call to `logpyle.tick_before`. The time will include the compile of the RHS (and only the RHS) and the time of the first step. The step timings here are O(1s). @matthiasdiener can verify what is being timed for this.
- Do you happen to know the average/max number of neighbors per rank? The only well-justified cost growth in compilation would be linear in the max number of neighbors.
Not offhand, but I do know it isn't going crazy. I will add a snippet to give us the breakdown.
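A minimal sketch of such a breakdown snippet (the per-rank set of neighbor ranks is assumed to be available from the driver; the ring partitioning below is only a made-up placeholder for it):

```python
from mpi4py import MPI


def report_neighbor_counts(comm, neighbor_ranks):
    """Print (min, max, avg) of the number of neighboring ranks."""
    n_local = len(neighbor_ranks)
    n_min = comm.allreduce(n_local, op=MPI.MIN)
    n_max = comm.allreduce(n_local, op=MPI.MAX)
    n_avg = comm.allreduce(n_local, op=MPI.SUM) / comm.size
    if comm.rank == 0:
        print(f"nranks={comm.size}: nnbrs (min, max, avg) = ({n_min}, {n_max}, {n_avg:.1f})")


comm = MPI.COMM_WORLD
# placeholder: a ring partitioning in which every rank touches two neighbors;
# in the real driver this set would come from the mesh partition
neighbor_ranks = {(comm.rank - 1) % comm.size, (comm.rank + 1) % comm.size} - {comm.rank}
report_neighbor_counts(comm, neighbor_ranks)
```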
The only well-justified cost growth in compilation would be linear in the max number of neighbors
Are you suggesting this might be because of the increased number of comm pytato nodes? If so, lines in the log file of the form `find_partition: Split 136095 nodes into 3 parts, with [41034, 8180, 88338] nodes in each partition.` would be helpful to validate that hypothesis for the 3 mesh sizes with increasing compile times.
FWIW, the compile times are independent of mesh size. For a given RHS, the compile times increase with increasing numbers of ranks. This is not a new "feature" of distributed lazy. It has always been this way - although now it may be a little worse than before. I am not sure because we were never able to run our cases out this far before now.
This is not a new "feature" of distributed lazy. It has always been this way - although now it may be a little worse than before.
It's safe to call it a bug that ought to be fixed, since, purely on principle, we don't expect it. If you have access to the log files for the 3 mesh sizes, that would be helpful for debugging this behavior.
- Do you happen to know the average/max number of neighbors per rank? The only well-justified cost growth in compilation would be linear in the max number of neighbors.
@inducer: I am going to post a running report on the number of rank neighbors for runs. As the data comes in, I'll edit this msg. I hope it can help us suss out whether the number of partition boundaries is playing a significant role.

| nranks | nnbrs (min, max, avg) |
|---|---|
| 1 | (0, 0, 0) |
| 2 | (1, 1, 1) |
| 4 | (3, 3, 3) |
| 8 | (7, 7, 7) |
| 16 | (10, 11, 10.9) |
| 32 | (10, 15, ) |
| 64 | (11, 19, ) |
| 128 | (12, 19, ) most with 17, 18 |
If you have access to the log files for the 3 mesh sizes that would be helpful to debug this behavior.
- 1.5M elems, 32 ranks: https://gist.github.com/db1139f6db451f0cd921fab6a6d69223
- 3M elems, 64 ranks: https://gist.github.com/07237afdfd5a561f250de8f1cb0244a9
- 6M elems, 128 ranks: https://gist.github.com/b9518b0f26d4a5269d5bf4082c76f9a5

Hope this helps! Thanks a ton for looking into this!
Thanks, so here is the ranks vs number of pytato nodes:
| nranks | max. num pytato nodes across ranks |
|---|---|
| 32 | 40703 |
| 64 | 76475 |
| 128 | 106285 |
| 256 | 136095 |
So, at least it's not growing linearly, and it starts tapering off once our number of neighboring ranks starts to taper off. So @inducer's hunch was correct, but I'm very surprised that communication nodes outgrow the total number of nodes on a single rank. Either something in `grudge` is adding redundant nodes or our materialization strategy in pytato is broken.
Because of these many nodes, we materialize more arrays, and the generated loopy kernel has an increasing number of statements.

| nranks | max. num lpy insns across ranks and parts | max. time for `transform_loopy_program` |
|---|---|---|
| 32 | 1710 | 145 secs |
| 64 | 3204 | 435 secs |
| 128 | 4449 | 807 secs |
Either something in `grudge` is adding redundant nodes or our materialization strategy in pytato is broken.
I completely agree! FWIW, I think that's also consistent with where we left off in terms of the mysteriously-large sum of flux expressions in July. I strongly suspect that something there is causing us to do way more work than necessary. Indirectly, I think this is also what caused the unreasonably-large argument counts. I tried to figure out for a while what that might be, but I ran out of time before my trip. To allow us to run at all, we decided to pursue SVM, but I think the underlying issue remains.
- Do you happen to know the average/max number of neighbors per rank? The only well-justified cost growth in compilation would be linear in the max number of neighbors.
@inducer, @kaushikcfd You guys are amazing. Fixing the number of rank neighbors to 2 gives a ~fixed compile time regardless of nranks, and excellent weak scaling.
| nranks | compile time (s) | walltime/RHS (s) |
|---|---|---|
| 1 | 282 | 0.18 |
| 2 | 282 | 0.182 |
| 4 | 368.1 | 0.218 |
| 8 | 368.5 | 0.219 |
| 16 | 373.9 | 0.224 |
| 32 | 374 | 0.227 |
| 64 | 372.8 | 0.223 |
Potentially we can do this with our production/prediction cases by slicing them in the flow-normal direction.
Heh. I'm glad that works, but it's not exactly a practical constraint. :) I.e., we shouldn't have to do that. But if it serves as a stopgap for a bit, I guess I could live with that.
Looking at this with @matthiasdiener, we were able to reproduce an allegedly pathological scaling for the DAG using the `wave-op-mpi.py` example in grudge.
For the wave example in mirgecom (and the corresponding grudge example):

| nranks | wave@mirgecom | wave@grudge |
|---|---|---|
| 1 | 168 | 166 |
| 2 | 253 | 250 |
| 4 | 407 | 402 |
| 8 | 554 | |
| 16 | 648 | |
A DAG viz can be found here, where we're tracking the ongoing effort. We plan on creating an even simpler example to get more insight about why the DAG is scaling this way.
You're right, that's a healthy amount of growth even just for the wave operator.
I find that the DAG pictures have low information density, i.e. it's hard to "see the forest", because the nodes are so big and so unevenly sized. I think we can help ourselves by making it so we can see more of the connectivity in one screenful.
You're right, that's a healthy amount of growth even just for the wave operator.
I find that the DAG pictures have low information density, i.e. it's hard to "see the forest", because the nodes are so big and so unevenly sized. I think we can help ourselves by making it so we can see more of the connectivity in one screenful.
Even our super simple grad-only DG operator produced a pretty big graph with 58 nodes. So we rolled a very simple non-DG example that I think illustrates what's going on. Please have a look at the bottom of the running notes for a summary: https://notes.tiker.net/t96b1_VPRdSeWkzJfuRmQw?both
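A generic pytato sketch, separate from the example in the notes, of how applying the same flux-like function once per received array duplicates its subexpression in the DAG, while concatenating first builds it only once (`fake_flux` and the `recv_*` placeholders are made-up stand-ins):

```python
import numpy as np
import pytato as pt


def fake_flux(u):
    # stand-in for a flux computation; every call builds a fresh subexpression
    return 0.5 * (u * u + 2.0 * u)


# one "received" placeholder per neighboring rank
recvd = [pt.make_placeholder(f"recv_{i}", shape=(1000,), dtype=np.float64)
         for i in range(8)]

# applying the flux separately to each neighbor replicates the flux subgraph
separate = sum(fake_flux(u) for u in recvd)
print("per-neighbor flux:", pt.analysis.get_num_nodes(separate))

# concatenating first means the flux subgraph appears only once
glued = pt.concatenate(recvd)
print("concatenated flux:", pt.analysis.get_num_nodes(fake_flux(glued)))
```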
To be clear, some level of duplication in the DAG is expected because we do not carry a global "registry" of constant arrays (or data wrappers in pytato-speak). So every time pytato sees a constant (like an array used for indirect indexing, or an operator matrix), it views it as new and makes a separate graph initially. Close to the first step in the transform pipeline, this redundancy is removed.

To see what this will do, simply feed the DAG through `deduplicate_data_wrappers` and visualize it after that. If you still observe more nodes than you expect after that, then the issue you see is likely to be real.
To be clear, some level of duplication in the DAG is expected because we do not carry a global "registry" of constant arrays (or data wrappers in pytato-speak). So every time pytato sees a constant (like an array used for indirect indexing, or an operator matrix), it views it as new and makes a separate graph initially. Close to the first step in the transform pipeline, this redundancy is removed.

To see what this will do, simply feed the DAG through `deduplicate_data_wrappers` and visualize it after that. If you still observe more nodes than you expect after that, then the issue you see is likely to be real.
The number of graph nodes reported, and created by the examples that reproduce and demonstrate the issue, is unchanged by the `deduplicate_data_wrappers` utility. I guess we are probably reporting them after such duplications are eliminated.
I've been using this `compile_trace_callback` function:
```python
import pytato as pt


def my_ctc(prg_id, stage, prg):
    from pytato.transform import deduplicate_data_wrappers
    if stage == "pre_find_distributed_partition":
        print(f"NUM DAG NODES BEFORE DeDupe: {pt.analysis.get_num_nodes(prg)}")
        prg_dedupe = deduplicate_data_wrappers(prg)
        print(f"NUM DAG NODES AFTER DeDupe: {pt.analysis.get_num_nodes(prg_dedupe)}")
        pt.show_dot_graph(prg_dedupe)
        1/0  # deliberately abort after dumping the graph
```
I think that the example at the bottom of the notes demonstrates that the observed DAG scaling with BCs and with partition boundaries is actually expected. The "sub-dag" that represents the flux calculations is duplicated with every unique BC or partition boundary. Since communication/recv dag nodes are unique for each mpi neighbor, the flux calculation dag nodes are repeated for each mpi neighbor. We are struggling to come up with a viable plan for what to do about it.
I agree with you; we certainly splat a copy of the facial flux into the DAG once for each parallel boundary. I think there are probably other less-than-ideal things we're doing, but that one seems sufficiently straightforward to fix. The duplication of DAG nodes can be avoided by concatenating (via `actx.np.concatenate`) all the flux data into one array per "input", applying the flux function to all those arrays in one big batch, and then sawing them back apart into appropriately-sized arrays. The best bit about this is that you can try this in mirgecom right now, without infrastructure help.
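A minimal sketch of this concatenate/compute/split pattern, written against plain numpy (the `xp` argument stands in for the array namespace; in mirgecom it would presumably be `actx.np`, per the suggestion above; `batched_flux` and the toy flux function are made-up names):

```python
import numpy as np


def batched_flux(flux_fn, per_boundary_arrays, xp=np):
    """Apply flux_fn once to the concatenation of all per-boundary arrays,
    then slice the result back into per-boundary pieces."""
    sizes = [ary.shape[0] for ary in per_boundary_arrays]
    glued = xp.concatenate(per_boundary_arrays)   # one array per "input"
    glued_flux = flux_fn(glued)                   # flux expression built once
    offsets = np.cumsum([0] + sizes)
    return [glued_flux[offsets[i]:offsets[i+1]] for i in range(len(sizes))]


# toy usage: three "boundaries" with different numbers of face DOFs (one field)
pieces = [np.random.rand(n) for n in (10, 7, 13)]
fluxes = batched_flux(lambda u: 0.5*u*u, pieces)
assert [f.shape for f in fluxes] == [p.shape for p in pieces]
```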
I can see two possible gotchas, but maybe @kaushikcfd can help with those:

- (...) `ImplStored` tag on the glued array? (https://github.com/inducer/pytato/issues/362)
- The "gluing" also needs to happen to things that the "user" (you, @MTCam :) doesn't consider inputs, but pytato does, i.e. things like normals and index arrays.

(...) but that one seems sufficiently straightforward to fix.
This is great. My attempts so far have failed.
The duplication of DAG nodes can be avoided by concatenating (via `actx.np.concatenate`) all the flux data into one array per "input", applying the flux function to all those arrays in one big batch, and then sawing them back apart into appropriately-sized arrays. The best bit about this is that you can try this in mirgecom right now, without infrastructure help.
After Friday's discussion, we came up with a short list of things to try. I had thought that projecting all of the received data to a surface that was all of the remote partition boundaries union'd together would resolve this (duplicating not the flux calc, but instead just the projection), but @majosm suggested that just projecting to "all_faces" would be easier and quicker to try. I tried this, but it did not seem to resolve the issue. I think lazy was too smart and saw right through this attempt to hide where the data came from.
I hope this concatenation strategy works better. Based on the failed attempt to resolve by projecting to "all_faces", do you still expect concatenating the arrays will work around the issue?
What about "evaluating" each received array as a part of the "communication dag node"? Would that help?
I can see two possible gotchas, but maybe @kaushikcfd can help with those: (...)
- The "gluing" also needs to happen to things that the "user" (you, @MTCam :) doesn't consider inputs, but pytato does, i.e. things like normals and index arrays.
Got it. The normals I think seem straightforward, but do we explicitly handle index arrays? I assume you are talking about index arrays down in the DOFArray or other discretization data structure, beyond those that would be updated automatically by `concatenate` (if any). Can you or @kaushikcfd give some more specific hints about which index arrays we'll need to pay attention to and where they live?
I was going through MikeC's driver with a single $\nabla\cdot u$ and tried running it over 16 ranks, for which Metis gives the following partitioning (partitioning figure omitted), and we get the following RHS kernel for each rank: https://gist.github.com/kaushikcfd/13dcb39130c7e8f5195c224b751a8b90. The generated kernel seems very close to what one would write up by hand (at least once we have fixes for https://github.com/inducer/meshmode/pull/354 and https://github.com/inducer/meshmode/issues/355).
One odd characteristic of the generated code is the way the communicated values are handled. It is currently handled as:

1. Facial values are received from a neighbor. This is received as an array per field per neighbor.
2. Flux values are computed for each neighboring boundary, interior faces, and boundary faces, and stored into a separate array for each field.
3. The flux contributions from (2) are added up.

The problem is in step (2), where the flux computation for each neighboring rank is "duplicated". If one were to implement this by hand, one would concatenate all the flux values contributing to the interior faces and perform the numerical flux computation using this single concatenated array per field (instead of one for each of the $N_{\text{field}} \cdot N_{\text{bdry}}$ arrays that we currently have).
The current expression structure is not only adding too many nodes in the graph but also forcing the computation of smaller-sized arrays (per-boundary facial values), which is a poor workload for wide SIMD devices.

I'm guessing this demands some changes in `grudge.op.interior_trace_pairs`.
(MikeC's `my_add` code from the notes also highlights the problem, but in that case, it is not trivial to concatenate the intermediate arrays to use the same subgraphs. However, this is not the case for DG.)
One odd characteristic of the generated code is the way the communicated values are handled. It is currently handled as:

1. Facial values are received from a neighbor. This is received as an array per field per neighbor.
2. Flux values are computed for each neighboring boundary, interior faces, and boundary faces, and stored into a separate array for each field.
3. The flux contributions from (2) are added up.

The problem is in step (2), where the flux computation for each neighboring rank is "duplicated". If one were to implement this by hand, one would concatenate all the flux values contributing to the interior faces and perform the numerical flux computation using this single concatenated array per field (instead of one for each of the $N_{\text{field}} \cdot N_{\text{bdry}}$ arrays that we currently have).

The current expression structure is not only adding too many nodes in the graph but also forcing the computation of smaller-sized arrays (per-boundary facial values), which is a poor workload for wide SIMD devices.

I'm guessing this demands some changes in `grudge.op.interior_trace_pairs`.
One possible caveat: concatenating all of the arrays and computing the fluxes together would (I think) remove the ability to overlap communication/computation here. Not sure if this is something distributed lazy currently does/will do?
Also, if I understand correctly, this is close to what @MTCam tried by projecting all of the trace pairs to `"all_faces"` and summing before passing them to the flux computation. Weird that it didn't seem to help. Though some work is done in `interior_trace_pairs` after receiving (`opposite_face_connection`, etc.), not sure how much that contributes.
Do we have a sense of what a reasonable number of nodes per flux computation should be? >5000 per trace pair (rough guess) seems like a lot?
(I think) remove the ability to overlap communication/computation here
The restructuring I was pointing to comes after the facial values are received, so I think it shouldn't hurt it.
Do we have a sense of what a reasonable number of nodes per flux computation should be? >5000 (rough guess) seems like a lot?
I think that with every new neighbor there should be <10 extra nodes in the DAG per flux component: another `actx.np.where`, ielement indirection, idof indirection, comm node, extra data wrappers.
(I think) remove the ability to overlap communication/computation here
The restructuring I was pointing to comes after the facial values are received, so I think it shouldn't hurt it.
I share @majosm's concern here. The duplication of the flux computation is what allows the fluxes to be computed from some boundaries before communication has completed on all boundaries. Requiring all communication to complete before computing fluxes seems to work counter to an important advantage/feature of lazy.
Do we have a sense of what a reasonable number of nodes per flux computation should be? >5000 (rough guess) seems like a lot?
I think that with every new neighbor there should be <10 extra nodes in the DAG per flux component: another `actx.np.where`, ielement indirection, idof indirection, comm node, extra data wrappers.
I have not looked for the prediction driver yet, but for the example grad(u)-only operator, the number of DAG nodes for the flux computation is ~40. For a CNS fluid-only case it was ~1500. IIRC, mixture pushed it up to around 4000 per flux computation.
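For a rough sense of scale (illustrative arithmetic only, assuming ~5 conserved components for the fluid-only case and taking the <10-nodes-per-component-per-neighbor estimate above at face value):

$$N_{\text{expected}} \lesssim 10 \cdot N_{\text{comp}} \approx 10 \cdot 5 = 50 \quad \text{nodes per boundary's flux computation,}$$

versus the observed ~1500 for the fluid-only case, i.e. roughly a factor of 30 more than the estimate allows, which fits the picture that the whole flux subexpression, not just the per-boundary indirections and comm nodes, is being duplicated for each boundary.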
This is a weak scaling (periodic 3D box, p=3) case to see how far we can push running real simulations on Lassen GPUs. It has 48k elements/rank and exercises most of the components used in our prediction. After running successfully for all mesh sizes (nelem, ngpus) = [(48k, 1), (96k, 2), (192k, 4), (384k, 8), (768k, 16), (1.5M, 32), (3M, 64), (6M, 128)], it fails at (nelem, ngpus) = (12M, 256).
*Note: To reproduce, make sure to request at least 5 hours on 64 Lassen batch nodes.
Instructions for reproducing on Lassen:
- Branch: mirgecom@production
- Driver: `examples/combozzle-mpi.py`
- Setup: `setup.yaml`

The case does have an unexplained hang at array context creation time which grows with the number of ranks. For this case, it will hang for about 3 hours, and then compile for about 10 minutes before finally ending with this error:
Gist with the entire batch output file: gist.github.com/MTCam/f48991f584755b6a8530dd9345dc2de4