trixi-framework / Trixi.jl

Trixi.jl: Adaptive high-order numerical simulations of conservation laws in Julia
https://trixi-framework.github.io/Trixi.jl
MIT License

Investigate (and fix) excessive memory use for parallel simulations #1353

Open sloede opened 1 year ago

sloede commented 1 year ago

At the moment it seems like we have considerable issues with non-parallelized memory usage when running massively parallel simulations. What do I mean by that?

When doing a weak scaling experiment, i.e., a parallel scaling experiment where the problem size per rank is fixed, we always seem to reach a point where we run out of memory. For example, I am able to scale a problem size of 1024 elements/rank, with 128 ranks/node, up to 16 nodes on Hawk. When going beyond that (e.g., to 64 nodes), the jobs fail with an OOM error.

Maybe something else is at play here, but my first suspect would be that we somehow allocate memory that is of size O(#ranks) or O(nelements_global), and at some point this becomes just too much for the memory per node.
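To put rough numbers on this suspicion, here is a back-of-the-envelope sketch of how much memory per node a single array of size O(nelements_global) would take if every rank held its own copy. The function name and the 8 bytes per entry (e.g., one `Int64` or `Float64` per element) are assumptions for illustration; the 128 ranks/node and 1024 elements/rank are taken from the experiment described above.

```python
GIB = 1024**3


def replicated_bytes_per_node(nodes, ranks_per_node=128,
                              elements_per_rank=1024, bytes_per_entry=8):
    """Per-node footprint of ONE array of length nelements_global
    that is replicated on every rank (hypothetical worst case)."""
    ranks = nodes * ranks_per_node
    nelements_global = ranks * elements_per_rank
    bytes_per_rank = nelements_global * bytes_per_entry
    # Every rank on the node holds its own full copy.
    return bytes_per_rank * ranks_per_node


if __name__ == "__main__":
    for nodes in (16, 64, 256):
        gib = replicated_bytes_per_node(nodes) / GIB
        print(f"{nodes:4d} nodes: {gib:6.1f} GiB per node per replicated array")
```

Under these assumptions, one such array costs 2 GiB per node at 16 nodes and 8 GiB per node at 64 nodes. The per-node cost grows linearly with the node count even though the work per rank stays constant, which is the kind of growth that would eventually break a weak scaling run.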

lchristm commented 1 year ago

This is quite surprising since the weak scaling on JUSUF (256 GB memory per node just like Hawk, same number of cores per node) worked fine on more than 64 nodes with 2048 elements per rank with Julia 1.7. If I recall correctly, we encountered increased memory usage with Julia 1.8 before.

That said, we definitely allocate a few O(#ranks) size arrays, e.g. here or in p4est here and here. These are just the few that immediately come to mind so there might be more.

sloede commented 1 year ago

Yeah, it was surprising to me too. But thanks for reminding me that it worked on JUSUF; maybe we can rerun the experiment there with Julia v1.8 as well.

In general I don't think that O(#ranks)-sized arrays should have a large impact. After all, we are still talking about only something like 16k or 32k ranks. I really wonder what's going on on Hawk, since the reason for my aborts is rather definite,

[image: screenshot showing the reason for the job aborts]

while I am sure that the number of elements/rank remains constant as it should 🤷‍♂️
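As a quick sanity check of the claim that O(#ranks) arrays should be harmless at this scale, the following sketch estimates the per-node footprint of one such array. The function name and the 8 bytes per entry are assumptions for illustration; the 128 ranks/node and the 16k/32k rank counts come from the discussion above.

```python
MIB = 1024**2


def per_node_bytes_rank_array(n_ranks, ranks_per_node=128, bytes_per_entry=8):
    """Per-node footprint of ONE array with one entry per MPI rank,
    assuming each rank on the node keeps its own copy."""
    return n_ranks * bytes_per_entry * ranks_per_node


if __name__ == "__main__":
    for n_ranks in (16 * 1024, 32 * 1024):
        mib = per_node_bytes_rank_array(n_ranks) / MIB
        print(f"{n_ranks:6d} ranks: {mib:5.1f} MiB per node per O(#ranks) array")
```

Under these assumptions, a single O(#ranks) array amounts to 16 MiB per node at 16k ranks and 32 MiB per node at 32k ranks, so it would take a very large number of such arrays (or much larger entries) to matter against 256 GB per node.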