smillerc / MPIHaloArrays.jl

An array type for MPI halo data exchange in Julia
MIT License

Roadmap #6

Open smartalecH opened 1 year ago

smartalecH commented 1 year ago

This package looks great! It's simple and flexible enough for the kinds of finite-difference problems I'm interested in tackling.

Can I ask what your roadmap looks like? Have you tried performing large-scale analysis (100+ procs) to stress test this? Have you explored asynchronous halo updates when using, e.g., GPUs? Do you have a list of tasks you'd like to work on but don't have the bandwidth for? Are there any "sharp bits" you are avoiding?

smillerc commented 1 year ago

Thanks! It's been a while since I last worked on MPIHaloArrays, but I've been planning to incorporate it into a high-performance finite-volume hydrodynamics code I've been developing (within the next month or so).

I haven't done any major stress testing yet -- I have access to the core counts you mention, so it's just a matter of writing and running some good tests.

Some of the features I'd like to see implemented are (in no particular order):

  1. User-selectable communication backends, e.g. p2p (default), one-sided, shared-memory (see the dispatch sketch after this list).
    • I started experimenting with one-sided communication but never finished it. One-sided gets/puts should provide some extra performance.
  2. Graph topology: The only topology currently supported is Cartesian, but I'd like to see a graph topology as an option too.
    • When I started writing MPIHaloArrays, it wasn't available in MPI.jl yet (it is now)
  3. GPU-awareness
    • I just haven't done any testing with this, so I don't have a good idea what needs to be added
    • This is the area I know the least about
  4. Performance tweaking: I think there is some performance to be gained by caching the halo/ghost-cell regions during communication, but I haven't done enough benchmarking to know for sure.
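
For the backend selection in item 1, a rough sketch of what dispatching on a backend type could look like (none of these types exist in MPIHaloArrays yet, and p2p_exchange!/rma_exchange!/shm_exchange! are placeholders for the actual implementations):

abstract type CommBackend end
struct PointToPoint <: CommBackend end   # default: nonblocking send/recv pairs
struct OneSided     <: CommBackend end   # RMA puts/gets into exposed MPI windows
struct SharedMemory <: CommBackend end   # intra-node shared-memory windows

# The existing updatehalo!(A) call could keep its form and simply forward
# to a backend-specific method chosen by dispatch.
updatehalo!(A, backend::CommBackend = PointToPoint()) = _exchange!(A, backend)
_exchange!(A, ::PointToPoint) = p2p_exchange!(A)   # today's send/recv path
_exchange!(A, ::OneSided)     = rma_exchange!(A)   # to be written
_exchange!(A, ::SharedMemory) = shm_exchange!(A)   # to be written

A user would then opt in per call, e.g. updatehalo!(A, OneSided()), without learning any new function names.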

I'm open to ideas/contributions!

I'd say GPU-awareness is the one area I don't have the bandwidth for at the moment. I'm not as familiar with it, so it's hard to judge what it would take. Stress testing will also likely show where the weak points are and provide opportunities for improvement. I wanted the package to be simple to use and to "just work", so I haven't done a ton of fine-tuning yet.

smartalecH commented 1 year ago

This is great! Thanks for the thorough reply.

I'm happy to test this using native GPU arrays. I'm also interested in enabling basic RMA patterns (and can contribute PRs etc).

It may be useful to build a few routines that generate different kinds of iterators for each array. For example, the code currently provides the indices of the inner and halo regions (very useful). But in GPU halo codes we often like to split the kernel into two tasks that execute in tandem (the GPU's asynchronous streams each fan work out over thousands of computational cores): (1) update the "exterior" of the interior region, followed by a communication step that broadcasts the halo update; (2) update the "interior" of the interior region. The challenge is then to load-balance these two operations/kernels so that the GPU can effectively "hide" the communication overhead behind the computation. As such, it would be great to have a function that returns the appropriate iterators for a user-specified "exterior region" (of the interior), so that looping over these regions is seamless.
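
To make that concrete, here's a rough sketch of the pattern; updatehalo! is the package's existing exchange call, while the splitting logic, the bounds, and kernel! are all illustrative (nothing like split_step! exists in MPIHaloArrays today):

# One time step with the rim/core split. `A` is an MPIHaloArray whose owned
# (non-halo) region spans ilo:ihi x jlo:jhi and whose stencil reads `w` cells
# from each neighbour; `kernel!(A, idxs)` applies the physics to those cells.
function split_step!(kernel!, A, ilo, ihi, jlo, jhi, w)
    owned = CartesianIndices((ilo:ihi, jlo:jhi))
    core  = CartesianIndices((ilo + w:ihi - w, jlo + w:jhi - w))
    rim   = setdiff(owned, core)   # the outer shell the neighbours depend on

    kernel!(A, rim)    # (1) update the cells whose values must be exchanged
    updatehalo!(A)     # (2) blocking today; a non-blocking variant would return early
    kernel!(A, core)   # (3) with an async exchange this work hides the communication
    return A
end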

Finally, I'll add that many halo codes are multi-physics, so different regions in space/time will have (slightly) different kernels that operate on additional arrays. Naively, one might just allocate all arrays throughout space and use the most general kernel over the entire domain. But that not only wastes a tremendous amount of memory, it also quickly saturates the memory bandwidth (the limiting resource for many halo codes). To overcome this, people often chunk the domain by kernel, not just by proc. In other words, a single MPI proc may have multiple array chunks (which can be dispatched asynchronously on the same GPU). It might be worth thinking about how such an approach could best leverage MPIHaloArrays cheaply (with minimal bookkeeping).
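
Here's a toy sketch of the multiple-chunks-per-proc idea; the names, sizes, and stand-in kernels are purely illustrative, and in practice each chunk would be its own MPIHaloArray (with its own halo exchange, and its own stream on a GPU):

# Each rank holds several (array, kernel) pairs instead of one monolithic
# array running the most general kernel everywhere.
advect!(U) = (U .*= 0.99)            # stand-in kernel for region 1
react!(U)  = (U .+= 0.01 .* U .^ 2)  # stand-in kernel for region 2

chunks = ((rand(64, 64), advect!),
          (rand(16, 16), react!))

for (U, kernel!) in chunks
    kernel!(U)
    # updatehalo!(U)   # each chunk would exchange only its own halo
end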

smillerc commented 1 year ago

When you say iterators, do you mean Tuples of indices? At the moment MPIHaloArrays doesn't provide iterators per se, but rather sets of indices. You then create an iterator via ilo:ihi, CartesianIndices(ilo:ihi), or something similar. Adding new functions that provide indices for different regions of the array would be simple to do.
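
Something like this thin wrapper is probably all it would take; owned_bounds and the plain array below are stand-ins for an MPIHaloArray and its existing index queries:

# Stand-ins: an 18x18 array with 1-cell halos and hard-coded owned bounds.
A = zeros(18, 18)
owned_bounds(A) = (2, 17, 2, 17)

# The wrapper: hand back a ready-to-use iterator instead of raw bounds.
function domain_iterator(A)
    ilo, ihi, jlo, jhi = owned_bounds(A)
    return CartesianIndices((ilo:ihi, jlo:jhi))
end

for I in domain_iterator(A)   # the loop body never sees ilo/ihi/jlo/jhi
    A[I] += 1.0
end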

I was also contemplating using OffsetArrays such that the array indices of each MPI chunk are based on their global position. Right now each array defaults to starting at 1, which means that during gather/scatter operations there is some bookkeeping needed to place the chunks in the proper location. The only issue with offset indices is the potential for confusion, and the opportunity for users to write loops such as for j in 1:length(A), which is generally discouraged in Julia anyway.
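
For reference, that would look roughly like the following with OffsetArrays.jl (the global ranges are purely illustrative):

using OffsetArrays

# A rank whose 32x32 chunk covers global cells 65:96 x 1:32 would index its
# local storage by those global cell numbers.
local_chunk = zeros(32, 32)
A_global = OffsetArray(local_chunk, 65:96, 1:32)
A_global[65, 1] = 1.0   # the first cell this rank owns, addressed globally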

I'm looking forward to what you find out with testing GPU arrays.

smartalecH commented 1 year ago

When you say iterators, do you mean Tuples of indices?

I was actually referring to native iterators (like the CartesianIndices(ilo:ihi) you suggested). The reason is that it's nice to have the iterator itself abstracted away from the user (so I don't have to think about how the loop starts or ends, which proc I'm on, etc.). Of course, you'd want to keep your tuple functions; just add a simple wrapper that calls them to produce various useful iterators.

I was also contemplating using OffsetArrays such that the array indices of each MPI chunk are based on their global position.

This is 100% your package, but I'll give you my 2 cents. Personally, I think offsetting each array by its proc location has a few drawbacks. For one, you limit the functionality of the halo array to Cartesian-like grids (for the most part). But more importantly, it seems to go against the MPI "flow", where every proc "does the same thing", just on different data. If each array is localized to a global grid, I can imagine some mental gymnastics needing to take place for various operations.

But I do like the use of offset arrays. Instead, I would use them to reference the halo region. That way you can use simple broadcasting (into the halo) and not worry about overstepping the bounds. For example:

function my_kernel!(B, A, dx)
    # A's indices 0 and lastindex(A) are halo cells, so the shifted slices
    # below stay in bounds while covering the whole owned region
    B .= (A[1:end] .- A[0:end-1]) ./ dx
end

where the indices 0 and end actually correspond to the halo.

(I know Julia style often favors explicit loops over vectorized broadcasting... but with CUDA.jl it's typically recommended to start with broadcasting and then gradually fuse operations into custom kernels as needed, since the compiler is pretty dang good already).
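
Spelled out end-to-end with OffsetArrays.jl, a minimal 1-D setup for the my_kernel! above might look like this (the size, the one-cell halo, and dx are illustrative):

using OffsetArrays

nx, dx = 8, 0.1
A = OffsetArray(rand(nx + 2), 0:nx+1)   # owned cells at 1:nx; 0 and nx+1 are halo
B = zeros(nx + 1)                       # one difference per pair of adjacent cells

# the slices A[1:end] and A[0:end-1] reach into both halo cells, so the
# broadcast in my_kernel! never steps out of bounds
my_kernel!(B, A, dx)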