jipolanco / PencilArrays.jl

Distributed Julia arrays using the MPI protocol
https://jipolanco.github.io/PencilArrays.jl/dev/
MIT License

PencilArrays + CUDA #21

Closed. tomchor closed this issue 2 years ago

tomchor commented 3 years ago

Hello! Let me start by thanking you for this project! It's super useful and it's impressive that its benchmarks are better than P3DFFT!

I'm currently working with and contributing to Oceananigans, which has started using your packages (both this one and PencilFFTs.jl) to implement some distributed functionality. @ali-ramadhan implemented that and it's been working well, but the end goal for us is really to parallelize our computations across multiple GPUs, and I think the first step is to make this package play nice with CUDA.

So here are a couple of questions if you don't mind:

- Do you have plans to implement that functionality?
- What would be the way to do it? (And I guess how easy do you anticipate this to be?)

Thanks!

CC: @francispoulin @ali-ramadhan

francispoulin commented 3 years ago

So far, in the shallow water model, we do have pencil arrays working; we have tested them on CPUs and they seem to work well. More tests need to be done, of course.

We also have the functionality of distributing the solution across different GPUs. We have a proof of concept in the shallow water model but have not done any real testing. This got us looking at how the model does on GPUs vs CPUs, and we are doing some tests in the shallow water model. So far GPUs are performing very well, maybe too well?

Since GPUs can do operations faster, I don't think you would need to distribute across as many of them as you do with CPUs, but even a few would be helpful when the model size gets big enough.

I think this would be a great thing to do, and I would like to play with this in the shallow water model. I would guess that this could extend to other models, but the only issue would be the pressure solve, as usual.

Maybe @ali-ramadhan would have more to add from his perspective?

christophernhill commented 3 years ago

@ascheinb was playing around with this a bit. @ascheinb, @glwagner and I did find that PencilArrays seems quite happy with CuArrays - which is awesome. @ascheinb was trying an example that then pushed some of the arrays through the CUDA FFT spaghetti!

jipolanco commented 3 years ago

Thank you all for your interest! I'm happy to know that PencilArrays and PencilFFTs are already being used in Oceananigans for parallelisation on CPUs, and I'd be glad to help make them work on GPUs.

To answer @tomchor's questions,

Do you have plans to implement that functionality?

In principle it's not in my immediate plans, since for now I haven't personally played much with GPUs (let alone multi-GPU systems), and my main motivation is on CPUs. But, as discussed in https://github.com/jipolanco/PencilFFTs.jl/issues/3, I would be very happy to add support for GPU arrays and in particular for CuArrays.

What would be the way to do it? (And I guess how easy do you anticipate this to be?)

From the PencilArrays side, the first thing to do is to determine what works and what doesn't with CuArrays. I guess the main difficulty is data transpositions, which I presume you will still need for multidimensional FFTs on GPUs. These would need to be done using CUDA-aware MPI. It would be good to know whether they work transparently with the current implementation. To be honest, I would be very (pleasantly) surprised if this were the case.
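
For instance, a first sanity check could look something like this (a minimal sketch, not specific to PencilArrays; it assumes MPI.jl's MPI.has_cuda() and an MPI build with CUDA support):

    using MPI, CUDA

    MPI.Init()

    # MPI.has_cuda() reports whether the underlying MPI library was built with
    # CUDA support, i.e. whether device buffers can be passed directly to MPI calls.
    @show MPI.has_cuda()

    # Minimal check: pass a CuArray directly to a collective. With a CUDA-aware
    # MPI this should work without copying the data to the host first.
    comm = MPI.COMM_WORLD
    u = CUDA.ones(Float64, 8)
    MPI.Allreduce!(u, +, comm)   # in-place sum across ranks, on device buffers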

PencilFFTs will also need support for CUDA FFT plans, which I guess should be easy. If I understand correctly, to build a CUDA FFT plan one just needs to call FFTW.plan_* with a CuArray, is that right? In that case, it may actually work without any changes.
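
Something like the following is what I have in mind (a sketch; it assumes that CUDA.jl implements the generic AbstractFFTs planning interface, so that plan_fft on a CuArray returns a cuFFT plan):

    using FFTW   # re-exports the generic plan_fft from AbstractFFTs
    using CUDA

    u = CuArray(rand(ComplexF64, 64, 32, 16))  # data already on the GPU

    p = plan_fft(u)   # should dispatch to a cuFFT plan rather than an FFTW one
    û = p * u         # forward transform executed on the GPU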


@ascheinb, @glwagner and I did find that PencilArrays seems quite happy with CuArrays - which is awesome.

@christophernhill That's really good news! I would love to hear more details on this. Have you tested transpositions in particular?

christophernhill commented 3 years ago

@jipolanco we did some very simple transpose tests and they seemed to work - which was awesome.

We added

using CUDA

and switched to

  Ax = PencilArray(pen_x, CuArray{Float64}(undef, (42, 31, 29)))

and

 Ay = PencilArray(pen_y, CuArray{Float64}(undef, (42, 31, 29)))

in https://github.com/jipolanco/PencilArrays.jl#quick-start

and things ran OK on a single rank.

So not a very comprehensive test, but a good start. I can try a couple more things tomorrow.

jipolanco commented 3 years ago

and things ran OK on a single rank.

So not a very comprehensive test, but a good start. I can try a couple more things tomorrow.

I agree, that's a very good start!

A few tests that I can suggest right now:

- running the same example on more than one MPI rank;
- trying different dimension permutations;
- testing gather to collect the data on a single process.

By the way, @christophernhill's example makes me think that it would be convenient to have a higher-level constructor for PencilArrays wrapping CuArrays (and, more generally, arbitrary dense array types). I'm thinking about something like:

PencilArray{T}([array_type = Array], undef, pencil::Pencil)

which would be consistent with the current PencilArray{T}(undef, pencil) constructor. Or maybe the undef above could be removed, for simplicity? I'm open to all suggestions.
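
To make the idea concrete, here is a rough sketch of a possible implementation (hypothetical; it assumes a helper like size_local(pen, MemoryOrder()) returning the local dimensions of the pencil):

    # Hypothetical sketch of the proposed constructor, not the actual API:
    # dispatch on the wanted array type to allocate the data, then wrap it.
    function PencilArray{T}(::Type{A}, ::UndefInitializer, pen::Pencil) where {T, A <: AbstractArray}
        dims = size_local(pen, MemoryOrder())  # assumed helper giving the local dimensions
        return PencilArray(pen, A{T}(undef, dims))
    end

    # Assumed usage, mirroring the existing PencilArray{T}(undef, pencil):
    # Ax = PencilArray{Float64}(CuArray, undef, pen_x)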

christophernhill commented 3 years ago

@jipolanco I can try some dims variations and gather.

I put some WIP pieces here

https://github.com/christophernhill/PencilArrays.jl/tree/master/gpu-experimenting 

I will chat with @ali-ramadhan about configuring some automated testing on a GPU machine in case this gets real.

So far things still look good. It looks like fill! ( https://github.com/christophernhill/PencilArrays.jl/blob/b12a127275f9932dbc825574e49f1de6a85d736a/gpu-experimenting/p1.jl#L24 ) does some element-by-element operations; e.g. see these messages that come from fill!: https://github.com/christophernhill/PencilArrays.jl/blob/master/out/1/rank.00/stderr . That is probably inevitable for some initializations, but for others it should be possible to have a GPU-optimized interface.

jipolanco commented 3 years ago

Thanks @christophernhill for the tests!

I will chat with @ali-ramadhan about configuring some automated testing on a GPU machine in case this gets real.

That would be great!

It looks like fill! ( https://github.com/christophernhill/PencilArrays.jl/blob/b12a127275f9932dbc825574e49f1de6a85d736a/gpu-experimenting/p1.jl#L24 ) does some element-by-element operations

You seem to be right about this. Since there is no specific fill! definition for PencilArrays, the call falls back to the generic definition for AbstractArrays in Julia Base, which fills the array element by element. This can easily be fixed by defining fill!(A::PencilArray, x) = fill!(parent(A), x), as I'm guessing that CUDA.jl provides an optimised definition of fill!. I'll push a fix for this. There may be other functions that will also need to be defined following this logic.
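
Spelled out, the fix would be something like this (a sketch; it simply forwards to the parent array and returns the PencilArray, as fill! conventionally does):

    # Forward fill! to the underlying array, so that CUDA.jl's optimised fill!
    # is used when the data lives on the GPU.
    function Base.fill!(A::PencilArray, x)
        fill!(parent(A), x)
        return A
    end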

Otherwise, from your initial tests, I'm just a bit worried about the last line in out/1/rank.00/stdout:

maxAy: NaN

which suggests that something went wrong in the transposition of GPU data. It would be interesting to determine where things fail exactly.

christophernhill commented 3 years ago

@jipolanco oops - I think I pushed the output from a test where I deliberately commented out the AyC transpose to check it would error! Sorry about that 🔥 !!

With the transpose in place the results do match. I will tidy and update a bit later when I do gather etc... tests.

jipolanco commented 3 years ago

That's great then!! I'm happy to know that it just works!

christophernhill commented 3 years ago

@jipolanco it looks like a dimension permutation test works ( updated code here https://github.com/christophernhill/PencilArrays.jl/blob/02a648481fdd5273b4f97d18a848690599dbc3b5/gpu-experimenting/p1.jl#L38 ).

gather() seems to have problems (see error message below). It looks like a receive problem that deadlocks with a bad address. I am not sure if it is PencilArrays or some lower level that has the problem. Looking at the gather code, it's not clear to me that it should fail.

I don't think gather is critical/needed for our FFT use case, so generally things still look good. Also, broadcasting to a local CPU copy seems to work ( https://github.com/christophernhill/PencilArrays.jl/blob/02a648481fdd5273b4f97d18a848690599dbc3b5/gpu-experimenting/p1.jl#L76 ), so a hack could be to dispatch gather on CuArray, first do an intermediate local-to-rank CPU broadcast copy, and then dispatch to a CPU gather?
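
Something along these lines is what I mean (a rough sketch with a made-up helper name; it assumes pencil(A) returns the underlying decomposition and that gather works on CPU-backed arrays as it does today):

    # Hypothetical helper, not part of the package: copy the local GPU data to a
    # CPU-backed PencilArray, then reuse the existing CPU gather path.
    function gather_via_cpu(A::PencilArray, root::Integer = 0)
        A_cpu = PencilArray(pencil(A), Array(parent(A)))  # device-to-host copy of the local data
        return gather(A_cpu, root)
    end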

The current tests are on a system with 4 Titan V GPUs and a PCI backplane. The Titan V has unified virtual memory, so I thought it might work. I am going to try on a higher-end system with NVLink, etc.

Messages from gather on a GPU-memory PencilArray ( https://github.com/christophernhill/PencilArrays.jl/blob/02a648481fdd5273b4f97d18a848690599dbc3b5/gpu-experimenting/p1.jl#L81 ) when it is active:

[1616114675.616675] [tartarus:31858:0]         cma_ep.c:113  UCX  ERROR process_vm_readv delivered 0 instead of 24304, error message Bad address
[1616114675.616716] [tartarus:31858:0]         cma_ep.c:113  UCX  ERROR process_vm_readv delivered 0 instead of 24304, error message Bad address
[1616114675.616729] [tartarus:31858:0]         cma_ep.c:113  UCX  ERROR process_vm_readv delivered 0 instead of 24304, error message Bad address
[1616114675.616753] [tartarus:31858:0]         cma_ep.c:113  UCX  ERROR process_vm_readv delivered 0 instead of 24304, error message Bad address
[1616114675.616762] [tartarus:31858:0]         cma_ep.c:113  UCX  ERROR process_vm_readv delivered 0 instead of 24304, error message Bad address
[1616114675.616774] [tartarus:31858:0]         cma_ep.c:113  UCX  ERROR process_vm_readv delivered 0 instead of 24304, error message Bad address
[1616114675.616785] [tartarus:31858:0]         cma_ep.c:113  UCX  ERROR process_vm_readv delivered 0 instead of 27776, error message Bad address
[1616114675.616797] [tartarus:31858:0]         cma_ep.c:113  UCX  ERROR process_vm_readv delivered 0 instead of 27776, error message Bad address
[1616114675.616810] [tartarus:31858:0]         cma_ep.c:113  UCX  ERROR process_vm_readv delivered 0 instead of 27776, error message Bad address
[1616114675.690988] [tartarus:31858:0]         cma_ep.c:113  UCX  ERROR process_vm_readv delivered 0 instead of 24304, error message Bad address
[1616114676.136247] [tartarus:31858:0]         cma_ep.c:113  UCX  ERROR process_vm_readv delivered 0 instead of 24304, error message Bad address

jipolanco commented 3 years ago

@jipolanco it looks like a dimension permutation test works ( updated code here https://github.com/christophernhill/PencilArrays.jl/blob/02a648481fdd5273b4f97d18a848690599dbc3b5/gpu-experimenting/p1.jl#L38 ).

@christophernhill Awesome, apparently there's not much work to do after all.

gather() seems to have problems (see error message below). It looks like a receive problem that deadlocks with a bad address. I am not sure if it is PencilArrays or some lower level that has the problem. Looking at the gather code, it's not clear to me that it should fail.

Your latest commit seems to confirm that it's a lower level issue, as gather seems to work with a different MPI implementation. Is that right?

I don't think gather is critical/needed for our FFT use case, so generally things still look good. Also, broadcasting to a local CPU copy seems to work ( https://github.com/christophernhill/PencilArrays.jl/blob/02a648481fdd5273b4f97d18a848690599dbc3b5/gpu-experimenting/p1.jl#L76 ), so a hack could be to dispatch gather on CuArray, first do an intermediate local-to-rank CPU broadcast copy, and then dispatch to a CPU gather?

I agree, that would be a good solution. The only thing is, I would prefer not to have CUDA.jl as a hard dependency. But, taking inspiration from Oceananigans, I'm thinking that such a dispatch could be done more generically by defining CPU and GPU singleton types. Alternatively, we could use Requires to include CUDA-specific code.
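
Roughly what I have in mind (a sketch; all the names here are hypothetical, loosely following the Oceananigans approach):

    # Hypothetical device types to dispatch on, instead of depending on CuArray directly.
    abstract type AbstractDevice end
    struct CPU <: AbstractDevice end
    struct GPU <: AbstractDevice end

    # The generic package only needs to know which device the data lives on:
    device(::AbstractArray) = CPU()

    # CUDA-specific code, loaded lazily via Requires, could then add
    #     device(::CuArray) = GPU()
    # so that functions like gather can dispatch on device(parent(A))
    # without making CUDA.jl a hard dependency.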

christophernhill commented 3 years ago

@jipolanco it does look like the gather() issue is not present with MVAPICH2, only with OpenMPI. So it's most likely a lower-level stack issue.

Requires sounds worth a try. We can take a quick look if you like?

@jipolanco it may make sense for you to register something with https://github.com/JuliaGPU/buildkite at some point? That would give a way to potentially do some GPU CI against main PencilArrays using BuildKite. There are a bunch of registration instructions here https://github.com/JuliaGPU/buildkite .

jipolanco commented 3 years ago

Requires sounds worth a try. We can take a quick look if you like?

@christophernhill Of course, feel free to look into this. Note that I'm already using Requires to optionally load HDF5.jl as a parallel I/O backend, so I guess you could start from there.

@jipolanco it may make sense for you to register something with https://github.com/JuliaGPU/buildkite at some point? That would give a way to potentially do some GPU CI against main PencilArrays using BuildKite. There are a bunch of registration instructions here https://github.com/JuliaGPU/buildkite .

Yes, that would make sense. If I understand correctly, for this I should transfer PencilArrays to a GitHub organisation? I guess I could transfer it to JuliaParallel, or I could create my own organisation.

christophernhill commented 3 years ago

Yes, that would make sense. If I understand correctly, for this I should transfer PencilArrays to a GitHub organisation? I guess I could transfer it to JuliaParallel, or I could create my own organisation.

@jipolanco I believe you can register with Buildkite as an individual too; the "organization" part may be fuzzy language. @maleadt and/or @vchuravy may be able to clarify. In my opinion, Pencils would be a great fit/addition for JuliaParallel if you do decide to go that route.

maleadt commented 3 years ago

It's possible to add BuildKite CI to personal repos, but it's much easier when you use one of the already set-up organizations.

glwagner commented 3 years ago

Maybe JuliaArrays?

jipolanco commented 3 years ago

Maybe JuliaArrays?

Thanks for the suggestion. That would definitely make sense for PencilArrays, but I'm not sure if PencilFFTs (which should also be tested on GPUs at some point) would be a good fit there.

glwagner commented 3 years ago

Would JuliaMath be appropriate for PencilFFTs? It looks like FFTW.jl is hosted there.