npadmana / DistributedFFT


GPU version of the code #64

Open npadmana opened 4 years ago

npadmana commented 4 years ago

Elliot -- on my drive into work today, I found myself wondering what it would take to extend this code to use a GPU and cuFFT for the backend FFT.

We already use scratch space to do the YZ transform, so we would just need to allocate that via a CUDA call (which could use unified memory). We'd let the CUDA FFTs do all the heavy lifting (I took a quick look, and I think they have all the functionality we need), and we'd use the Chapel code on the host just to coordinate the data transfer.
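
To make that concrete, here's an untested sketch of what the cuFFT side might look like from Chapel, using extern procs and a unified-memory scratch buffer. The yzTransformOnGPU helper and the packing steps are made up; the constants are transcribed from cufft.h and the CUDA runtime headers, and the interop module names are as of Chapel ~1.20:

```chapel
use SysCTypes, CPtr;  // c_int, c_uint, size_t, c_void_ptr (names vary a bit by Chapel version)

// Minimal extern declarations for the CUDA runtime and cuFFT entry points
// we'd need; link with something like -lcudart -lcufft.
extern proc cudaMallocManaged(ref devPtr: c_void_ptr, size: size_t, flags: c_uint): c_int;
extern proc cudaDeviceSynchronize(): c_int;
extern proc cudaFree(devPtr: c_void_ptr): c_int;

// cufftHandle is a plain integer handle in cufft.h, so c_int works here.
extern proc cufftPlan1d(ref plan: c_int, nx: c_int, ftype: c_int, batch: c_int): c_int;
extern proc cufftExecZ2Z(plan: c_int, idata: c_void_ptr, odata: c_void_ptr, dir: c_int): c_int;
extern proc cufftDestroy(plan: c_int): c_int;

param CUFFT_Z2Z = 0x69;            // double-complex to double-complex, from cufft.h
param CUFFT_FORWARD = -1;
param cudaMemAttachGlobal = 0x01;

// Hypothetical: batched 1-D transforms over the YZ scratch pencils on one locale.
// Error checking on the returned status codes is omitted.
proc yzTransformOnGPU(n: int, batch: int) {
  var buf: c_void_ptr;
  const nbytes = (n * batch * 16): size_t;   // 16 bytes per complex(128) element
  cudaMallocManaged(buf, nbytes, cudaMemAttachGlobal: c_uint);

  // ... pack the pencils into buf (unified memory, so this is a host-side copy) ...

  var plan: c_int;
  cufftPlan1d(plan, n: c_int, CUFFT_Z2Z: c_int, batch: c_int);
  cufftExecZ2Z(plan, buf, buf, CUFFT_FORWARD: c_int);
  cudaDeviceSynchronize();

  // ... unpack the transformed pencils from buf ...

  cufftDestroy(plan);
  cudaFree(buf);
}
```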

Now, the dominant cost in these FFTs is the communication, but I'd hope we can overlap the inter-node data transfer with the transfers to and from the GPU (plus the FFT computation itself), and so we might end up ahead.

Thoughts?

(tagging @ronawho )

ronawho commented 4 years ago

Yeah, that would be really interesting.

To start I think it'd be simplest to just target cuFFT from a single node with comm=none, and then move on to figuring out what comm would look like.

I'm really not familiar with the GPU scene, but I know some GPUs support direct RDMA. So far as I know our comm layers can't take advantage of that yet, so we'd probably need to copy-to-gpu, call cuFFT, copy-to-host, perform RDMA on host.

ronawho commented 4 years ago

From a cursory reading of https://docs.nvidia.com/cuda/cufft/index.html, it looks like there is a cuFFTW library to make initial porting easier (though cuFFT itself is supposed to have better performance).
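
For what "initial porting" could mean on the Chapel side: since we already go through the FFTW interface, the Chapel code itself shouldn't need to change, only the library we link against. A rough, untested sketch (the link flags are my guess, and this assumes our calls stay within the subset of FFTW3 that cuFFTW implements):

```chapel
use FFTW;

config const n = 1024;
var data: [0..#n] complex(128);

// Same calls as with host FFTW; with cuFFTW the plan/execute work happens on
// the GPU, and in principle only the link line changes
// (e.g. -lcufftw -lcufft in place of -lfftw3).
var plan = plan_dft(data, FFTW_FORWARD, FFTW_ESTIMATE);
execute(plan);
destroy_plan(plan);
cleanup();
```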

bradcray commented 4 years ago

Hi Nikhil — Another Chapel user, @carcarah, has been calling from Chapel to GPU libraries this week and may have experiences to share. The general approach he's been taking is to create a .so library containing the GPU kernels that exports C interfaces, and then to call into and link against them in the normal way.

The initial drafts I saw on Monday were doing the CUDA memcpys to get data from host memory to GPU memory within the library routines themselves. I know he's made more progress with it since then, but haven't had a chance to check in and see how it's evolved yet.

In Monday's version, each locale would/could call into these GPU libraries simultaneously/independently (operating on node-local data only, obviously... in fact, in that first draft, the CUDA libraries were allocating the host memory as well, but one of the obvious directions he's taking it is to pass (local) Chapel arrays into the library routines). I'll let him chime in if he wants to share more.
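
For reference, the Chapel side of that pattern is pretty small. Here's a rough sketch, where the library name, the gpu_fft_yz entry point, and its signature are all made up for illustration (the real routine is whatever the .so exports):

```chapel
use SysCTypes, CPtr;  // c_int, c_ptr, c_ptrTo (module names vary by Chapel version)

// Hypothetical C entry point exported by the GPU library (say libgpufft.so);
// the library does the cudaMemcpy to/from the device internally.
// Compile with something like: chpl ... -L/path/to/lib -lgpufft
extern proc gpu_fft_yz(data: c_ptr(complex(128)),
                       nx: c_int, ny: c_int, nz: c_int): c_int;

// Transform a node-local 3-D block by handing the library a pointer to the
// Chapel array's data.
proc transformLocalBlock(ref A: [?D] complex(128)) {
  const (nx, ny, nz) = D.shape;
  const rc = gpu_fft_yz(c_ptrTo(A), nx: c_int, ny: c_int, nz: c_int);
  if rc != 0 then halt("gpu_fft_yz failed with code ", rc);
}
```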

npadmana commented 4 years ago

@bradcray, @carcarah - Any prior experiences would be great!

I think our problem is actually a lot simpler, since I'm not looking at writing any real CUDA kernels, just calling into library functions.

@ronawho -- yes, the cuFFTW approach is probably the first thing we should try.

tcarneirop commented 4 years ago

Hello Everyone.

Yesterday's version receives data from Chapel via C types, calls the backtracking kernel, and sends the data back to Chapel.

Yesterday I did not have time to run the sanity tests; I expect to run them today. After that, I'll provide more details.

Tiago Carneiro.

npadmana commented 4 years ago

Thinking about this a little, it reminds me of a feature I've wanted for a while, which is to treat a C-allocated block of memory as a local Chapel array. Of course, the domain would have to be user-specified, and the user would be responsible for making sure there was enough memory, for freeing it, etc. Mostly, I just want n-dimensional indexing and iterators (serial/leader/follower/standalone)...
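
To make the request concrete, here's a rough sketch of the record-based workaround I have in mind (untested; the names are made up, the syntax is Chapel 1.20-era with 1-based tuple/dimension indexing, and nothing here manages the memory):

```chapel
use SysCTypes, CPtr;  // c_ptr (module names vary by Chapel version)

// A thin, non-owning view of a C-allocated buffer with a user-supplied
// domain: n-dimensional indexing plus a serial iterator. The caller
// guarantees the buffer is large enough and frees it when done.
record CArrayView {
  param rank: int;
  type eltType;
  const dom: domain(rank);
  var data: c_ptr(eltType);

  // Row-major indexing computed from the user-supplied domain.
  proc this(idx: int ...rank) ref {
    var off = 0;
    for param d in 1..rank do
      off = off * dom.dim(d).size + (idx(d) - dom.dim(d).low);
    return data[off];
  }

  // Serial iterator over the flat (assumed-contiguous) buffer.
  iter these() ref {
    for i in 0..#dom.size do
      yield data[i];
  }
}

// e.g.  var V = new CArrayView(2, real, {0..#m, 0..#n}, p);  // p: c_ptr(real)
```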

bradcray commented 4 years ago

a feature I've wanted for a while, which is to treat a C-allocated block of memory as a local Chapel array

I think that'd be a reasonable feature request. I think it would not be difficult to implement and that the main challenges would be (a) defining what the interface should look like and (b) determining what the memory management rules are. That said, I believe that this would end up being similar to what's been recently added for creating Chapel strings from C buffers, so that might suggest a pattern to follow.

tcarneirop commented 4 years ago

Hello Everyone,

I'm glad to say that it works: I'm able to call a kernel from Chapel and retrieve the results. Chapel passes in the data to be processed by CUDA, along with several other parameters.

I've performed some sanity checks, and it looks like everything works fine.

Tiago Carneiro.

npadmana commented 4 years ago

@bradcray -- it turns out we discussed this on SO a while ago (but it never got captured as an issue).

https://stackoverflow.com/questions/53821252/best-way-to-wrap-a-c-array-pointer-as-a-chapel-array

Looking back at this, do you have a sense of what the best route might be? I'll note that for the cases I care about, the record route I tried would work pretty well assuming I could define the parallel iterators.
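
For concreteness, here's roughly the kind of standalone parallel iterator I'd want to add to the CArrayView sketch from my earlier comment (untested; it just chunks the flat buffer across the locale's tasks with RangeChunk, and leader/follower versions for zippered foralls would follow the same pattern):

```chapel
use RangeChunk;

// Standalone parallel iterator for the CArrayView sketch above: split the
// flat (assumed-contiguous) buffer into one chunk per task.
iter CArrayView.these(param tag: iterKind) ref
    where tag == iterKind.standalone {
  const numTasks = here.maxTaskPar;
  coforall tid in 0..#numTasks do
    for i in chunk(0..#dom.size, numTasks, tid) do
      yield data[i];
}
```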

bradcray commented 4 years ago

do you have a sense of what the best route might be?

I'd file the feature request and, if you're feeling particularly generous, propose an interface (which we'll then change and improve on, I imagine, but just to get things rolling from a design standpoint). The implementation should not be difficult, I don't think; my SO response was more about hacking something together without changes to the implementation, whereas I think the request is reasonable to promote to an actual feature.

For inspiration, here's a pointer to the string routines that I was referencing earlier:

https://chapel-lang.org/docs/master/builtins/String.html#String.createStringWithBorrowedBuffer

(where, as you noted, arrays would need to be different in order to specify a domain or something like that).

(PS — I thought I'd remembered it from somewhere but couldn't find it under GitHub issues... glad I wasn't crazy!)