robeverest / cufft

BSD 3-Clause "New" or "Revised" License

example for R2C transform #11

Open amigalemming opened 9 years ago

amigalemming commented 9 years ago

I am trying to write an accelerate wrapper for real-to-complex and complex-to-real transforms. While complex-to-complex transforms work perfectly, the real-to-complex transform aborts with CUFFT Exception: failed to execute an FFT on the GPU. I tried to track the problem using ltrace, but the call to cufftExecR2C is not detected by ltrace. (Incidentally, the call to cufftExecC2C is likewise missing from the trace of a working complex-to-complex transform.) So may I ask you to write a minimal example (without accelerate) that performs a real-to-complex transform? I am currently trying to write my own, but it fails with CUDA Exception: initialization error.

amigalemming commented 9 years ago

Here is what I tried:

module Main where

import qualified Foreign.CUDA.Driver as CUDA
import qualified Foreign.CUDA.FFT as CUFFT

main :: IO ()
main =
   CUDA.withListArrayLen [1,0,0,0] $ \width inp -> do
      out <- CUDA.mallocArray ((div width 2 + 1) * 2)
      h <- CUFFT.plan1D width CUFFT.R2C 1
      CUFFT.execR2C h inp out

Some initialization seems to be missing. It is not mentioned in the corresponding CUFFT example. I tried to extract it from accelerate-cuda, but this seems to become complicated. I also do not know how to ensure complex-compliant memory alignment for the real array, as required by the CUFFT docs.

robeverest commented 9 years ago

I was able to get your example to work by importing Foreign.CUDA.Runtime (or just Foreign.CUDA) instead of Foreign.CUDA.Driver. The Driver API is intended to be lower level, so it requires you to do all the initialisation yourself.

As for proper alignment, I believe cudaMalloc guarantees that returned pointers are aligned at 256-byte boundaries, so that shouldn't be a problem.
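
For concreteness, here is a sketch of the same example with only the import changed, as described above (untested in this exact form):

module Main where

import qualified Foreign.CUDA.Runtime as CUDA   -- Runtime API: implicit initialisation
import qualified Foreign.CUDA.FFT as CUFFT

main :: IO ()
main =
   CUDA.withListArrayLen [1,0,0,0] $ \width inp -> do
      out <- CUDA.mallocArray ((div width 2 + 1) * 2)
      h <- CUFFT.plan1D width CUFFT.R2C 1
      CUFFT.execR2C h inp out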

tmcdonell commented 9 years ago

To initialise with Foreign.CUDA.Driver:

initialise []
dev <- device 0
ctx <- create dev []

At the end you should probably destroy ctx as well, but (I am pretty sure) you can leave it to the final cleanup when the host process dies and everything gets shut down.
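
Putting that together with the original example, a minimal Driver-API sketch could look like this (context teardown omitted, as noted above):

module Main where

import qualified Foreign.CUDA.Driver as CUDA
import qualified Foreign.CUDA.FFT as CUFFT

main :: IO ()
main = do
   -- explicit initialisation required by the Driver API
   CUDA.initialise []
   dev <- CUDA.device 0
   _ctx <- CUDA.create dev []
   CUDA.withListArrayLen [1,0,0,0] $ \width inp -> do
      out <- CUDA.mallocArray ((div width 2 + 1) * 2)
      h <- CUFFT.plan1D width CUFFT.R2C 1
      CUFFT.execR2C h inp out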

amigalemming commented 9 years ago

Using Foreign.CUDA.Runtime works nicely. Now I want to use this from accelerate. Creating a cufft plan within CUDA.run1 works:

module Main where

import qualified Data.Array.Accelerate.CUDA.Foreign as AF
import qualified Data.Array.Accelerate.CUDA as CUDA

import qualified Foreign.CUDA.FFT as CUFFT

import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate (Acc, Vector, Z(Z), (:.)((:.)), )

transformForeign :: Vector Float -> AF.CIO (Vector Float)
transformForeign input =
   let (Z:.inlen) = A.arrayShape input
       outlen = (div inlen 2 + 1) * 2
   in  do
       output <- AF.allocateArray (Z:.outlen)
       ((), iptr) <- AF.devicePtrsOfArray input
       ((), optr) <- AF.devicePtrsOfArray output
       AF.liftIO $ do
          h <- CUFFT.plan1D inlen CUFFT.R2C 1
          CUFFT.execR2C h iptr optr
       return output

transform :: Acc (Vector Float) -> Acc (Vector Float)
transform =
   A.foreignAcc
      (AF.CUDAForeignAcc "transformForeign" transformForeign)
      (error "no fft fallback implemented")

main :: IO ()
main =
   print $ CUDA.run1 transform $ A.fromList (Z:.5) [1,0,0,0,0]

Now I would like to create the plan outside of CUDA.run1 in order to share it between multiple calls to execR2C. I tried the manual initialization suggested by tmcdonell, but it fails with CUFFT Exception: failed to execute an FFT on the GPU:

module Main where

import qualified Data.Array.Accelerate.CUDA.Foreign as AF
import qualified Data.Array.Accelerate.CUDA as CUDA

import qualified Foreign.CUDA.FFT as CUFFT

import qualified Foreign.CUDA.Driver as Driver

import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate (Acc, Vector, Z(Z), (:.)((:.)), )

transformForeign :: CUFFT.Handle -> Vector Float -> AF.CIO (Vector Float)
transformForeign h input =
   let (Z:.inlen) = A.arrayShape input
       outlen = (div inlen 2 + 1) * 2
   in  do
       output <- AF.allocateArray (Z:.outlen)
       ((), iptr) <- AF.devicePtrsOfArray input
       ((), optr) <- AF.devicePtrsOfArray output
       AF.liftIO $ CUFFT.execR2C h iptr optr
       return output

transform :: CUFFT.Handle -> Acc (Vector Float) -> Acc (Vector Float)
transform h =
   A.foreignAcc
      (AF.CUDAForeignAcc "transformForeign" $ transformForeign h)
      (error "no fft fallback implemented")

main :: IO ()
main = do
   Driver.initialise []
   dev <- Driver.device 0
   ctx <- Driver.create dev []

   let inlen = 5
   h <- CUFFT.plan1D inlen CUFFT.R2C 1
   print $ CUDA.run1 (transform h) $ A.fromList (Z:.inlen) [1,0,0,0,0]

robeverest commented 9 years ago

The reason you're getting this error is that plan1D is being called from a different CUDA context than execR2C. You're creating a context and setting it as the current context with Driver.create, but Accelerate uses its own context in run1.

The best way to guarantee that your plan* functions will be called in the right CUDA context is to use inDefaultContext from Data.Array.Accelerate.CUDA.Foreign.

inDefaultContext :: CIO a -> IO a

This executes the given CIO action under the same context in which Accelerate runs its computations. You can then use liftIO to lift IO into CIO.

So your main would become:

main :: IO ()
main = do
   let inlen = 5
   h <- AF.inDefaultContext $ AF.liftIO $ CUFFT.plan1D inlen CUFFT.R2C 1
   print $ CUDA.run1 (transform h) $ A.fromList (Z:.inlen) [1,0,0,0,0]

amigalemming commented 9 years ago

On Wed, 29 Oct 2014, robeverest wrote:

The reason you're getting this error is that plan1D is being called from a different CUDA context than execR2C. You're creating a context and setting it as the current context with Driver.create, but Accelerate uses its own context in run1.

The best way to guarantee that your plan* functions will be called in the right CUDA context is to use inDefaultContext from Data.Array.Accelerate.CUDA.Foreign.

inDefaultContext :: CIO a -> IO a

This executes the given CIO action under the same context in which Accelerate runs its computations. You can then use liftIO to lift IO into CIO.

So your main would become:

main :: IO ()
main = do
   let inlen = 5
   h <- AF.inDefaultContext $ AF.liftIO $ CUFFT.plan1D inlen CUFFT.R2C 1
   print $ CUDA.run1 (transform h) $ A.fromList (Z:.inlen) [1,0,0,0,0]

I see. I want to provide the 'plan' and 'exec' functions (separately) in a library. Your solution would mean that plans are always created in the default context, and thus the 'exec' function is bound to the default context, too. Is there a way to give the user more flexibility and still ensure that plans and execs are performed in the same context? I guess this leads to a tagging style like in the ST monad. A somewhat simpler solution could be to provide the 'plan' function only in the CIO monad. This way the programmer has to decide actively in which context to do the planning.
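
A rough sketch of that CIO-only variant (the name plan1DR2C is hypothetical, and the qualified imports are the same as in the examples above): because the action lives in CIO, it runs in whatever context Accelerate has made current, so the programmer picks the context by choosing where the CIO action is executed.

plan1DR2C :: Int -> AF.CIO CUFFT.Handle
plan1DR2C len = AF.liftIO $ CUFFT.plan1D len CUFFT.R2C 1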

amigalemming commented 9 years ago

On Wed, 29 Oct 2014, robeverest wrote:

The best way to guarantee that your plan* functions will be called in the right CUDA context is to use inDefaultContext from Data.Array.Accelerate.CUDA.Foreign.

inDefaultContext :: CIO a -> IO a

This executes the given CIO action under the same context in which Accelerate runs its computations. You can then use liftIO to lift IO into CIO.

Unfortunately, it does not run CIO in IO, but runs an IO action: inDefaultContext :: IO a -> IO a
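
With that type, the planning line in the suggested main would presumably just drop the AF.liftIO, e.g. (untested sketch, reusing the transform definition from the example above):

main :: IO ()
main = do
   let inlen = 5
   -- plan in Accelerate's default context; no liftIO, since
   -- inDefaultContext :: IO a -> IO a
   h <- AF.inDefaultContext $ CUFFT.plan1D inlen CUFFT.R2C 1
   print $ CUDA.run1 (transform h) $ A.fromList (Z:.inlen) [1,0,0,0,0]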

robeverest commented 9 years ago

Unfortunately, it does not run CIO in IO, but runs an IO action: inDefaultContext :: IO a -> IO a

Oh sorry, yes. I was thinking back to a previous design.

I see. I want to provide the 'plan' and 'exec' functions (separately) in a library. Your solution would mean that plans are always created in the default context, and thus the 'exec' function is bound to the default context, too. Is there a way to give the user more flexibility and still ensure that plans and execs are performed in the same context? I guess this leads to a tagging style like in the ST monad. A somewhat simpler solution could be to provide the 'plan' function only in the CIO monad. This way the programmer has to decide actively in which context to do the planning.

I'm not sure I follow. While I can understand why you would want to separate the planning and execution, I don't see why you can't just have the 'plan' function return the 'exec' function. For example, supposing you want to support different contexts, you could have something like this:

plan1DR2C :: Context -> Int -> IO (DevicePtr Float -> DevicePtr Float -> IO ())

This is basically how we do it in accelerate-fft. Calling the fft* functions with just the size arguments gives you back an accelerate function that you can use many times, all under the same plan.
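
A rough sketch of how such a function might be implemented (plan1DR2C is not part of this package; it assumes the returned action is later run while the same context is current, e.g. via CUDA.run1In ctx):

import qualified Foreign.CUDA.Driver.Context as Context
import Foreign.CUDA.Ptr (DevicePtr)
import qualified Foreign.CUDA.FFT as CUFFT
import Control.Exception (bracket_)

plan1DR2C :: Context.Context -> Int -> IO (DevicePtr Float -> DevicePtr Float -> IO ())
plan1DR2C ctx len = do
   -- create the plan while ctx is current, then restore the previous context
   h <- bracket_ (Context.push ctx) Context.pop
           $ CUFFT.plan1D len CUFFT.R2C 1
   -- the returned action is tied to that plan and must be invoked
   -- while ctx is the current context again
   return $ \inp out -> CUFFT.execR2C h inp out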

amigalemming commented 9 years ago

On Thu, 30 Oct 2014, robeverest wrote:

I'm not sure I follow. While I can understand why you would want to separate the planning and execution, I don't see why you can't just have the 'plan' function return the 'exec' function.

That's what I am already doing. But it becomes difficult to ensure that the exec function will be run in the same context as the plan.

For example, supposing you want to support different contexts, you could have something like this:

plan1DR2C :: Context -> Int -> IO (DevicePtr Float -> DevicePtr Float -> IO ())

The returned 'exec' function will be run within a CUDA.run1 call, and then it must be ensured that it runs in the same context in which the plan was created.

In accelerate-fft you pull the planning into the context of the exec by using unsafePerformIO. What happens if you share a partially applied call to 'fft' and run it in different contexts with CUDA.run1In? E.g.

let transform = fft shape in CUDA.run1In ctx0 transform ... CUDA.run1In ctx1 transform

?

robeverest commented 9 years ago

In accelerate-fft you pull the planning into the context of the exec by using unsafePerformIO. What happens if you share a partially applied call to 'fft' and run it in different contexts with CUDA.run1In? E.g.

let transform = fft shape in CUDA.run1In ctx0 transform ... CUDA.run1In ctx1 transform

?

Ah, I see what you mean. Currently that would fail.

I've been meaning for a while to change accelerate-fft so that it uses a proper cache of plans instead of the partial application trick. That would fix this problem. It's similar to the trick we use for keeping track of arrays on the device: we have a hash table where the keys are weak pointers to the host arrays and the values are device pointers. The idea is that we can deallocate a device array when the host array is garbage collected.

I was thinking we could do the same with CUFFT plans: have a table where the key is a tuple consisting of the dimensions of the array and a weak pointer to the context. That would solve the problem of multiple contexts (running the same computation under a different context would result in a new plan being created). That may result in plans being kept alive much longer than they need to be, but I don't know of any current application we have where that would be an issue. Unless you have one?

The other advantage of this is that it would remove the need to pass in the dimensions of the array as a separate argument to the fft* functions.
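
For illustration, a very rough sketch of such a cache, simplified to key only on the transform length (the actual proposal would also key on a weak pointer to the context, and none of these names exist in accelerate-fft):

import qualified Data.Map as Map
import Data.IORef (IORef, newIORef, readIORef, modifyIORef')
import System.IO.Unsafe (unsafePerformIO)
import qualified Foreign.CUDA.FFT as CUFFT

-- global table of 1D R2C plans, keyed by transform length
{-# NOINLINE planCache #-}
planCache :: IORef (Map.Map Int CUFFT.Handle)
planCache = unsafePerformIO (newIORef Map.empty)

-- Return a cached plan for the given length, creating it on first use.
-- Must run while the intended CUDA context is current; a full version
-- would also key the table on (a weak pointer to) that context.
getPlan1DR2C :: Int -> IO CUFFT.Handle
getPlan1DR2C len = do
   cache <- readIORef planCache
   case Map.lookup len cache of
      Just h  -> return h
      Nothing -> do
         h <- CUFFT.plan1D len CUFFT.R2C 1
         modifyIORef' planCache (Map.insert len h)
         return h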

amigalemming commented 9 years ago

On Thu, 30 Oct 2014, robeverest wrote:

I was thinking we could do the same with CUFFT plans: have a table where the key is a tuple consisting of the dimensions of the array and a weak pointer to the context. That would solve the problem of multiple contexts (running the same computation under a different context would result in a new plan being created). That may result in plans being kept alive much longer than they need to be, but I don't know of any current application we have where that would be an issue. Unless you have one?

For my patch-image program this would not be a good idea. I already have to shrink the images in order to perform a convolution within my available GPU memory. Additionally, I have experimented with determining the image orientation using a Fourier-based Radon transformation. This would require a custom plan for every image, and each plan would be used only once.

To my surprise, the CUFFT documentation says that a plan consumes memory proportional to the whole image (section 2.2, Fourier Transform Setup), not only to the image edge size. Thus the memory consumption for plans can be considerable.