Add support for CUDA 5 features: dynamic parallelism etc.

For object linking, NVIDIA has so far announced NVCC support only for CUBIN 
linking.  NVIDIA will also support PTX linking in the future.  At that time, 
Ocelot should be able to support it with only minor changes, so I plan to wait 
for that feature.  In the meantime, it is possible to 'link' PTX files 
together by simply concatenating them.
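
As a rough illustration of the concatenation workaround, here is a minimal C++ 
sketch.  The function names (concatenatePtx, isHeaderDirective) are 
hypothetical and not part of Ocelot.  It also assumes that repeated module 
headers (.version, .target, .address_size) should be dropped from every file 
after the first; that is my assumption, not something the workaround above 
specifies, and it only makes sense when all files were built for the same PTX 
version and target.  Symbol name clashes between files are not resolved.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Returns true if the line starts (after leading whitespace) with one of the
// module-level header directives found at the top of a PTX file.
static bool isHeaderDirective(const std::string& line) {
    static const std::string directives[] = {
        ".version", ".target", ".address_size"};
    const std::size_t start = line.find_first_not_of(" \t");
    if (start == std::string::npos) return false;
    for (const std::string& directive : directives) {
        if (line.compare(start, directive.size(), directive) == 0) return true;
    }
    return false;
}

// Hypothetical helper: 'links' PTX modules by concatenating their text.
// ASSUMPTION: header directives are kept from the first module only, so the
// result contains a single module header.
std::string concatenatePtx(const std::vector<std::string>& modules) {
    std::ostringstream out;
    for (std::size_t i = 0; i < modules.size(); ++i) {
        std::istringstream in(modules[i]);
        std::string line;
        while (std::getline(in, line)) {
            if (i > 0 && isHeaderDirective(line)) continue;
            out << line << '\n';
        }
    }
    return out.str();
}
```

The resulting string could then be written to a single .ptx file and loaded 
as one module.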

Dynamic parallelism should be supported by default on NVIDIA devices, since 
the device code contains the kernel launch and interacts directly with the GPU 
driver.

There is experimental support for asynchronous dynamic parallelism in the 
emulator and LLVM backend via a user-level library that simply calls cudaLaunch 
from a separate user-level pthread.  We plan to move this functionality into 
the CUDA runtime.  We also need support for synchronous dynamic parallelism, 
which should be relatively easy but still needs some implementation work.  
Both of these also need unit tests.
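
The general shape of launching from a separate thread can be sketched as 
follows.  This is not the Ocelot library itself: AsyncLauncher and its methods 
are hypothetical names, std::thread stands in for a raw pthread, and an 
arbitrary callable stands in for the cudaLaunch call so the sketch does not 
depend on the CUDA runtime.  launch() returns immediately (the asynchronous 
case), and calling synchronize() right after it gives a simple model of the 
synchronous case.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical sketch: queues child launches and runs them on a worker thread.
class AsyncLauncher {
public:
    AsyncLauncher() : worker_([this] { run(); }) {}

    ~AsyncLauncher() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        worker_.join();  // the worker drains any remaining launches first
    }

    // Asynchronous launch: enqueue the work and return immediately.
    // ASSUMPTION: 'kernel' wraps whatever call actually performs the launch.
    void launch(std::function<void()> kernel) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push_back(std::move(kernel));
        }
        wake_.notify_all();
    }

    // Blocks until every launch queued so far has finished.
    void synchronize() {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return pending_.empty() && !busy_; });
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock, [this] { return stopping_ || !pending_.empty(); });
            if (pending_.empty()) return;  // stopping and nothing left to run
            std::function<void()> kernel = std::move(pending_.front());
            pending_.pop_front();
            busy_ = true;
            lock.unlock();
            kernel();  // run the launch without holding the lock
            lock.lock();
            busy_ = false;
            if (pending_.empty()) idle_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::condition_variable idle_;
    std::deque<std::function<void()>> pending_;
    bool stopping_ = false;
    bool busy_ = false;
    std::thread worker_;  // declared last so the other members exist first
};
```

A unit test for the asynchronous path could queue several launches, call 
synchronize(), and then check that every launch ran.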

I'm not sure what the status is on the AMD backend.

Original comment by SolusStu...@gmail.com on 30 May 2012 at 3:57

Added labels: Priority-High, Type-Enhancement
Removed labels: Priority-Medium, Type-Defect

nguyenminhduc9988 / gpuocelot

Add support for CUDA 5 features: dynamic parallelism etc. #68