rogerallen / raytracinginoneweekendincuda

The code for the ebook Ray Tracing in One Weekend by Peter Shirley translated to CUDA by Roger Allen. This work is in the public domain.

Modify curand_init for 2x performance improvement #2

Open rogerallen opened 5 years ago

rogerallen commented 5 years ago

Change

curand_init(1984, pixel_index, 0, &rand_state[pixel_index]);

to

curand_init(1984+pixel_index, 0, 0, &rand_state[pixel_index]);

for 2x speedup. Some info at: https://docs.nvidia.com/cuda/curand/device-api-overview.html#performance-notes

The first call has a fixed random seed, different sequence ids and a fixed offset into that sequence. That creates a different sequence per thread (more overhead).

The second call has different seeds and the same sequence & offset. I think this means it only generates one sequence for all the threads & the different seeds allow for enough randomness without so much overhead.

I had tried this out when I originally created the code, but read the instructions too quickly & messed up. I modified the 3rd parameter, not the first. Doh!
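
(For context, here is a minimal sketch of the per-pixel RNG init kernel with both variants side by side. The kernel and variable names follow the book's code, but treat it as an illustrative sketch rather than the exact repository source.)

#include <curand_kernel.h>

// Per-pixel RNG setup: each thread initializes one curandState.
__global__ void render_init(int max_x, int max_y, curandState *rand_state) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    if ((i >= max_x) || (j >= max_y)) return;
    int pixel_index = j * max_x + i;

    // Original: same seed for everyone, per-thread sequence id, zero offset.
    // cuRAND must skip ahead to a distinct subsequence for every thread,
    // which is the expensive part.
    // curand_init(1984, pixel_index, 0, &rand_state[pixel_index]);

    // Faster: per-thread seed, same sequence (0) and offset (0).
    // No subsequence skip-ahead, so initialization is much cheaper.
    curand_init(1984 + pixel_index, 0, 0, &rand_state[pixel_index]);
}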

artmortal93 commented 4 years ago

Hi there, I also found that using the old syntax, curand_init(1984, pixel_index, 0, &rand_state[pixel_index]);, generates an error that makes cudaDeviceSynchronize return error code 4, but after changing to the new syntax everything works fine. I am testing the code on a GTX 1060 on Windows; it seems related to this post: https://stackoverflow.com/questions/42607562/curand-init-produces-memory-access-errors

rogerallen commented 4 years ago

Thanks for the report. That Stack Overflow discussion concluded that the Nsight debugger was at fault. Is that the case for you as well?

r-gr commented 4 years ago

In my case, changing the original line to curand_init(1984+pixel_index, 0, 0, &rand_state[pixel_index]); causes a cudaErrorLaunchFailure (719) at run time in the cudaDeviceSynchronize() after the render<<<>>>() call.

The original version executes successfully without any issue.

Bizarrely, after changing the line back to the original version, curand_init(1984, pixel_index, 0, &rand_state[pixel_index]);, then recompiling and running, CUDA error 719 persists in the same place every time until I do one of a few things:

In the first two cases, the original version then runs successfully every time; in the latter two cases, reverting those changes results in the error again, even for the original version.

I have absolutely no idea what is going on here, but it seems as though the modified curand_init() call breaks something to do with the device's memory at the driver level, and rebooting or waiting for some background cleanup operation to run fixes whatever was broken.

Sorry I can't be more specific; I've only just started learning about this stuff and I'm not yet familiar with the debugging tools. I also normally do all my development work on Unix-y systems, but I've been doing this on Windows. Happy to dig further into this if anyone can point me in the right direction.

Windows 10, GTX 760 2GB, compiling with the Visual Studio 2019 CUDA tools.

rogerallen commented 4 years ago

@r-gr Since it seems related to the complexity of the shader, can you look at whether TDR timeouts might be the issue? See https://github.com/rogerallen/raytracinginoneweekendincuda/issues/7#issuecomment-632712533

An error 719 does seem rather serious and could indicate that this code doesn't work well on Kepler (GK10x) class chips? I personally have not tried one. The error codes are documented at https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html
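
(As an aside, a quick way to turn a numeric code like 719 into a readable name is to print it with cudaGetErrorName/cudaGetErrorString. The wrapper below is only an illustrative sketch, not necessarily the exact macro this repo uses.)

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Print the symbolic name and description for a CUDA error, then abort.
// cudaErrorLaunchFailure, for example, shows up as numeric code 719.
void check_cuda(cudaError_t result, const char *func, const char *file, int line) {
    if (result != cudaSuccess) {
        fprintf(stderr, "CUDA error %d (%s: %s) at %s:%d '%s'\n",
                static_cast<int>(result),
                cudaGetErrorName(result), cudaGetErrorString(result),
                file, line, func);
        cudaDeviceReset();
        exit(99);
    }
}

#define checkCudaErrors(val) check_cuda((val), #val, __FILE__, __LINE__)

// Usage after a kernel launch:
//   render<<<blocks, threads>>>(...);
//   checkCudaErrors(cudaGetLastError());
//   checkCudaErrors(cudaDeviceSynchronize());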

r-gr commented 4 years ago

@rogerallen Increasing the timeout does appear to allow the modified version to execute successfully. Thanks!

This might be outside the scope of this discussion but this finding immediately leads to the question: how might one go about writing software which utilises the GPU for computation but which is stable across the full range of compatible hardware?

When writing software to run on the CPU under an OS, it's taken for granted that no matter how complex and long-running the computations, the OS scheduler will have some fairness mechanism so that one process doesn't starve all the others of resources. Is there any such mechanism for the GPU?

In the context of this ray tracer, I could manually split the render<<<>>>() call into multiple calls, each covering a smaller block of pixels... But is there a programmatic way to split up GPU computations so that they run in reasonably sized chunks in order to best utilise the most powerful hardware while avoiding hitting the timeout limits on weaker hardware?
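
For illustration, a minimal sketch of that manual split might look like the following. render_band here is a made-up stand-in for the book's render kernel, with an extra row offset so each launch covers only a horizontal band of the image; the band height is just a tuning knob chosen to keep each launch under the watchdog limit.

#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the real render<<<>>>() kernel: each launch only covers
// rows [row_start, row_start + rows) of the framebuffer.
__global__ void render_band(float *fb, int max_x, int max_y,
                            int row_start, int rows) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = row_start + threadIdx.y + blockIdx.y * blockDim.y;
    if (i >= max_x || j >= row_start + rows || j >= max_y) return;
    int pixel_index = j * max_x + i;
    fb[pixel_index] = (float)pixel_index;  // real code would trace rays here
}

int main() {
    const int nx = 1200, ny = 800, band = 100;  // band height is the "tile size"
    float *fb;
    cudaMallocManaged(&fb, nx * ny * sizeof(float));

    dim3 threads(8, 8);
    for (int row = 0; row < ny; row += band) {
        int rows = (row + band <= ny) ? band : ny - row;
        dim3 blocks((nx + threads.x - 1) / threads.x,
                    (rows + threads.y - 1) / threads.y);
        render_band<<<blocks, threads>>>(fb, nx, ny, row, rows);
        // Synchronizing per band keeps each launch short enough to stay
        // under the display watchdog (TDR) on weaker GPUs.
        cudaDeviceSynchronize();
    }

    cudaFree(fb);
    return 0;
}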

Edit: reading the Blender documentation on GPU rendering, it seems that it really is a matter of hand-picking a sensible amount of work to send to the GPU at one time (the tile size). If I'm not just completely wrong and misunderstanding this, is anyone working on some kind of fairness-based work scheduling system for GPUs? Would such a system even make sense on existing hardware, or would there be issues like context switching being prohibitively expensive?

rogerallen commented 4 years ago

Glad to hear that helped.

Yeah, this is a rather involved discussion and answers will depend on details of your program. You are basically on the right track, but this seems like a better topic for NVIDIA CUDA Developer forums. https://forums.developer.nvidia.com/c/accelerated-computing/cuda/206

I will say that NVIDIA GPUs do have improved context switching capabilities since the Kepler days. E.g. https://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/10

rogerallen commented 3 years ago

Resolved in the ch12_where_next_cuda branch. Leaving open as this is not fixed in all branches.