rogerallen opened this issue 5 years ago
Hi there, I also found that using the old syntax:
curand_init(1984, pixel_index, 0, &rand_state[pixel_index]);
generates an error that makes cudaDeviceSynchronize() return error code 4, but after changing to the new syntax everything works fine. I am testing the code on a GTX 1060 on Windows; it seems related to this post:
https://stackoverflow.com/questions/42607562/curand-init-produces-memory-access-errors
Thanks for the report. That stackoverflow discussion resolved that nSight debugger was at fault. Is that your case as well?
In my case, changing the original line to
curand_init(1984+pixel_index, 0, 0, &rand_state[pixel_index]);
causes a cudaErrorLaunchFailure (719) at run time in the cudaDeviceSynchronize() after the render<<<>>>() call.
The original version executes successfully without any issue.
Bizarrely, after changing the line back to the original version,
curand_init(1984, pixel_index, 0, &rand_state[pixel_index]);
recompiling and running, the CUDA error 719 persists in the same place every time until I do one of a few things:
In the first two cases, the original version then runs successfully every time; in the latter two cases, reverting those changes results in the error again, even for the original version.
I have absolutely no idea what is going on here, but it seems like the modified version of the curand_init() call breaks something to do with the device's memory at the driver level, so rebooting or waiting for some background cleanup operation to run fixes whatever was broken.
Sorry I can't be more specific, I've only just started learning about this stuff and I'm not yet familiar with the debugging tools etc. I also normally do all development work on Unix-y systems but I've been doing this on Windows. Happy to dig further into this if anyone can point me in the right direction.
Windows 10, GTX 760 2GB, compiling with the Visual Studio 2019 CUDA tools.
@r-gr Since it seems related to the complexity of the shader, can you look at whether TDR timeouts might be the issue? See https://github.com/rogerallen/raytracinginoneweekendincuda/issues/7#issuecomment-632712533
An error 719 does seem rather serious and could indicate that this code doesn't work well on Kepler (GK10x)-class chips? I personally have not tried. https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html
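For reference, a minimal sketch of the error-checking pattern that surfaces these codes, modeled on the checkCudaErrors macro in the book's code (the exact helper shown here is illustrative):

```cuda
#include <cstdio>
#include <cstdlib>

// Illustrative helper, modeled on the book's checkCudaErrors macro:
// report the CUDA error (e.g. 719 = cudaErrorLaunchFailure) and abort.
#define checkCudaErrors(val) check_cuda((val), #val, __FILE__, __LINE__)

void check_cuda(cudaError_t result, const char *func, const char *file, int line) {
    if (result != cudaSuccess) {
        fprintf(stderr, "CUDA error = %u (%s) at %s:%d '%s'\n",
                (unsigned)result, cudaGetErrorString(result), file, line, func);
        cudaDeviceReset();  // flush pending output before exiting
        exit(99);
    }
}

// Usage after a kernel launch: catch both launch and run-time failures.
// render<<<blocks, threads>>>(...);
// checkCudaErrors(cudaGetLastError());      // launch-configuration errors
// checkCudaErrors(cudaDeviceSynchronize()); // asynchronous errors like 719
```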
@rogerallen Increasing the timeout does appear to allow the modified version to execute successfully. Thanks!
This might be outside the scope of this discussion but this finding immediately leads to the question: how might one go about writing software which utilises the GPU for computation but which is stable across the full range of compatible hardware?
When writing software to run on the CPU under an OS, it's taken for granted that no matter how complex and long-running the computations, the OS scheduler will have some fairness mechanism so that one process doesn't starve all the others of resources. Is there any such mechanism for the GPU?
In the context of this ray tracer, I could manually split the render<<<>>>() call into multiple calls, each covering a smaller block of pixels... But is there a programmatic way to split up GPU computations so that they run in reasonably sized chunks, in order to best utilise the most powerful hardware while avoiding the timeout limits on weaker hardware?
Edit: reading the Blender documentation on GPU rendering, it seems to imply that it's really a matter of hand-picking a sensible amount of computations to send the GPU at one time (the tile size). If I'm not just completely wrong and misunderstanding this, is anyone working on some kind of fairness-based work scheduling system for GPUs? Would such a system even make sense in existing hardware or would there be issues like context switching being prohibitively expensive?
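As a sketch of the manual-splitting idea: the render kernel signature, tile size, and offset parameters below are assumptions, not the book's exact code, but the shape of a tiled launch that keeps each kernel under the TDR watchdog limit would look something like this:

```cuda
// Hypothetical tiled launch. Each tile is a separate kernel launch, so the
// driver's TDR watchdog timer covers one tile at a time rather than the
// whole frame. The tx/ty offset parameters to render() are assumed; the
// kernel would need them added so each launch writes only its own tile.
const int tile_w = 256, tile_h = 256;  // assumed tile size; tune per GPU
dim3 threads(8, 8);
for (int ty = 0; ty < ny; ty += tile_h) {
    for (int tx = 0; tx < nx; tx += tile_w) {
        int w = min(tile_w, nx - tx);
        int h = min(tile_h, ny - ty);
        dim3 blocks((w + threads.x - 1) / threads.x,
                    (h + threads.y - 1) / threads.y);
        render<<<blocks, threads>>>(fb, nx, ny, tx, ty, w, h /* , ... */);
        checkCudaErrors(cudaGetLastError());
        checkCudaErrors(cudaDeviceSynchronize());  // bound each launch's runtime
    }
}
```

The synchronize inside the loop is what bounds the watchdog exposure; dropping it would let launches queue up and defeat the purpose.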
Glad to hear that helped.
Yeah, this is a rather involved discussion and answers will depend on details of your program. You are basically on the right track, but this seems like a better topic for NVIDIA CUDA Developer forums. https://forums.developer.nvidia.com/c/accelerated-computing/cuda/206
I will say that NVIDIA GPUs do have improved context switching capabilities since the Kepler days. E.g. https://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/10
Resolved in the ch12_where_next_cuda branch. Leaving open as this is not fixed in all branches.
Change
curand_init(1984, pixel_index, 0, &rand_state[pixel_index]);
to
curand_init(1984+pixel_index, 0, 0, &rand_state[pixel_index]);
for 2x speedup. Some info at: https://docs.nvidia.com/cuda/curand/device-api-overview.html#performance-notes
The first call has a fixed random seed, different sequence ids and a fixed offset into that sequence. That creates a different sequence per thread (more overhead).
The second call has different seeds and the same sequence & offset. I think this means it only generates one sequence for all the threads & the different seeds allow for enough randomness without so much overhead.
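Concretely, the two initialisation strategies look like this inside a per-pixel init kernel (the render_init name matches the book's code, but the launch-shape details here are a sketch):

```cuda
#include <curand_kernel.h>

// Illustrative per-pixel RNG initialisation kernel.
__global__ void render_init(int max_x, int max_y, curandState *rand_state) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    if ((i >= max_x) || (j >= max_y)) return;
    int pixel_index = j * max_x + i;

    // Slow: one seed, a different subsequence id per thread. CURAND must
    // skip ahead to a well-separated point in the stream for each thread,
    // which is expensive setup work.
    // curand_init(1984, pixel_index, 0, &rand_state[pixel_index]);

    // Fast: a different seed per thread, subsequence 0, offset 0. Much
    // cheaper setup; independence between threads rests only on the seeds,
    // which is adequate for a ray tracer.
    curand_init(1984 + pixel_index, 0, 0, &rand_state[pixel_index]);
}
```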
I had tried this out when I originally created the code, but read the instructions too quickly & messed up. I modified the 3rd parameter, not the first. Doh!