Closed: oscarbg closed this issue 2 years ago
I'd happily accept a patch that added support for sm_52, but I'm not able to put one together myself (both due to time and to no longer having a Maxwell GPU). I'm happy to give feedback if you try it and run into any issues.
The GPU performance win comes both from OptiX accelerating the ray intersection tests and from running all of the other code (sampling, BRDF evaluation, lighting, MC integration, etc.) on the GPU, which generally offers more FLOPS and higher bandwidth than CPUs...
Hi, thanks for the info & the offer of help.. I think I'll try later, not now..
Just one question: are you using the builtin half2 type, or mostly scalar half variables or vectors of half, in the CUDA code? I ask because, as seen on this forum thread:
https://forums.developer.nvidia.com/t/poor-half-performance/111626:
There is no 16-bit register on the device. They are all 32 bits. To get double the performance of float, you need to use half2, where you package two half-precision variables in one register and then use the appropriate intrinsics.
so it might be easier to "emulate" half as float, it seems..
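For illustration, a minimal sketch of the half2 packing described there (a hypothetical kernel, assuming cuda_fp16.h and compute capability 5.3+; not code from pbrt):

#include <cuda_fp16.h>

// Two half values packed into one 32-bit register; a single __hadd2
// performs both additions (requires sm_53 or later).
__global__ void addHalf2(const half2 *a, const half2 *b, half2 *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __hadd2(a[i], b[i]);
}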
I believe that the only use of half in pbrt-v4 is for image maps (e.g. OpenEXRs used for HDR lighting). All of that is handled by the Half class in util/float.h. For sm_52 support it might just work to change the #ifdef PBRT_IS_GPU_CODE there to something like PBRT_GPU_HAS_HALF, and then have that set for > sm_52 and unset for the CPU and for sm_52. There are just a few of those #ifdefs, so it should be possible to quickly check and see if that does it.
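For instance, a sketch of what that guard could look like (PBRT_GPU_HAS_HALF is just the name suggested above; the __CUDA_ARCH__ threshold is an assumption for detecting sm_53 and newer, not pbrt's actual build logic):

// Sketch, not actual pbrt code: define the macro only where native
// half intrinsics are available (compute capability 5.3+).
#if defined(PBRT_IS_GPU_CODE) && defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
#define PBRT_GPU_HAS_HALF
#endif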
Hi @mmp, thanks for the info.. happy to report that a one-line change did the trick! (it was the only error):
bool operator==(const Half &v) const {
#ifdef PBRT_IS_GPU_CODE
    // return __ushort_as_half(h) == __ushort_as_half(v.h);
    return h == v.h;  // compare the raw 16-bit patterns instead of using half intrinsics
Of course, I also changed:
float min_cc = 5.3;
to
float min_cc = 5.2;
in checkcuda.cu.
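(For context, a sketch of the kind of minimum-compute-capability check that min_cc presumably feeds into; this is a hypothetical reconstruction, not the actual contents of checkcuda.cu:)

#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical: query the device and compare against the minimum.
int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return 1;
    float cc = prop.major + prop.minor / 10.f;  // e.g. 5.2 for a GTX 970
    float min_cc = 5.2f;                        // lowered from 5.3
    if (cc < min_cc) {
        fprintf(stderr, "compute capability %.1f < required %.1f\n", cc, min_cc);
        return 1;
    }
    return 0;
}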
With those changes, killeroo-simple and killeroo-gold rendered and are visually correct..
There is a pbrt_test binary with a suite of tests, but it seems CPU-only, right?
I can run some tests if you want to verify correctness, but it seems a harmless change anyway..
Hope you are willing to merge these simple changes..
thanks..
I've pushed a slight modification to that change that I believe will also take care of things; can you give it a try when you have a chance and let me know? (I'm also curious about what sort of performance you're seeing on a GPU from that era.)
I can confirm this change allows GPU mode to run on a GTX 970M.
As an admittedly crude comparison, but better than nothing, here's a quick test using book.pbrt from pbrt-v4-scenes:
pbrt --spp 64 pbrt-v4-scenes/pbrt-book/book.pbrt
Ubuntu 20.04 (RTX 3080, AMD Ryzen 9 3950X): CPU 41.2s, GPU 1.8s
Windows 10 (GTX 970M, Intel i7-6700HQ): CPU 581.6s, GPU 36.4s
As an unexplored anecdote: my first render on the GTX 970M crashed the NVIDIA display driver, but after updating from CUDA 11.0 to 11.4.3 and rebuilding pbrt, it ran without issue. I don't know if the crash was a coincidence, so don't read too much into it.
Thanks for the confirmation, and that's an interesting datapoint! For that scene the 970M GPU is about 13% faster than the 3950X CPU (36.4s vs. 41.2s), while on paper the 3950X has 3.6 TFLOPS peak and the GTX 970M has 2.7 TFLOPS peak, i.e. about 30% more for the CPU. So in this case raw processor FLOPS roughly tracks the runtime, to within a factor of 1.5 or so, regardless of the type of processor.
I'm not sure what conclusions, if any, to draw from that with regard to pbrt's efficiency, etc., but it's always interesting when the numbers land that close together.
Hi, thanks @mmp for making a "real" fix and merging it, and thanks @shadeops for testing and providing benchmark data.. I would provide data as well, but I should say that while everything worked correctly the day I made the fix, since then I've also had a few hangs testing other scenes.. I'd bet on OptiX runtime Maxwell bugs rather than pbrt, though.. I'm being lazy now, but eventually I'll build in debug mode, and if I can catch where the hang occurs, I'll update.. I think we can close this for now..
Hi,
I tried to build pbrt-v4 on a system with a GTX 970 (Maxwell GPU).. it seems unsupported, claiming an sm_53 minimum..
That's a little unfortunate, since these devices are still supported by OptiX 7.5 (the latest version): "All NVIDIA GPUs of Compute Capability 5.0 (Maxwell) or higher are supported"
I hacked the build and got lots of half errors in the CUDA code.. The question is: how much effort is needed to enable support on these devices?
I mean, it can't be a question of performance, as even sm_60 (Pascal) devices had abysmally low FP16 performance (many times slower than FP32), at least on the GeForce parts (not GP100).. a few FP16 units were added only for developers to play with FP16 early on..
Is all of pbrt's GPU performance gain vs. the CPU due to OptiX?
thanks..