mmp / pbrt-v4

Source code to pbrt, the ray tracer described in the forthcoming 4th edition of the "Physically Based Rendering: From Theory to Implementation" book.
https://pbrt.org
Apache License 2.0

Support for multiple GPU SM versions #32

Open mmp opened 4 years ago

mmp commented 4 years ago

pbrt currently reports "FATAL CUDA error: invalid device symbol" and dumps a stack trace if its GPU path is run on a GPU that doesn't support the SM version it was compiled for. If nothing else, that's a pretty obscure error message; it would be nice to say something more descriptive.

More generally, there's the question of whether the build should be improved so this doesn't happen. One option would be to just compile to PTX. Alternatively, the cmake/checkcuda.cu program currently reports a single SM version: that of the first GPU detected. If multiple GPUs were installed, we might compile for each of them. Or perhaps we should allow the user to specify one or more SM versions, so that they could build for multiple SM versions even if they didn't have corresponding GPUs in their system at the moment...

Building the GPU part of the system is fairly slow already, however, so it's not attractive to add more work to that phase of compilation...

pierremoreau commented 4 years ago

In CMake 3.18, CMAKE_CUDA_ARCHITECTURES was added, which gives the user a way to specify which architectures to compile for (CMake will automatically forward those values to any CUDA file being compiled). cmake/checkcuda.cu could be changed to return the SM versions of all detected GPUs, and that string would be used if CMAKE_CUDA_ARCHITECTURES was not defined by the user.

digital-pro commented 2 years ago

Has anyone done this successfully and have insights to share? We have dockerized pbrt-v4 for GPUs, and made it possible for our students and researchers to remotely render scenes on our servers that have GPUs. Which, by the way, is awesome, and puts us miles ahead of where we were with v3. However, some of our servers have GPUs with multiple architectures, and we've been unable to build a binary (and in turn a Docker image, since we Dockerize everything) that can run on any GPU architecture other than the primary GPU's. I've tried fiddling with CMAKE_CUDA_ARCHITECTURES and some other tweaks, but haven't gotten anything to work. Thanks! -- David Cardinal, Vistalab, Stanford

mmp commented 2 years ago

We have dockerized pbrt-v4 for GPUs, and made it possible for our students and researchers to remotely render scenes on our servers that have GPUs.

Cool!

If I do something like this:

% cmake -G Ninja -DPBRT_OPTIX7_PATH=~/optix-7.4.0 ~/pbrt-v4 -DPBRT_GPU_SHADER_MODEL=sm_60

I am able to build a binary that is, as far as I can tell, compiled with the flags to specify shader model 6.0. Is your issue being unable to compile with a specified shader model, unable to compile a single binary with multiple shader models, or finding that the binary is invalid in spite of the above?

FWIW I haven't been able to figure out how to compile a single binary that supports multiple shader models.

digital-pro commented 2 years ago

Matt -- Thanks!! It seems to be working. I can build our docker image on the same Linux server for both our 3070 and the 2080 ti's that we were lucky enough to have donated:) That means we have at least 3 GPUs live that people can render on, even if their personal machine is a low-end box. I'm especially happy, as my major area of interest is computational photography, so bursts of images are needed. 

No further info on how to make a single binary for multiple architectures. Do you think any of the new -arch flags could help with that, or are they not relevant for the pbrt compilation pipeline?

In any case, this is great progress. Thanks!

-- David

mmp commented 2 years ago

Great! As far as I can tell a single binary for multiple architectures should be possible via the "fat binary" functionality of nvcc, but I'm not sure how to wire that up with the cmake stuff. Another issue is that pbrt's OptiX kernels would need to be handled similarly, which I'm not sure how to do either. Anyway, something to hopefully be fixed someday, but glad you're set for now.

pierremoreau commented 2 years ago

From what I understand from the doc, specifying CMAKE_CUDA_ARCHITECTURES="70-real;72-real" will have CMake automatically forward those architectures to nvcc and build for those (I am guessing using its “fat binary” functionality). That would however require bumping to CMake 3.18, but I do not think it would be too difficult to emulate that feature on our own.
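As a sketch of how that fallback could be wired up in pbrt's top-level CMakeLists.txt once CMake 3.18 is required — DETECTED_CUDA_ARCHS here is a hypothetical variable standing in for the result of an extended checkcuda.cu that reports all installed GPUs:

```cmake
# If the user did not specify architectures, fall back to whatever
# checkcuda.cu detected; otherwise honor the user's list.
if (NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
  # e.g. DETECTED_CUDA_ARCHS might be "70;86" on a mixed-GPU server
  set(CMAKE_CUDA_ARCHITECTURES ${DETECTED_CUDA_ARCHS})
endif ()
# Note: "70-real;86-real" builds SASS only for those chips; plain
# "70;86" also embeds PTX, which newer GPUs can JIT-compile, giving
# forward compatibility at the cost of a slower first launch.
```

CMake then passes the corresponding -gencode flags to nvcc for every CUDA source, producing the fat binary automatically.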

Now for the OptiX kernels, compiling to multiple SM versions using CMake should not be too hard but I am not sure how the binaries would be specified for the applications to load them as expected.

mmp commented 2 years ago

Ah, that's helpful. It looks like CMake 3.18 was released in 2020, though, so requiring it would force many folks to manually upgrade, which is somewhat undesirable for everyone who doesn't need this functionality.

The OptiX kernels are basically compiled to PTX and then encoded as a big string that's stored in a global variable that's linked into the executable:

extern const unsigned char PBRT_EMBEDDED_PTX[];

That string is passed in to the OptiX API. So "all" that would be necessary there, I think, would be to do that step multiple times with different --gpu-architecture settings, give the variables unique names, and then choose the appropriate string at runtime for the architecture being used.

pierremoreau commented 2 years ago

How would the unique naming work? Would src/pbrt/gpu/aggregate.cpp need to contain something like the following?

extern "C" {
extern const unsigned char PBRT_EMBEDDED_PTX_SM30[];
extern const unsigned char PBRT_EMBEDDED_PTX_SM50[];
extern const unsigned char PBRT_EMBEDDED_PTX_SM60[];
extern const unsigned char PBRT_EMBEDDED_PTX_SM70[];
extern const unsigned char PBRT_EMBEDDED_PTX_SM72[];
extern const unsigned char PBRT_EMBEDDED_PTX_SM75[];
}

And then when calling createOptiXModule(), the right variable would be passed in?

mmp commented 2 years ago

Something like that. Come to think of it, another option might be for aggregate.cpp to have an extern std::map<std::string, const unsigned char *> archPTX declaration and to then attempt lookups in that with the selected GPU's architecture. Then the build could automatically generate a .cpp file that had something like:

extern const unsigned char PBRT_EMBEDDED_PTX_SM80[];
// ...

std::map<std::string, const unsigned char *> archPTX {
    { "sm80", PBRT_EMBEDDED_PTX_SM80 },
    // ...
};