Double precision is not working... Optix BVH ?

TiernoGs commented 1 year ago

Hello,

I've been trying to get the double precision version of PBRT working and I keep getting black rendered images. Some of these problems led me to issue #329, where the OptiX BVH, although correctly built, doesn't work correctly with certain driver / CUDA / OptiX version combinations...

To expose my problem, here is the dragon image rendered using PBRT with double precision:

dragon_double

And here is the image rendered with single (float) precision...

dragon_float

As previously said in some other post, rays are being launched without error but intersections are missing. Since the unified memory bug with triangles has been fixed, I think that this problem may not come from mismatching versions of CUDA and Optix but probably from the handling of double precision variables...

A few questions arise:

Does Optix support double precision ?
The BVH pointers are fine, is there an alignment issue with doubles ?
Can we test a different BVH ?
Is there something specific to the closesthit programs that double precision PBRT is missing ?

Tested in these configurations :

CUDA 11.3 - Optix 7.5 - RTX2060 SUPER
CUDA 11.7.1 - Optix 7.5 - RTX3080 Ti

Any help is welcome ! Thank you for all your work :)

Scene files and images: dragon.zip

Rendering profiles:

Wavefront Kernel Profile:
  Generate camera rays                                  1 launches     39.43 ms /   0.1% (avg 39.430, min 39.430, max  39.430)
  Generate ray samples - SobolSampler                   3 launches    103.93 ms /   0.4% (avg 34.643, min 21.593, max  47.909)
  Trace closest hit rays                                3 launches  19195.91 ms /  66.2% (avg 6398.637, min 2398.788, max 8424.610)
  DiffuseMaterial + BxDF eval (Basic tex)               2 launches    113.38 ms /   0.4% (avg 56.691, min 52.585, max  60.797)
  Trace shadow rays                                     2 launches   9518.85 ms /  32.8% (avg 4759.426, min 3293.162, max 6225.690)
  Other                                                13 launches     21.41 ms /   0.1% (avg  1.647)

Total rendering time:  28992.91 ms

Wavefront integrator statistics:
    Camera rays                                                    230400
    Indirect rays, depth 1                                         230400
    Indirect rays, depth 2                                          61621
    Shadow rays, depth 0                                           171277
    Shadow rays, depth 1                                            86217

pbrt4bounty commented 1 year ago

Hi, in my test here, removing #329 doesn't fix this issue.

pbrt4bounty commented 1 year ago

@TiernoGs can you try to render the scene shared by Matt in the #329 issue to check if this problem is also related to the use of .ply files?

# Cornell Box Scene
# Normal view
LookAt 0 56 308
    0 35 0
    0 1 0
# Zoom in on mirror ball
#LookAt 0 56 308  -23 17 50  0 1 0
#Camera "perspective" "float fov" [10]
# Zoom in on glass ball
#LookAt 0 56 308  23 17 85 0 1 0
#Camera "perspective" "float fov" [10]
#Integrator "whitted"
Integrator "volpath"
#Integrator "photonmap" "float maxdist" [10.0]
Camera "perspective"
    "float fov" [ 30 ]
Film "rgb"
    "integer yresolution" [ 384 ]
    "integer xresolution" [ 512 ]

WorldBegin

AttributeBegin
    Translate 0 84.99 85
    AreaLightSource "diffuse"
        "rgb L" [ 20 20 20 ]
    Material "diffuse"
        "rgb reflectance" [ 0 0 0 ]
    Shape "trianglemesh"
        "point3 P" [ -17 0 -17 17 0 -17 17 0 17 -17 0 17 ]
        "integer indices" [ 0 1 2 2 3 0 ]
AttributeEnd

AttributeBegin
    Material "dielectric"
        "float eta" [ 1.5 ]
    Translate 23 17 85
    Shape "sphere"
        "float radius" [ 17 ]
AttributeEnd

AttributeBegin
    Material "conductor"
        "spectrum k" [ "metal-Ag-k" ]
        "spectrum eta" [ "metal-Ag-eta" ]
        "float roughness" [ 0 ]
    Translate -23 17 50
    Shape "sphere"
        "float radius" [ 17 ]
AttributeEnd

AttributeBegin
    # ceiling
    #floor
    # back wall
    # behind camera wall
    # left wall
    # right wall
    Material "diffuse"
        "rgb reflectance" [ 0.85 0.85 0.85 ]
    Shape "trianglemesh"
        "point3 P" [ -50 85 0 50 85 0 50 85 350 -50 85 350 ]
        "integer indices" [ 0 1 2 2 3 0 ]
    Shape "trianglemesh"
        "point3 P" [ -50 0 0 -50 0 350 50 0 350 50 0 0 ]
        "integer indices" [ 0 1 2 2 3 0 ]
    Shape "trianglemesh"
        "point3 P" [ -50 85 0 -50 0 0 50 0 0 50 85 0 ]
        "integer indices" [ 0 1 2 2 3 0 ]
    Shape "trianglemesh"
        "point3 P" [ -50 85 350 50 85 350 50 0 350 -50 0 350 ]
        "integer indices" [ 0 1 2 2 3 0 ]
    Material "diffuse"
        "rgb reflectance" [ 0.9 0.6 0.6 ]
    Shape "trianglemesh"
        "point3 P" [ -50 0 0 -50 85 0 -50 85 350 -50 0 350 ]
        "integer indices" [ 0 1 2 2 3 0 ]
    Material "diffuse"
        "rgb reflectance" [ 0.6 0.6 0.9 ]
    Shape "trianglemesh"
        "point3 P" [ 50 0 350 50 85 350 50 85 0 50 0 0 ]
        "integer indices" [ 0 1 2 2 3 0 ]
AttributeEnd

Here this scene renders OK with PBRT_FLOAT_AS_DOUBLE=ON:

pbrt.exe --log-level "verbose" --log-file c:/apps/pbrt4/debug.txt --gpu --gpu-device 0 --spp 4

Cheers!

TiernoGs commented 1 year ago

I tested the scene with the same options (--gpu --gpu-device 0 --spp 4) but, strangely, I get error messages from OptiX. The render is still black in double precision and OK with floats.

Images were rendered in both .exr and .png file formats. Nothing changes: the PBRT_DBG messages (L[4] array) still print zero radiance.

Here is the shell output messages when running the double version in release:

( 253,   0,   0) error: encountered an invalid ray:
       origin:          [nan, nan, nan]
       direction:       [nan, nan, nan]
       tmin:            0.000000
       tmax:            1000000015047466219876688855040.000000
       rayTime:         0.000000
       call location:   No source location available. The input PTX may not contain debug information (nvcc option: -lineinfo), OptixModuleCompileOptions::debugLevel set to OPTIX_COMPILE_DEBUG_LEVEL_NONE, or no useful information is present for the current block.
WARNING: additional exceptions ignored
(16293,   0,   0) error: encountered an invalid ray:
(16307,   0,   0) error: encountered an invalid ray:
(16251,   0,   0) error: encountered an invalid ray:
(16210,   0,   0) error: encountered an invalid ray:
(16220,   0,   0) error: encountered an invalid ray:
       origin:          [nan, nan, nan]
       direction:       [nan, nan, nan]
       tmin:            0.000000
       tmax:            1000000015047466219876688855040.000000
       rayTime:         0.000000
       call location:   No source location available. The input PTX may not contain debug information (nvcc option: -lineinfo), OptixModuleCompileOptions::debugLevel set to OPTIX_COMPILE_DEBUG_LEVEL_NONE, or no useful information is present for the current block.

Double
Float

Here is the debug text file (looking just right): debug.txt

I tested rendering using the --wavefront option (with both versions): the test is OK.

I then decided to compile the debug version with the OPTIX_COMPILE_DEBUG_LEVEL_FULL debugLevel and run it under the compute-sanitizer. Here are the logs. Apparently, it seems to be coming from the EvaluateMaterialAndBSDFs kernel. compute-sanitizer.txt

And then, by running it on the debug version, I get these assertions.

D:\PBRT\src\pbrt/util/vecmath.h:268: block: [12,0,0], thread: [65,0,0] Assertion `!HasNaN()` failed.
D:\PBRT\src\pbrt/util/vecmath.h:268: block: [4,0,0], thread: [34,0,0] Assertion `!HasNaN()` failed.
D:\PBRT\src\pbrt/util/vecmath.h:268: block: [6,0,0], thread: [42,0,0] Assertion `!HasNaN()` failed.
D:\PBRT\src\pbrt/util/vecmath.h:268: block: [4,0,0], thread: [192,0,0] Assertion `!HasNaN()` failed.
D:\PBRT\src\pbrt/util/vecmath.h:268: block: [14,0,0], thread: [64,0,0] Assertion `!HasNaN()` failed.
D:\PBRT\src\pbrt/util/vecmath.h:268: block: [13,0,0], thread: [160,0,0] Assertion `!HasNaN()` failed.
D:\PBRT\src\pbrt/util/vecmath.h:268: block: [20,0,0], thread: [134,0,0] Assertion `!HasNaN()` failed.
D:\PBRT\src\pbrt/util/vecmath.h:268: block: [4,0,0], thread: [69,0,0] Assertion `!HasNaN()` failed.
D:\PBRT\src\pbrt/util/vecmath.h:268: block: [4,0,0], thread: [73,0,0] Assertion `!HasNaN()` failed.
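
For readers unfamiliar with this kind of assertion, here is a minimal sketch of what a NaN check on a 3-component value can look like. Tuple3Sketch, Add and CheckNoNaN are hypothetical names made up for this illustration; this is not the actual code at vecmath.h:268. It also shows why a single bad component is enough to trip the check downstream: NaN propagates through arithmetic.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical stand-in for a pbrt-style 3-component tuple (assumption:
// this is NOT the class defined in vecmath.h, only an illustration).
template <typename T>
struct Tuple3Sketch {
    T x, y, z;

    // True if any component is NaN.
    bool HasNaN() const { return std::isnan(x) || std::isnan(y) || std::isnan(z); }
};

// Component-wise sum. If either input has a NaN component, so does the result.
template <typename T>
Tuple3Sketch<T> Add(const Tuple3Sketch<T> &a, const Tuple3Sketch<T> &b) {
    return {a.x + b.x, a.y + b.y, a.z + b.z};
}

// Mirrors the spirit of a debug-build check: abort if the value is poisoned.
// With NDEBUG defined (typical release builds) the assert compiles to nothing.
template <typename T>
void CheckNoNaN(const Tuple3Sketch<T> &v) {
    assert(!v.HasNaN());
}
```

This would be consistent with the release build printing invalid-ray errors while only the debug build stops on the assertion, assuming the assertion is compiled out in release.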

I kept digging deeper into the code and found that this strange behavior comes from the following line: Point3f pc = movingFromCamera.ApplyInverse(Point3f(w.pi)); with pc sometimes holding overflowed values... Maybe I am totally wrong but I'll try digging further tomorrow evening.

Thanks for the feedback !

pbrt4bounty commented 1 year ago

It's possible that you need to review the installed CUDA & OptiX combo. Here I'm using CUDA 11.6 + OptiX 7.4.

TiernoGs commented 1 year ago

Ok, I tested PBRT with CUDA 11.6 and OptiX 7.4. I get the same assertions, but this time the values are not NaNs. The rendered image is still black, but when printing the BSDF (using PBRT_DBG_LOGGING) I get correct Conductor/Dielectric BxDF values, so that's progress. However, pc still holds overflowed values in the GenerateCameraRays program.

When printing radiance values from the UpdateFilm method, I still get these zero values:

Adding Lw 0.000000 0.000000 0.000000 0.000000 at pixel (24, 182)
Adding Lw 0.000000 0.000000 0.000000 0.000000 at pixel (25, 182)
Adding Lw 0.000000 0.000000 0.000000 0.000000 at pixel (26, 182)
Adding Lw 0.000000 0.000000 0.000000 0.000000 at pixel (27, 182)
Adding Lw 0.000000 0.000000 0.000000 0.000000 at pixel (28, 182)
Adding Lw 0.000000 0.000000 0.000000 0.000000 at pixel (29, 182)
Adding Lw 0.000000 0.000000 0.000000 0.000000 at pixel (30, 182)
Adding Lw 0.000000 0.000000 0.000000 0.000000 at pixel (31, 182)

The result is not affected by changing the CUDA / OptiX combo: I also tried OptiX 7.6 and kept getting these black images. Is there something I'm missing?

pbrt4bounty commented 1 year ago

Interesting... I'm too busy to do more tests, but I'll try it next weekend

mmp commented 1 year ago

I strongly suspect that the issue is in this code in src/pbrt/gpu/aggregate.cpp (and possibly in other related places in that file):

            input.triangleArray.vertexFormat = OPTIX_VERTEX_FORMAT_FLOAT3;
            input.triangleArray.vertexStrideInBytes = sizeof(Point3f);
            input.triangleArray.numVertices = mesh->nVertices;
            Point3f *pGPU;
            CUDA_CHECK(cudaMalloc(&pGPU, mesh->nVertices * sizeof(Point3f)));
            CUDA_CHECK(cudaMemcpy(pGPU, mesh->p, mesh->nVertices * sizeof(Point3f),
                                  cudaMemcpyHostToDevice));

OptiX only takes float32 and doesn't support double precision. However, with PBRT_FLOAT_AS_DOUBLE, Point3f is 3 doubles, yet it's being copied directly and given to OptiX, which assumes it is floats.
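
To make the mismatch concrete, here is a minimal host-side sketch (plain C++, no OptiX involved) of what happens when a buffer of doubles is read as packed floats. ReadAsFloat3 is a hypothetical helper written for this illustration. It only mimics a consumer that reads three 4-byte floats at each stride offset, which is an assumption about how a float3 vertex format is interpreted, not OptiX's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: reads `count` vertices from a raw buffer the way a
// consumer expecting a float3 format would, i.e. three 4-byte floats
// starting at every `strideInBytes` offset.
std::vector<float> ReadAsFloat3(const void *data, std::size_t count,
                                std::size_t strideInBytes) {
    assert(strideInBytes >= 3 * sizeof(float));
    std::vector<float> out(3 * count);
    const unsigned char *bytes = static_cast<const unsigned char *>(data);
    for (std::size_t i = 0; i < count; ++i) {
        // memcpy avoids aliasing problems when reinterpreting raw bytes.
        std::memcpy(&out[3 * i], bytes + i * strideInBytes, 3 * sizeof(float));
    }
    return out;
}
```

Calling this on an array of doubles with a stride of 3 * sizeof(double) returns the first 12 bytes of each 24-byte vertex reinterpreted as floats, so the values no longer match the original coordinates. Calling it on a real float array with a stride of 3 * sizeof(float) returns the coordinates unchanged.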

I'm unable to dig into this further for another week or two but it should be a straightforward fix; allocate an array of floats of sufficient size, copy the Point3f values into that, then give that to OptiX instead. (And make sure that input.triangleArray.vertexStrideInBytes is 3*sizeof(float)==12.) From a quick skim of that file I don't see any other suspicious places that do similar invalid casts, so that might be enough to do it.
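
The conversion step described above can be sketched on the host side as follows. This is only an illustration of the idea, not the fix that was eventually committed: Point3d, PackAsFloat3 and kFloat3StrideInBytes are hypothetical names, with Point3d standing in for what Point3f becomes under PBRT_FLOAT_AS_DOUBLE.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Point3f when it holds three doubles.
struct Point3d {
    double x, y, z;
};

// Stride to report for the packed buffer: three 4-byte floats per vertex.
constexpr std::size_t kFloat3StrideInBytes = 3 * sizeof(float);
static_assert(kFloat3StrideInBytes == 12, "expects 4-byte floats");

// Narrows each coordinate to float and packs them tightly:
// x0 y0 z0 x1 y1 z1 ...
std::vector<float> PackAsFloat3(const Point3d *p, std::size_t n) {
    assert(p != nullptr || n == 0);
    std::vector<float> packed;
    packed.reserve(3 * n);
    for (std::size_t i = 0; i < n; ++i) {
        packed.push_back(static_cast<float>(p[i].x));
        packed.push_back(static_cast<float>(p[i].y));
        packed.push_back(static_cast<float>(p[i].z));
    }
    return packed;
}
```

Under these assumptions, the packed buffer (packed.data(), packed.size() * sizeof(float) bytes) is what would be passed to cudaMalloc / cudaMemcpy in place of mesh->p, with vertexStrideInBytes set to kFloat3StrideInBytes.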

TiernoGs commented 1 year ago

Thank you for your two replies. I managed to isolate part of the issue in the optix.cu file. When launching the __raygen__findClosest program, I found that optixTrace never reported an intersection, so every ray was treated as a miss (and then handed to the HandleEscapedRays method).

As a result, I tried rendering the same scene with a default infinite light (by adding this line to the pbrt file):

LightSource "infinite"

Here is the rendered image (EXR converted to png, both spheres can be seen): renderDoubleInfinite

(spp1, jitter on wavelengths is active)

@mmp, I modified the triangleArray setup to create the pGPU array with the CUDATracked memory resource and made it leak on purpose. It gave me the same render.

Since I'm getting an image from missed rays only, something must be going wrong between the BVH construction and the OptiX traversable at runtime.

Still searching... I'll keep you updated :)

TiernoGs commented 1 year ago

Coming back with some updates! Apparently, using a single-precision (float) Point3f works with a 3080 Ti. I had to reinstall CUDA and OptiX from scratch. Also, it seems to work with CUDA 11.6 and OptiX 7.4 only (I haven't found the reason yet, as every other combination fails).

However, rendering with a 2060 still gives the previous image. Since it is working with the 3080, I'll consider it resolved :) Thanks again for the help!

mmp commented 1 year ago

I have fixed the code in pbrt/gpu/aggregate.cpp for PBRT_FLOAT_AS_DOUBLE in case others run into this...

mmp / pbrt-v4

Double precision is not working... Optix BVH ? #359