Problems with (not only) GCC builds

Vutshi commented 3 years ago

Hi,

I see strange behaviour of recent pbrt versions build with gcc (in ubuntu):

gcc-8.4 has stuck forever on ZSobolSampler.ValidIndices test
gcc-9.3 passed all tests

gcc-10.3 gives me the following message (in an eternal cycle) during tests

/src/pbrt/samplers_test.cpp:182: Failure
Value of: returnedIndices.find(index) == returnedIndices.end()
Actual: false
Expected: true

Furthermore, the output images rendered with a fixed seed are slightly different for these three versions of GCC. And they all differ from clang build pbrt in macOS.

I used pavilion-night.pbrt with regularised BDPT integrator and 1024 spp for the tests.

UPDATE: I have tested more compilers.

Clang-12 on Ubuntu 20.04 comes very close to my clang-12 result on macOS 11.5.2. They agree much better than GCC and clang.
Also I have results from non-intel VLIW cpu (Elbrus) and its proprietary compiler LCC which agrees even better with macOS clang, although it is still not a perfect match.

Should we expect perfect reproducibility with a fixed seed or some variations are still allowed?

Vutshi commented 3 years ago

Maximum absolute error:
Clang-12 macOS vs Lcc 1.25 Elbrus
0.613

Clang-12 macOS vs Clang-12 Ubuntu
2.012

Clang-12 macOS vs Gcc-10 Ubuntu
10.875

Clang-12 macOS vs Lcc 1.25 Elbrus Clang-12 macOS vs Lcc 1 25 Elbrus

Clang-12 macOS vs Clang-12 Ubuntu

Clang-12 macOS vs Gcc-10 Ubuntu

Gcc-10 vs gcc-9

Gcc-10 vs gcc-8

trevordblack commented 3 years ago

Were these run on gpu with the --gpu option?

Do you have any idea where the rand is coming from?

My educated guess is that compilers will differ subtly on the implementation of standard libraries. But I'm not seeing much (if any of that) in the sources of random for your scene: Halton, bdpt.

the uniform rng code from src/pbrt/util/rng.h doesn't have anything that should be compiler dependent (unless they're doing sneaky stuff with fp rounding errors)

The remaining rng seems to stem from src/pbrt/util/lowdiscrepancy.h

Vutshi commented 3 years ago

These are CPU only runs. Implementation of standard libraries is unlikely to be the problem here, at least it doesn't explain problems with tests. furthermore, when I tried something tiny like smallpt or smallmmlt renderers I always got exact match even between intel and non-x86 platforms if the same compilation options are used.

mmp commented 3 years ago

Thanks for reporting this (and sorry for not chasing it down sooner.)

On those pbrt_test issues: that turned out to be a signed integer multiplication that was overflowing, which is undefined behavior in C++. It seems that the compiler effectively decided that the value of dim in this loop: https://github.com/mmp/pbrt-v4/blob/21115e8116790f06fd5b690f04e0b017b2166c37/src/pbrt/samplers_test.cpp#L173 could only ever be 0 (since for dim=3 and beyond, the overflow would happen), and so that loop was turned into an infinite loop. Yaay for undefined behavior in C++. That has now been fixed.

On those image differences, it is not expected that things will 100% match across compilers. One issue is that different math libraries may return different results for transcendental functions. Another is that there are cases in the code where order of evaluation is compiler defined--stuff like func(generateRandomNumber(), generateRandomNumber()) where the sequence of those two calls is compiler-defined. (Most of those have been fixed, but there are probably still some.)

The important question is whether they converge to the same image (and do so at the same rate); I've done some tests with that scene and those three compilers, and the differences are extremely minor numerically and so in the realm of what is expected.

Vutshi commented 3 years ago

It seems PBRT-v5 asks to be written in Rust :)

mmp / pbrt-v4

Problems with (not only) GCC builds #177