Closed winterismute closed 1 year ago
I compiled without problems on a Mac mini M2, arm64 with clang 14.0.3 under Ventura 13 but did not run any tests because the binaries work fine here. I can try later, to confirm
Actually, it was rather easy to trigger the same assert while rendering, for example rendering a scene such as villa-lights-on
with the MLT integrator, even at 1 spp should trigger it. I can post full repro instructions if you don't hit it.
Unfortunately I can't reproduce it here on an M1 mac running Monterey. Could you do a debug build of pbrt and then post the output when it crashes (for both cases.) Hopefully an earlier assertion will hit, which would make it easier to track down the problem.
@mmp could you share the clang version you use and for which everything works on Apple Silicon?
% g++ -v
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 13.0.0 (clang-1300.0.29.3)
Target: arm64-apple-darwin21.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Thank you. So, I think the code might be affected by this clang change: https://releases.llvm.org/14.0.0/tools/clang/docs/ReleaseNotes.html#floating-point-support-in-clang
From 14.0.0, ffp-contract
is set to on, while before it was effectively off. This change affects ARM platforms more than x86 because, I think, x86 produces the fused ops only when some of the architectural-related optimizations are on - which I am not sure if you turn on by default either. The situation is complicated even more by the fact that GCC still, I believe, sets this to off by default for standard C++ compilation. In any case, I added to the CMakeList
list (APPEND PBRT_CXX_FLAGS "-ffp-contract=off")
then I rebuilt and all tests now pass and I can render the scene that was crashing before (essentially, villa-lights-on
with mlt set as integrator).
This being said, it is odd that the introduction of eg. fmads leads to asserts while rendering, since I believe their floating-point "accuracy" should be even better? I wonder if you can reproduce the scenario by forcing that same option to on
or fast
on either x86 or Apple Silicon, and if you are interested in debugging it.
Hope this helps!
Thank you. So, I think the code might be affected by this clang change: https://releases.llvm.org/14.0.0/tools/clang/docs/ReleaseNotes.html#floating-point-support-in-clang From 14.0.0,
ffp-contract
is set to on, while before it was effectively off. This change affects ARM platforms more than x86 because, I think, x86 produces the fused ops only when some of the architectural-related optimizations are on - which I am not sure if you turn on by default either. The situation is complicated even more by the fact that GCC still, I believe, sets this to off by default for standard C++ compilation. In any case, I added to the CMakeListlist (APPEND PBRT_CXX_FLAGS "-ffp-contract=off")
then I rebuilt and all tests now pass and I can render the scene that was crashing before (essentially,
villa-lights-on
with mlt set as integrator). This being said, it is odd that the introduction of eg. fmads leads to asserts while rendering, since I believe their floating-point "accuracy" should be even better? I wonder if you can reproduce the scenario by forcing that same option toon
orfast
on either x86 or Apple Silicon, and if you are interested in debugging it. Hope this helps!
Thank you that fixed it for me. I'm on a 2022 Macbook Pro M1 Max and had the same error. The error I had was on film.cpp:500 RGBFilm::addSplat()
CHECK(!L.HasNaNs());
I'm glad that fixed it but there still may be a bug in pbrt; changing that setting might just be masking the actual bug.
Could you please do a debug build and post pbrt's full output on one of those crashes, so that I have a chance of digging in to whether there's a pbrt bug?
I haven't dug in yet but I can confirm the problem is reproducible on my machine.
I'm running a 12th gen intel part running ubuntu linux (a 14 inch framework laptop)
trevor@tdb:~/pbrt-v4/build$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core(TM) i7-1260P
CPU family: 6
Model: 154
Thread(s) per core: 2
Core(s) per socket: 12
...
Running through the above, I'll try the fix given above, and see if I can't run a debug build to root-cause.
Where the error I run into is:
[ RUN ] AnalyticTestScenes/RenderTest.RadianceMatches/54
[ tid 37991 @ 0.000s film.cpp:500 ] FATAL Check failed: !L.HasNaNs()
(./pbrt_test ) 0x0x555a70fde92a - pbrt::PrintStackTrace() + 0x2a
(./pbrt_test ) 0x0x555a70fdeb22 - pbrt::CheckCallbackScope::Fail() + 0x12
(./pbrt_test ) 0x0x555a71032f07 - pbrt::LogFatal(pbrt::LogLevel, char const*, int, char const*) + 0x1a7
(./pbrt_test ) 0x0x555a70e23e5c - void pbrt::LogFatal<char const (&) [13]>(pbrt::LogLevel, char const*, int, char const*, char const (&) [13]) + 0x4c
(./pbrt_test ) 0x0x555a70e1cadb - (unknown) + 0x33aadb
(./pbrt_test ) 0x0x555a70fb24d5 - (unknown) + 0x4d04d5
(./pbrt_test ) 0x0x555a70ce7874 - std::_Function_handler<void (long, long), pbrt::ParallelFor(long, long, std::function<void (long)>)::{lambda(long, long)#1}>::_M_invoke(std::_Any_data const&, long&&, long&&) + 0x34
(./pbrt_test ) 0x0x555a71063352 - pbrt::ParallelForLoop1D::RunStep(std::unique_lock<std::mutex>*) + 0x92
(./pbrt_test ) 0x0x555a71062a47 - pbrt::ThreadPool::WorkOrWait(std::unique_lock<std::mutex>*, bool) + 0x67
(./pbrt_test ) 0x0x555a71062973 - pbrt::ThreadPool::Worker() + 0x63
(/lib/x86_64-linux-gnu/libstdc++.so.6 ) 0x0x7f0d4f0dc253 - (unknown) + 0xdc253
(/lib/x86_64-linux-gnu/libc.so.6 ) 0x0x7f0d4ec94b43 - (unknown) + 0x94b43
(/lib/x86_64-linux-gnu/libc.so.6 ) 0x0x7f0d4ed26a00 - (unknown) + 0x126a00
Rendering failed. Debug with --debugstart 0,0.94294167,0,0,0.7749876,0,0,0.16090861,0,0,0.076150656,0,0,0.5543,0,0,0.35725263,0"
Aborted (core dumped)
Ran with a debug build:
[ RUN ] AnalyticTestScenes/RenderTest.RadianceMatches/54
[ tid 44208 @ 0.000s util/sampling.cpp:626 ] FATAL Check failed: bins[offset].p > 0 with bins[offset].p = 0, 0 = 0
(./pbrt_test ) 0x0x5560fe0575ac - pbrt::PrintStackTrace() + 0x1c
(./pbrt_test ) 0x0x5560fe057990 - pbrt::CheckCallbackScope::Fail() + 0x10
(./pbrt_test ) 0x0x5560fe0be98e - pbrt::LogFatal(pbrt::LogLevel, char const*, int, char const*) + 0x17e
(./pbrt_test ) 0x0x5560fe0f7ed2 - void pbrt::LogFatal<char const (&) [15], char const (&) [2], char const (&) [15], float&, char const (&) [2], int&>(pbrt::LogLevel, char const*, int, char const*, char const (&) [15], char const (&) [2], char const (&) [15], float&, char const (&) [2], int&) + 0x102
(./pbrt_test ) 0x0x5560fe1126c5 - pbrt::AliasTable::Sample(float, float*, float*) const + 0x1f5
(./pbrt_test ) 0x0x5560fe00fc2e - (unknown) + 0x914c2e
(./pbrt_test ) 0x0x5560fe00fb14 - (unknown) + 0x914b14
(./pbrt_test ) 0x0x5560fe00fab2 - (unknown) + 0x914ab2
(./pbrt_test ) 0x0x5560fe00f972 - (unknown) + 0x914972
(./pbrt_test ) 0x0x5560fdc42bc5 - std::function<void (long)>::operator()(long) const + 0x55
(./pbrt_test ) 0x0x5560fdc42b52 - pbrt::ParallelFor(long, long, std::function<void (long)>)::{lambda(long, long)#1}::operator()(long, long) const + 0x42
(./pbrt_test ) 0x0x5560fdc42afa - void std::__invoke_impl<void, pbrt::ParallelFor(long, long, std::function<void (long)>)::{lambda(long, long)#1}&, long, long>(std::__invoke_other, pbrt::ParallelFor(long, long, std::function<void (long)>)::{lambda(long, long)#1}&, long&&, long&&) + 0x4a
(./pbrt_test ) 0x0x5560fdc42a67 - std::enable_if<is_invocable_r_v<void, pbrt::ParallelFor(long, long, std::function<void (long)>)::{lambda(long, long)#1}&, long, long>, void>::type std::__invoke_r<void, pbrt::ParallelFor(long, long, std::function<void (long)>)::{lambda(long, long)#1}&, long, long>(pbrt::ParallelFor(long, long, std::function<void (long)>)::{lambda(long, long)#1}&, long&&, long&&) + 0x47
(./pbrt_test ) 0x0x5560fdc42947 - std::_Function_handler<void (long, long), pbrt::ParallelFor(long, long, std::function<void (long)>)::{lambda(long, long)#1}>::_M_invoke(std::_Any_data const&, long&&, long&&) + 0x47
(./pbrt_test ) 0x0x5560fe100f6a - std::function<void (long, long)>::operator()(long, long) const + 0x6a
(./pbrt_test ) 0x0x5560fe0ffb3b - pbrt::ParallelForLoop1D::RunStep(std::unique_lock<std::mutex>*) + 0x8b
(./pbrt_test ) 0x0x5560fe0ff104 - pbrt::ThreadPool::WorkOrWait(std::unique_lock<std::mutex>*, bool) + 0xf4
(./pbrt_test ) 0x0x5560fe0fef95 - pbrt::ThreadPool::Worker() + 0x75
(./pbrt_test ) 0x0x5560fe102cee - void std::__invoke_impl<void, void (pbrt::ThreadPool::*)(), pbrt::ThreadPool*>(std::__invoke_memfun_deref, void (pbrt::ThreadPool::*&&)(), pbrt::ThreadPool*&&) + 0x6e
(./pbrt_test ) 0x0x5560fe102c02 - std::__invoke_result<void (pbrt::ThreadPool::*)(), pbrt::ThreadPool*>::type std::__invoke<void (pbrt::ThreadPool::*)(), pbrt::ThreadPool*>(void (pbrt::ThreadPool::*&&)(), pbrt::ThreadPool*&&) + 0x32
(./pbrt_test ) 0x0x5560fe102bc2 - void std::thread::_Invoker<std::tuple<void (pbrt::ThreadPool::*)(), pbrt::ThreadPool*> >::_M_invoke<0ul, 1ul>(std::_Index_tuple<0ul, 1ul>) + 0x42
(./pbrt_test ) 0x0x5560fe102b75 - std::thread::_Invoker<std::tuple<void (pbrt::ThreadPool::*)(), pbrt::ThreadPool*> >::operator()() + 0x15
(./pbrt_test ) 0x0x5560fe1029d9 - std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (pbrt::ThreadPool::*)(), pbrt::ThreadPool*> > >::_M_run() + 0x19
(/lib/x86_64-linux-gnu/libstdc++.so.6 ) 0x0x7fa39eedc253 - (unknown) + 0xdc253
(/lib/x86_64-linux-gnu/libc.so.6 ) 0x0x7fa39ea94b43 - (unknown) + 0x94b43
(/lib/x86_64-linux-gnu/libc.so.6 ) 0x0x7fa39eb26a00 - (unknown) + 0x126a00
Rendering failed. Debug with --debugstart 1,0"
I got as far as MLTIntegrator::Render()
calling
int bootstrapIndex = bootstrapTable.Sample(rng.Uniform<Float>());
to
int AliasTable::Sample(Float u, Float *pmf, Float *uRemapped) const {
// Compute alias table _offset_ and remapped random sample _up_
int offset = std::min<int>(u * bins.size(), bins.size() - 1);
Float up = std::min<Float>(u * bins.size() - offset, OneMinusEpsilon);
if (up < bins[offset].q) {
// Return sample for alias table at _offset_
DCHECK_GT(bins[offset].p, 0);
good luck with the debugging
@winterismute you were right about the ffp-contract
stuff, though I'm glad I took the time to chase down the exact problem. In short, given these lines of code:
int offset = std::min<int>(u * bins.size(), bins.size() - 1);
Float up = std::min<Float>(u * bins.size() - offset, OneMinusEpsilon);
clang would use an unasked-for FMA instruction that led to up
being negative in some cases, even if u
was between 0 and 1. That in turn led to the assertion failures. That general form of computation is used a lot in pbrt's sampling code so indeed, the fix was to disable that silly new ffp-contract
default...
I was trying pbrt on macos on Apple Silicon, the project builds ok with eg.
However, if I run
pbrt_test
I see:I see you have an ARM osx machine on the gihub CI, but maybe you only check compilation there? My clang --version:
I could replicate the error on another M2 machine (this is an M1 one).
It does unfortunately seem to be an AS-related issue: if I first do
/usr/bin/arch -arch x86_64 /bin/zsh
in the terminal, triggering Rosetta, and repeat all of the above, I don't see the failure.Any chance you might look at this problem?