Closed madmann91 closed 4 years ago
This is so cool! I'm evaluating it right now, I'll post my comments shortly. Looking great at first glance!
Okay, I'm speechless here. I just quickly threw the `INTERPROCEDURAL_OPTIMIZATION` flag into my `CMakeLists.txt` to test it out, and what? The `hdr.json` scene rendered 60% faster than ever before.
I had tried adding `-flto` to the linker flags before, resulting in a slightly smaller binary, but this is just an incredible performance boost from such a small change.
I'm still exploring this PR, but I can already tell you you've made my year with this. I am ecstatic!
Okay, I've calmed down a bit and looked through all the code, and I liked everything I saw. This will be merged as-is.
A few remarks:
> I recommend to either merge those meshes, or to build a top-level BVH that has per-mesh BVHs as leaves.

I think we should go for a top-level BVH here; that feels like the more appropriate solution.

> This PR does not implement this feature, as it would break the compatibility with the existing acceleration data structure.

You are more than welcome to implement this, changing the existing data structures as you see fit.
I'd love to know which tools you used for profiling as well. I've used the sampling profiler that comes with Xcode (built on DTrace), but apparently I wasn't that good at interpreting the results, as I had no idea it was as bad as you've shown it to be. 😄 I think I just failed to identify the massive amount of function call overhead.
I think the plan going forward would be something like this: update the `CMakeLists.txt` file to require v3.9 and set the `INTERPROCEDURAL_OPTIMIZATION` flag as a stopgap until the heavy functions are inlined in code. As is, just adding that flag boosts performance by ~60% while only slightly increasing build times.

What an incredible PR, thanks so much for helping out!
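For reference, a minimal sketch of what that `CMakeLists.txt` change could look like — the target name `c-ray` and the use of the `CheckIPOSupported` module are my assumptions here, not taken from the PR:

```cmake
cmake_minimum_required(VERSION 3.9)

# Only enable IPO/LTO when the toolchain actually supports it;
# CheckIPOSupported ships with CMake 3.9 and newer.
include(CheckIPOSupported)
check_ipo_supported(RESULT ipo_supported)
if(ipo_supported)
  set_property(TARGET c-ray PROPERTY INTERPROCEDURAL_OPTIMIZATION TRUE)
endif()
```

The `check_ipo_supported` guard avoids a hard configure failure on compilers without LTO support.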
@madmann91 If it's okay by you, I would like to tag you in the README as a contributor.
I use `perf` for profiling. It's pretty low-level, but it can perform sampling and can report statistics for cache misses, number of instructions executed, and so on. It's also not intrusive, and has very little overhead compared to other approaches like `gprof`. To sample an executable, use `perf record <prog> <args>` (where `prog` and `args` are the program name and its command-line arguments, respectively), and to show the result, use `perf report`. To get statistics, use `perf stat <prog> <args>`. You might want to compile in `RelWithDebInfo` to get valuable annotations in the `perf` report; otherwise you'll only see assembly without line info.
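Putting that together — and assuming the binary ends up at `./build/c-ray` and takes `hdr.json` as its argument (both paths are illustrative) — a typical session looks like:

```shell
# Build with optimizations plus debug info so perf can annotate source lines
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -B build && cmake --build build

# Record a sampling profile of one render, then browse the hotspots
perf record ./build/c-ray hdr.json
perf report

# Print hardware-counter statistics: instructions, cache misses, and so on
perf stat ./build/c-ray hdr.json
```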
Thank you for the attribution request, I'm fine with it. In any case it's going to be visible on GitHub's contributors page.
I'd be curious to know what the performance difference is between the new BVH implementation and the old one on more complicated scenes. Do you have some figures? And also, do you have any complicated test scene (ideally a single but highly tessellated mesh with a non-uniform primitive distribution, like Sponza) that I could play around with?
@madmann91 I'll do a lot more testing in the coming days. The most complex scene I have right now is the stormtrooper scene that you see in the readme, but that one also has the meshes separate, and no top-level acceleration structure of course. So I guess that feature is what should be tackled next.
You can download the stormtrooper scene here if you want. And if you do end up changing scenes, or making new ones, use the `scripts/bundler.py` script to package them up like the zip in the download link.
I'll keep you posted on the performance metrics I measure once I have them.
I'm also blown away by your generosity here, @madmann91! Checked out your bvh project, it looks very impressive :)
@lycium Thank you! It's not much really. This project looked interesting, so I thought I'd just contribute a bit.
This PR implements a binned SAH BVH builder, based on "On fast Construction of SAH-based Bounding Volume Hierarchies" by I. Wald. The existing kd-tree/BVH implementation has been kept intact for comparison purposes; it can be enabled again by defining the preprocessor symbol `OLD_KD_TREE`.

I have only the provided test scene to measure performance with, and I cannot really tell if there's a measurable performance improvement, because:
More than 40% of the time is spent in small functions like `vecSub`, while ray traversal is only 4% of the execution time. Thus, even if this PR made traversal infinitely faster, not much of an improvement would be visible. This is due to the way the code is organized: most of those small functions live in separate compilation units, so they are unavailable to the inliner. You have basically two options to fix this: use LTO (on my machine it multiplies performance by almost 3), or move these small functions into their respective header files with the `static inline` qualifiers to avoid linking problems. I would recommend the latter route, as it gives you more control over what is inlined and what isn't, and doesn't blow up your compilation times. To enable LTO, you can bump your CMake requirement to something more modern (e.g. 3.9) and use the following:

```cmake
set_property(TARGET c-ray PROPERTY INTERPROCEDURAL_OPTIMIZATION TRUE)
```

Note that this will not work on old CMake versions (e.g. 2.8 will not enable LTO for GCC). If you want to use LTO but must stick to an old CMake version, you need to append `-flto` to both the compiler and linker flags for GCC (old versions of GCC might not support LTO).

The second problem is that the test scene you provide has around 10 meshes that each have their own BVH, but all of them are traversed sequentially, completely nullifying the benefit of using an acceleration data structure altogether. I recommend either merging those meshes, or building a top-level BVH that has per-mesh BVHs as leaves. The code in this PR is designed to be able to do that, as only bounding boxes and centers are required to build a BVH (so it works for any primitive type, including other BVHs).
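As a sketch of the header route — note that this `vec3` layout and `vecSub` signature are assumptions for illustration, not c-ray's actual definitions:

```c
/* vector.h -- illustrative header-only math helper */
#ifndef VECTOR_H
#define VECTOR_H

struct vec3 { float x, y, z; };

/* 'static inline' gives every translation unit its own copy, so the
   compiler can inline the call without LTO, and the linker never sees
   duplicate symbols. */
static inline struct vec3 vecSub(struct vec3 a, struct vec3 b) {
    return (struct vec3){ a.x - b.x, a.y - b.y, a.z - b.z };
}

#endif /* VECTOR_H */
```

Any `.c` file that includes this header can then have `vecSub` inlined by the compiler directly, with no help from the linker.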
Finally, there's a large optimization opportunity that's not taken advantage of, namely early exit for shadow rays. For that, I have added a `TODO` that suggests adding a `tMin` and `tMax` to your rays, so that you can actually cull nodes more efficiently during traversal: shadow rays can exit the traversal loop early, upon the first intersection found in the given `[tMin, tMax]` range. This generally gives a 20% performance boost when implemented correctly. This PR does not implement the feature, as it would break compatibility with the existing acceleration data structure.