Closed ghost closed 4 years ago
Will have a look, thank you for the PR.
I cannot get Fast-BVH to render the correct picture. See attached renderings. The correct one is from this library, the other one is from Fast-BVH. This might be due to backface culling on your end, but I'm not really sure. It could also come from an incorrect BVH or an incorrect traversal routine.
Before we go any further, please make sure to get the same picture with the same scene (I suggest https://github.com/jimmiebergmann/Sponza --- this is not the version I used in the pictures below or in the table on the front page but I cannot find the sources of that anymore, as I'm not on my work machine ATM, and that will last until May in the best case scenario). The camera settings are exactly the same as in the modified benchmark in the examples
directory of this repository. Provided you did not change the camera code as well in the new version of Fast-BVH, this should be easy to get right.
Thanks for the response! I'll fix this in a short bit
Fixed! It was a problem with the intersection code. I have a benchmark program at https://github.com/tay10r/Fast-BVH-benchmark. The intersection code in the original repository needs a PR to work, so you can use my benchmark program or my fork of the project until it gets merged to upstream.
Here's a picture of the result.
If this ends up being a mistake on my part and your code still performs faster, I may end up just porting your code to Fast-BVH. Since PR 15, Fast-BVH supports multiple build algorithms. I was already planning on adding at least one other build algorithm, so yours could be added as well if it works better in some scenarios.
So, here are the numbers for 8000x8000, updated for the scene I mentioned in the post above, same POV:
Fast-BVH | This lib (fast mode) | |
---|---|---|
Construction: | 51ms | 48ms |
Traversal: | 7649ms | 1762ms |
I suspect that you were running the benchmark tool without the --fast-bvh-build
flag.
This flag disables the post-build optimizations. Interestingly, the post-build optimizations
worsen traversal performance on this scene. This is something that may happen, as the SAH is
only a heuristic anyway.
If you plan on working on Fast-BVH to improve its performance, allow me to give you a bit of advice:
Regarding the traversal implementation, you can also get more performance by dropping the indirection when looking up triangles, and by removing all that very branchy code out of your ray-node intersection code. Also order your ray by octant. I can't say that enough. The devil's in the details so only do that once you are already using a good construction algorithm (i.e. not the middle-split --- see the text above).
I did use the --fast-bvh-build
flag.
Here's a screenshot, showing the compiler flags as well.
It could be that I only have a 4 core CPU, and that the benchmark on the test is 8 core.
Regarding your notes about Fast-BVH. Currently, work is being done to implement LBVH as well as a LDC BVHs as described here both with and without Morton coding (I suspect without Morton codes, the BVH quality might be higher though the build time would be much slower.)
If your aim is to get the best traversal performance, go for a full sweep SAH construction algorithm with spatial splits.
Fast-BVH is a library meant for more than one application. I believe having multiple algorithms to choose from would be best. Full sweep SAH may be too expensive for real-time rendering. I plan on adding the algorithm that best suites real-time rending for my own needs, and anyone requiring higher-quality BVHs for static scenes can implement their own.
Well, this must then mean that your construction algorithm does not scale well this higher core counts. A simple LBVH should give you the same performance as splitting in the middle, with faster build times when properly implemented, and more importantly, much better scaling due to its bottom-up nature.
As an alternative, there are also algorithms that are adaptive, i.e, that can generate BVHs of varying quality given a control parameter. This allows you to keep using the same construction algorithm for real-time vs offline rendering. For instance, there's Parallel BVH Construction using Progressive Hierarchical Refinement. I think PLOC has a control parameter as well, but increasing it was not guaranteed to increase performance reliably across the tested scenes in the paper if I recall properly.
Also note that your current construction algorithm only considers the largest axis. This is not the case in this library, because it has a significant impact on traversal performance. If you disable that, you should observe similar times, even on your machine, because binning construction is essentially similar to a top-down middle-split strategy, algorithm-wise, it's just a linear partitioning step at every level of the recursion.
Well, this must then mean that your construction algorithm does not scale well this higher core counts.
Not my construction algorithm, but you're right it only uses one core.
As an alternative, there are also algorithms that are adaptive, i.e, that can generate BVHs of varying quality given a control parameter.
Interesting, I haven't seen that before. I don't plan on doing much more work on Fast-BVH once I get have a BVH algorithm that suites my needs, so I may not get around to implementing a BVH like the one you're mentioning. I'll keep it in mind if I ever end up needing it though.
At this point, I think the "issue" has been resolved. The discrepancy seems to be due to a difference in threading.
Alright. Closing the issue then.
Also, thank you for the advice. I'll have a look at your lecture and try to keep your points in mind going forward.
You're welcome. Let me know if you want to compare the two again, by re-opening the issue for instance. I'll be interested to have a look at what you come up with!
I'm still making comparisons between an LBVH build algorithm I made and your library.
./benchmark --builder locally_ordered_clustering --eye -1000 1000 0 --up 0 1 0 --dir 1 -1 0 --fov 60 ../../lbvh/models/sponza.obj
BVH construction took 102ms
524533 node(s)
Rendering image (1080x720)...
Rendering took 267ms
The BVH build time of the LBVH is at 25 ms, but that's without any post build optimizations. For some reason, the traversal time is around four seconds for the same resolution. I don't think it's the LBVH build quality, but I'm still debugging it.
Please let me know when it's ready. Also note that I have changed the camera code a little to have a horizontal FOV, set the proper screen limits [-1,1]
on X/Y (instead of [-0.5, 0.5]
), and remove the slight offset due to the division by width - 1
instead of width
in the original code. This might account for at least some difference.
I can't seem to fix the issue. In a branch of Fast-BVH, which never got merged, the LBVH algorithm had a traversal performance of about 1.1 seconds for the Sponza scene. Now it's at about 3.5 seconds and the build time is around 35ms. Feel free to look at it and critique. It doesn't really have comparable performance to this library unfortunately.
I took the advice you gave on ray octant ordering and removing triangle indirection. I couldn't seem to find the paper that talks about it (Garanzha and. Loop 2010?) so I just referenced your implementation.
So, I just had a quick look with perf
and it seems your traversal code is good. Here is the assembly of your ray-node intersection code, as shown by perf, with % of the time sampled on the left:
0.00 : 4066e5: mov %r10d,%r9d
0.93 : 4066e8: vxorps %xmm4,%xmm4,%xmm4
0.01 : 4066ec: shl $0x5,%r9
0.18 : 4066f0: add %r14,%r9
0.06 : 4066f3: vmovups (%r9),%xmm6
8.48 : 4066f8: mov 0x10(%r9),%r9
0.63 : 4066fc: mov %r9,-0x38(%rsp)
0.16 : 406701: mov %r13d,%r9d
0.01 : 406704: vmovaps %xmm6,-0x48(%rsp)
1.07 : 40670a: vmovss -0x48(%rsp,%rbx,4),%xmm0
8.08 : 406710: vmovss -0x48(%rsp,%rsi,4),%xmm1
0.24 : 406716: vsubss %xmm12,%xmm0,%xmm0
5.33 : 40671b: vsubss %xmm11,%xmm1,%xmm1
0.31 : 406720: vmulss %xmm7,%xmm0,%xmm0
3.67 : 406724: vmulss %xmm15,%xmm1,%xmm1
0.27 : 406729: vmaxss %xmm1,%xmm0,%xmm0
2.43 : 40672d: vmovss -0x48(%rsp,%r11,4),%xmm1
0.01 : 406734: vsubss %xmm14,%xmm1,%xmm1
0.04 : 406739: vmulss %xmm8,%xmm1,%xmm1
0.60 : 40673e: vmaxss %xmm1,%xmm0,%xmm9
2.57 : 406742: vmovss -0x48(%rsp,%r9,4),%xmm0
0.01 : 406749: vmovss -0x48(%rsp,%rdi,4),%xmm1
0.04 : 40674f: mov %ebp,%r9d
0.01 : 406752: vmaxss %xmm4,%xmm9,%xmm13
2.48 : 406756: vsubss %xmm11,%xmm0,%xmm0
0.02 : 40675b: vsubss %xmm12,%xmm1,%xmm1
0.02 : 406760: vmulss %xmm15,%xmm0,%xmm0
0.19 : 406765: vmulss %xmm7,%xmm1,%xmm1
0.41 : 406769: vminss %xmm1,%xmm0,%xmm0
0.76 : 40676d: vmovss -0x48(%rsp,%r9,4),%xmm1
0.02 : 406774: vsubss %xmm14,%xmm1,%xmm1
0.05 : 406779: vmulss %xmm8,%xmm1,%xmm1
1.22 : 40677e: vminss %xmm1,%xmm0,%xmm0
0.88 : 406782: mov 0x1c(%rdx),%r9d
0.05 : 406786: test %r9d,%r9d
As you can see, the code itself is pretty lean. Very few spills, and a lot of computation (a rule of thumb is that good code on x86/x86_64 should have more add/sub/mul/... than movs, in general).
The problem here is the number of samples on the left. For instance, 8% of the samples are on vmovss -0x48(%rsp,%rsi,4),%xmm1
. This means that you spend a lot of time on this region of the code, and in this example means that a lot of the time is spent loading node bounds (you can recognize this by the load with offset -0x48(%rsp,%rsi,4)
, which is produced by indexing using the octant).
What I got of all this is that the culprit is your BVH itself. A very easy way to determine that a BVH is awful (and I mean awful as a result of a bug, not just bad or of poor quality) is to check that the maximum number of primitives per leaf is below a small constant (e.g. 16). If you have a leaf with 100 primitives, not matter how good your traversal algorithm is, it will perform poorly.
One possible explanation for this is if your Morton codes are only 32-bit and there is not enough precision (an LBVH with this bit width is the equivalent of building a 1024x1024x1024 grid). In this case, just switch to 64-bit Morton codes.
Edit: Another cool trick for BVH construction debugging (also holds for other data structures), is to display the number of traversal steps + intersections per ray instead of coloring by the normal. That will help you spot the problem visually.
@madmann91 That is extremely helpful. I was just about to start profiling the code but I've never been able to get an instruction breakdown as you did. Which tool is this? I generally use callgrind. I see you mentioned it's perf
.
I have it setup so that 64-bit Morton codes can be used with 64-bit floating point types, but I haven't checked to see if this pairing is useful to the compiler. Maybe I'll end up switching to 64-bit codes for all types.
I'll give the trick you mentioned a shot. That sounds useful!
It does seem like two primitives per leaf is a bit low (no pun intended.) I'm going to see if there's a way to change this in the divide_node
function. I haven't seen it done elsewhere, as Karras paper only describes binary leafs, but it's worth a shot.
Thanks for your feedback! Extremely helpful
I have implemented a simple mechanism to collect traversal statistics in the benchmarking tool. Use ./benchmark --collect-statistics 0.001 0.01 ...
to render an image with 0.001 * traversal-steps
in the red channel and 0.01 * intersections
in the green channel. This should allow you to compare your BVH visually with whatever comes out of this library.
That's awesome! I can't wait to give this a shot
I ran your benchmark using a sponza model and compared it to the latest Fast-BVH after I PR I made there. I think results are a bit different at this point. Fast-BVH is now a template library and built the sponza model I had in 70 ms. In your library, it built the same model in 133 ms. Maybe you could verify the results. In my patch to Fast-BVH, I also included an .obj file renderer.