madmann91 / bvh

A modern C++ BVH construction and traversal library
MIT License
919 stars 88 forks

Fast-BVH Updates #1

Closed ghost closed 4 years ago

ghost commented 4 years ago

I ran your benchmark on a Sponza model and compared it to the latest Fast-BVH after a PR I made there. I think the results are a bit different at this point: Fast-BVH is now a template library and built my Sponza model in 70 ms, while your library built the same model in 133 ms. Maybe you could verify these results. In my patch to Fast-BVH, I also included an .obj file renderer.

madmann91 commented 4 years ago

Will have a look, thank you for the PR.

madmann91 commented 4 years ago

I cannot get Fast-BVH to render the correct picture. See attached renderings. The correct one is from this library, the other one is from Fast-BVH. This might be due to backface culling on your end, but I'm not really sure. It could also come from an incorrect BVH or an incorrect traversal routine.

Before we go any further, please make sure you get the same picture with the same scene (I suggest https://github.com/jimmiebergmann/Sponza --- this is not the version used in the pictures below or in the table on the front page, but I cannot find the sources of that one anymore, as I'm not on my work machine ATM, and that will remain the case until May in the best-case scenario). The camera settings are exactly the same as in the modified benchmark in the examples directory of this repository. Provided you did not also change the camera code in the new version of Fast-BVH, this should be easy to get right.

render render2

ghost commented 4 years ago

Thanks for the response! I'll fix this in a short bit.

ghost commented 4 years ago

Fixed! It was a problem with the intersection code. I have a benchmark program at https://github.com/tay10r/Fast-BVH-benchmark. The intersection code in the original repository needs a PR to work, so you can use my benchmark program or my fork of the project until it gets merged upstream.

Here's a picture of the result.

result

If this ends up being a mistake on my part and your code still performs faster, I may end up just porting your code to Fast-BVH. Since PR 15, Fast-BVH supports multiple build algorithms. I was already planning on adding at least one other build algorithm, so yours could be added as well if it works better in some scenarios.

madmann91 commented 4 years ago

So, here are the numbers for 8000x8000, updated for the scene I mentioned in the post above, same POV:

|              | Fast-BVH | This lib (fast mode) |
|--------------|----------|----------------------|
| Construction | 51ms     | 48ms                 |
| Traversal    | 7649ms   | 1762ms               |

I suspect that you were running the benchmark tool without the --fast-bvh-build flag. This flag disables the post-build optimizations. Interestingly, the post-build optimizations worsen traversal performance on this scene. This is something that may happen, as the SAH is only a heuristic anyway.

If you plan on working on Fast-BVH to improve its performance, allow me to give you a bit of advice:

Regarding the traversal implementation, you can also get more performance by dropping the indirection when looking up triangles, and by removing all that very branchy code from your ray-node intersection code. Also, order your traversal by ray octant --- I can't say that enough. The devil's in the details, so only do this once you are already using a good construction algorithm (i.e. not the middle split --- see the text above).
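To illustrate the branch-free ray-node test meant here (with a hypothetical data layout, not this library's actual API): precompute the inverse direction and the octant once per ray; the per-node slab test then reduces to subs/muls/min/max with no per-axis sign comparisons, and the same octant can also drive near-child-first traversal order.

```cpp
#include <cassert>
#include <cmath>

struct Node {
    // Bounds stored interleaved as min_x, max_x, min_y, max_y, min_z, max_z.
    float bounds[6];
};

struct Ray {
    float org[3];
    float inv_dir[3]; // 1 / dir, precomputed once per ray
    int octant[3];    // 0 if dir[i] >= 0, 1 otherwise, precomputed once per ray
};

inline bool intersect_node(const Node& node, const Ray& ray, float tmin, float tmax) {
    float entry = tmin, exit = tmax;
    for (int i = 0; i < 3; ++i) {
        // The octant selects the "near" and "far" bound per axis, so no branch
        // on the sign of the direction is needed.
        float t_near = (node.bounds[2 * i + ray.octant[i]]     - ray.org[i]) * ray.inv_dir[i];
        float t_far  = (node.bounds[2 * i + 1 - ray.octant[i]] - ray.org[i]) * ray.inv_dir[i];
        entry = std::fmax(t_near, entry);
        exit  = std::fmin(t_far,  exit);
    }
    return entry <= exit;
}
```

A compiler can turn this into straight-line min/max instructions, which is what the near-optimal assembly later in this thread looks like.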

ghost commented 4 years ago

I did use the --fast-bvh-build flag.

Here's a screenshot, showing the compiler flags as well.

Screenshot from 2020-03-30 12-18-07

It could be that I only have a 4-core CPU, and that your benchmark was run on an 8-core machine.

Regarding your notes about Fast-BVH: currently, work is being done to implement LBVH as well as LDC BVHs as described here, both with and without Morton coding (I suspect the BVH quality might be higher without Morton codes, though the build time would be much slower).

> If your aim is to get the best traversal performance, go for a full sweep SAH construction algorithm with spatial splits.

Fast-BVH is a library meant for more than one application, so I believe having multiple algorithms to choose from would be best. Full sweep SAH may be too expensive for real-time rendering. I plan on adding the algorithm that best suits real-time rendering for my own needs; anyone requiring higher-quality BVHs for static scenes can implement their own.

madmann91 commented 4 years ago

Well, this must then mean that your construction algorithm does not scale well to higher core counts. A simple LBVH should give you the same quality as splitting in the middle, with faster build times when properly implemented, and, more importantly, much better scaling due to its bottom-up nature.

As an alternative, there are also adaptive algorithms, i.e., algorithms that can generate BVHs of varying quality given a control parameter. This allows you to keep using the same construction algorithm for real-time vs. offline rendering. For instance, there's "Parallel BVH Construction using Progressive Hierarchical Refinement". I think PLOC has a control parameter as well, but if I recall correctly, increasing it was not guaranteed to improve performance reliably across the scenes tested in the paper.

madmann91 commented 4 years ago

Also note that your current construction algorithm only considers the largest axis. This library considers all three axes, because the choice of axis has a significant impact on traversal performance. If you disable that in this library, you should observe similar build times, even on your machine: algorithm-wise, binned construction is essentially a top-down middle-split strategy; it just adds a linear partitioning step at every level of the recursion.
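A sketch of what evaluating all three axes looks like in a binned SAH builder (illustrative code, not this library's implementation): bin primitive centers per axis, sweep the bins from both sides, and keep the cheapest split over all axes rather than only binning along the largest one.

```cpp
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <vector>

struct Aabb {
    float min[3] = { FLT_MAX, FLT_MAX, FLT_MAX };
    float max[3] = { -FLT_MAX, -FLT_MAX, -FLT_MAX };
    void extend(const Aabb& other) {
        for (int i = 0; i < 3; ++i) {
            min[i] = std::min(min[i], other.min[i]);
            max[i] = std::max(max[i], other.max[i]);
        }
    }
    float half_area() const {
        float ex = max[0] - min[0], ey = max[1] - min[1], ez = max[2] - min[2];
        return ex * (ey + ez) + ey * ez;
    }
    float center(int axis) const { return 0.5f * (min[axis] + max[axis]); }
};

struct Split { int axis = -1, bin = -1; float cost = FLT_MAX; };

static constexpr int bin_count = 16;

// `prims` holds per-primitive bounding boxes; `centroid_bounds` is the
// bounding box of their centers.
Split find_best_split(const std::vector<Aabb>& prims, const Aabb& centroid_bounds) {
    Split best;
    for (int axis = 0; axis < 3; ++axis) {
        float lo = centroid_bounds.min[axis], hi = centroid_bounds.max[axis];
        if (!(hi - lo > 0.f)) continue; // all centers coincide on this axis
        float scale = bin_count / (hi - lo);
        Aabb bin_bounds[bin_count];
        int bin_counts[bin_count] = {};
        for (const auto& prim : prims) {
            int b = std::min(bin_count - 1, (int)((prim.center(axis) - lo) * scale));
            bin_bounds[b].extend(prim);
            bin_counts[b]++;
        }
        // Right-to-left sweep: SAH cost of the right side of each split plane.
        float right_cost[bin_count];
        Aabb right_box; int right_count = 0;
        for (int i = bin_count - 1; i > 0; --i) {
            right_box.extend(bin_bounds[i]);
            right_count += bin_counts[i];
            right_cost[i] = right_count ? right_box.half_area() * right_count : FLT_MAX;
        }
        // Left-to-right sweep: combine with the precomputed right costs.
        Aabb left_box; int left_count = 0;
        for (int i = 0; i < bin_count - 1; ++i) {
            left_box.extend(bin_bounds[i]);
            left_count += bin_counts[i];
            if (!left_count || right_cost[i + 1] == FLT_MAX) continue;
            float cost = left_box.half_area() * left_count + right_cost[i + 1];
            if (cost < best.cost) best = { axis, i + 1, cost };
        }
    }
    return best;
}
```

The partitioning step at each level is then a single linear pass moving primitives whose bin index falls below `best.bin` to the left side.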

ghost commented 4 years ago

> Well, this must then mean that your construction algorithm does not scale well to higher core counts.

Not my construction algorithm, but you're right it only uses one core.

> As an alternative, there are also algorithms that are adaptive, i.e., that can generate BVHs of varying quality given a control parameter.

Interesting, I haven't seen that before. I don't plan on doing much more work on Fast-BVH once I have a BVH algorithm that suits my needs, so I may not get around to implementing a BVH like the one you're mentioning. I'll keep it in mind if I ever end up needing it, though.

At this point, I think the "issue" has been resolved. The discrepancy seems to be due to a difference in threading.

madmann91 commented 4 years ago

Alright. Closing the issue then.

ghost commented 4 years ago

Also, thank you for the advice. I'll have a look at your lecture and try to keep your points in mind going forward.

madmann91 commented 4 years ago

You're welcome. Let me know if you want to compare the two again, by re-opening the issue for instance. I'll be interested to have a look at what you come up with!

ghost commented 4 years ago

I'm still making comparisons between an LBVH build algorithm I made and your library.

    ./benchmark --builder locally_ordered_clustering --eye -1000 1000 0 --up 0 1 0 --dir 1 -1 0 --fov 60 ../../lbvh/models/sponza.obj
    BVH construction took 102ms
    524533 node(s)
    Rendering image (1080x720)...
    Rendering took 267ms

The LBVH build time is at 25 ms, but that's without any post-build optimizations. For some reason, its traversal time is around four seconds at the same resolution. I don't think it's the LBVH build quality, but I'm still debugging it.

madmann91 commented 4 years ago

Please let me know when it's ready. Also note that I have changed the camera code a little: it now uses a horizontal FOV, sets the proper screen limits of [-1,1] on X/Y (instead of [-0.5, 0.5]), and removes the slight offset caused by dividing by width - 1 instead of width in the original code. This might account for at least some of the difference.
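For reference, a minimal primary-ray setup matching those conventions might look like this (names are illustrative, not this library's camera code): horizontal FOV, screen coordinates in [-1,1], and division by the full width/height with no `- 1` offset.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// `dir`, `right`, and `up` form the camera basis; the FOV is horizontal,
// so the vertical extent is scaled by h / w to preserve the aspect ratio.
inline Vec3 primary_ray_dir(int px, int py, int w, int h,
                            const Vec3& dir, const Vec3& right, const Vec3& up,
                            float horz_fov_radians) {
    float tan_half = std::tan(horz_fov_radians * 0.5f);
    // Pixel centers map to [-1,1]: note the division by w/h, not w-1/h-1.
    float u = (2.f * (px + 0.5f) / w - 1.f) * tan_half;
    float v = (2.f * (py + 0.5f) / h - 1.f) * tan_half * h / (float)w;
    return { dir.x + u * right.x + v * up.x,
             dir.y + u * right.y + v * up.y,
             dir.z + u * right.z + v * up.z };
}
```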

ghost commented 4 years ago

I can't seem to fix the issue. In a branch of Fast-BVH, which never got merged, the LBVH algorithm had a traversal performance of about 1.1 seconds for the Sponza scene. Now it's at about 3.5 seconds and the build time is around 35ms. Feel free to look at it and critique. It doesn't really have comparable performance to this library unfortunately.

I took your advice on ray octant ordering and removing the triangle indirection. I couldn't find the paper that talks about it (Garanzha and Loop 2010?), so I just referenced your implementation.

madmann91 commented 4 years ago

So, I just had a quick look with perf and it seems your traversal code is good. Here is the assembly of your ray-node intersection code, as shown by perf, with % of the time sampled on the left:

    0.00 :   4066e5: mov    %r10d,%r9d
    0.93 :   4066e8: vxorps %xmm4,%xmm4,%xmm4
    0.01 :   4066ec: shl    $0x5,%r9
    0.18 :   4066f0: add    %r14,%r9
    0.06 :   4066f3: vmovups (%r9),%xmm6
    8.48 :   4066f8: mov    0x10(%r9),%r9
    0.63 :   4066fc: mov    %r9,-0x38(%rsp)
    0.16 :   406701: mov    %r13d,%r9d
    0.01 :   406704: vmovaps %xmm6,-0x48(%rsp)
    1.07 :   40670a: vmovss -0x48(%rsp,%rbx,4),%xmm0
    8.08 :   406710: vmovss -0x48(%rsp,%rsi,4),%xmm1
    0.24 :   406716: vsubss %xmm12,%xmm0,%xmm0
    5.33 :   40671b: vsubss %xmm11,%xmm1,%xmm1
    0.31 :   406720: vmulss %xmm7,%xmm0,%xmm0
    3.67 :   406724: vmulss %xmm15,%xmm1,%xmm1
    0.27 :   406729: vmaxss %xmm1,%xmm0,%xmm0
    2.43 :   40672d: vmovss -0x48(%rsp,%r11,4),%xmm1
    0.01 :   406734: vsubss %xmm14,%xmm1,%xmm1
    0.04 :   406739: vmulss %xmm8,%xmm1,%xmm1
    0.60 :   40673e: vmaxss %xmm1,%xmm0,%xmm9
    2.57 :   406742: vmovss -0x48(%rsp,%r9,4),%xmm0
    0.01 :   406749: vmovss -0x48(%rsp,%rdi,4),%xmm1
    0.04 :   40674f: mov    %ebp,%r9d
    0.01 :   406752: vmaxss %xmm4,%xmm9,%xmm13
    2.48 :   406756: vsubss %xmm11,%xmm0,%xmm0
    0.02 :   40675b: vsubss %xmm12,%xmm1,%xmm1
    0.02 :   406760: vmulss %xmm15,%xmm0,%xmm0
    0.19 :   406765: vmulss %xmm7,%xmm1,%xmm1
    0.41 :   406769: vminss %xmm1,%xmm0,%xmm0
    0.76 :   40676d: vmovss -0x48(%rsp,%r9,4),%xmm1
    0.02 :   406774: vsubss %xmm14,%xmm1,%xmm1
    0.05 :   406779: vmulss %xmm8,%xmm1,%xmm1
    1.22 :   40677e: vminss %xmm1,%xmm0,%xmm0
    0.88 :   406782: mov    0x1c(%rdx),%r9d
    0.05 :   406786: test   %r9d,%r9d

As you can see, the code itself is pretty lean: very few spills, and a lot of computation (a rule of thumb is that good code on x86/x86_64 should, in general, have more add/sub/mul/... instructions than movs). The problem here is the number of samples on the left. For instance, 8% of the samples land on vmovss -0x48(%rsp,%rsi,4),%xmm1, which means a lot of time is spent in this region of the code --- in this example, loading node bounds (you can recognize this from the load with offset -0x48(%rsp,%rsi,4), which is produced by indexing using the octant).

What I get out of all this is that the culprit is the BVH itself. A very easy way to determine that a BVH is awful (and I mean awful as the result of a bug, not just bad or of poor quality) is to check that the maximum number of primitives per leaf is below a small constant (e.g. 16). If you have a leaf with 100 primitives, no matter how good your traversal algorithm is, it will perform poorly.
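A sanity check along those lines could look like this (the flat node layout is hypothetical, just enough to show the idea):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal stand-in for a flattened BVH node: only the fields the check needs.
struct FlatNode {
    bool is_leaf;
    size_t prim_count; // number of primitives, valid only for leaves
};

// Returns the largest leaf size in the tree; if this exceeds a small constant
// (e.g. 16), suspect a construction bug rather than mere low BVH quality.
inline size_t max_leaf_size(const std::vector<FlatNode>& nodes) {
    size_t worst = 0;
    for (const auto& node : nodes)
        if (node.is_leaf)
            worst = std::max(worst, node.prim_count);
    return worst;
}
```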

One possible explanation is that your Morton codes are only 32-bit and there is not enough precision (an LBVH with this bit width is equivalent to building on a 1024x1024x1024 grid). In this case, just switch to 64-bit Morton codes.
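A sketch of that switch: 64-bit Morton codes give 21 bits per axis (an implicit 2097152^3 grid) instead of the 10 bits per axis of 32-bit codes, using the standard magic-constant bit-expansion trick.

```cpp
#include <cassert>
#include <cstdint>

// Spread the lowest 21 bits of v so that there are two zero bits between
// consecutive input bits (classic "expand bits" sequence for 3D interleaving).
inline uint64_t expand_bits_21(uint64_t v) {
    v &= 0x1fffffull; // keep the lowest 21 bits
    v = (v | (v << 32)) & 0x1f00000000ffffull;
    v = (v | (v << 16)) & 0x1f0000ff0000ffull;
    v = (v | (v << 8))  & 0x100f00f00f00f00full;
    v = (v | (v << 4))  & 0x10c30c30c30c30c3ull;
    v = (v | (v << 2))  & 0x1249249249249249ull;
    return v;
}

// Interleave three 21-bit quantized coordinates into one 63-bit Morton code.
inline uint64_t morton_encode_64(uint32_t x, uint32_t y, uint32_t z) {
    return (expand_bits_21(z) << 2) | (expand_bits_21(y) << 1) | expand_bits_21(x);
}
```

Sorting primitives by these codes and building the radix tree bottom-up then proceeds exactly as with 32-bit codes, just with more splitting precision.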

Edit: Another cool trick for BVH construction debugging (also holds for other data structures), is to display the number of traversal steps + intersections per ray instead of coloring by the normal. That will help you spot the problem visually.
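Visualizing those counters can be as simple as replacing the normal-based shading with something like this (illustrative names and scale factors, not any tool's actual code): red encodes traversal steps, green encodes primitive intersection tests.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct DebugPixel { uint8_t r, g, b; };

// Map a counter to an 8-bit channel; `scale` is tuned by hand per scene so
// that typical counts land inside [0, 1] before the multiply by 255.
inline uint8_t to_channel(int count, float scale) {
    return (uint8_t)std::min(255.f, 255.f * scale * (float)count);
}

// One pixel of the debug "heat map": hot (bright) regions point at the parts
// of the BVH where rays do too much work.
inline DebugPixel shade_debug(int traversal_steps, int intersections,
                              float step_scale = 0.001f, float isect_scale = 0.01f) {
    return { to_channel(traversal_steps, step_scale),
             to_channel(intersections, isect_scale),
             0 };
}
```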

ghost commented 4 years ago

@madmann91 That is extremely helpful. I was just about to start profiling the code, but I've never been able to get an instruction-level breakdown like you did. Which tool is this? I generally use callgrind. (I see you mentioned it's perf.)

I have it setup so that 64-bit Morton codes can be used with 64-bit floating point types, but I haven't checked to see if this pairing is useful to the compiler. Maybe I'll end up switching to 64-bit codes for all types.

I'll give the trick you mentioned a shot. That sounds useful!

It does seem like two primitives per leaf is a bit low (no pun intended). I'm going to see if there's a way to change this in the divide_node function. I haven't seen it done elsewhere, as Karras' paper only describes binary leaves, but it's worth a shot.

Thanks for your feedback! Extremely helpful

madmann91 commented 4 years ago

I have implemented a simple mechanism to collect traversal statistics in the benchmarking tool. Use ./benchmark --collect-statistics 0.001 0.01 ... to render an image with 0.001 * traversal-steps in the red channel and 0.01 * intersections in the green channel. This should allow you to compare your BVH visually with whatever comes out of this library.

ghost commented 4 years ago

That's awesome! I can't wait to give this a shot