svenstaro / bvh

A fast BVH using SAH in rust
https://docs.rs/bvh
MIT License

Add traverse_iterator to BVH as more efficient alternative to traverse() -> Vec<Shape> #61

Closed · mdesmedt closed this 3 years ago

mdesmedt commented 3 years ago

While profiling my simple Rust implementation of "Ray Tracing in One Weekend", I noticed that a considerable CPU hotspot was heap allocation/free around BVH::traverse(). This is because traverse() allocates and returns a new Vec on every call.

I implemented an alternative way to traverse a BVH: I added traverse_iterator(), which returns an Iterator instead of a Vec, so the calling function can iterate over hit shapes without any heap allocations.
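To make the difference concrete on the calling side, here is a small usage sketch. The Sphere type and its trait impls are just an illustration following the crate's documented Bounded/BHShape pattern, and the exact signatures may differ slightly between versions:

```rust
use bvh::aabb::{Bounded, AABB};
use bvh::bounding_hierarchy::BHShape;
use bvh::bvh::BVH;
use bvh::nalgebra::{Point3, Vector3};
use bvh::ray::Ray;

// Minimal shape type used only for this illustration.
struct Sphere {
    center: Point3<f32>,
    radius: f32,
    node_index: usize,
}

impl Bounded for Sphere {
    fn aabb(&self) -> AABB {
        let half = Vector3::new(self.radius, self.radius, self.radius);
        AABB::with_bounds(self.center - half, self.center + half)
    }
}

impl BHShape for Sphere {
    fn set_bh_node_index(&mut self, index: usize) {
        self.node_index = index;
    }
    fn bh_node_index(&self) -> usize {
        self.node_index
    }
}

fn main() {
    let mut shapes: Vec<Sphere> = (0..1000)
        .map(|i| Sphere {
            center: Point3::new(i as f32, 0.0, 0.0),
            radius: 0.5,
            node_index: 0,
        })
        .collect();
    let bvh = BVH::build(&mut shapes);
    let ray = Ray::new(Point3::new(-10.0, 0.0, 0.0), Vector3::new(1.0, 0.0, 0.0));

    // Existing API: allocates and returns a Vec of candidate shapes on every call.
    let hits_vec = bvh.traverse(&ray, &shapes);

    // New API from this PR: yields candidates lazily, with no per-call heap allocation.
    let hits_iter_count = bvh.traverse_iterator(&ray, &shapes).count();

    println!("vec: {}, iter: {}", hits_vec.len(), hits_iter_count);
}
```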

The code is straightforward, but I'm very new to Rust, so please critique it. I also couldn't figure out how to wrangle all the traits to make iterators part of BoundingHierarchy, so it's only implemented on BVH for now.
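For reviewers who want the gist without reading the diff: the core idea is to keep a small, fixed-size index stack inside the iterator and push/pop nodes as next() is called, instead of collecting candidates into a Vec. Below is a standalone, simplified sketch of that idea with its own toy node type; it is not the actual code in src/bvh/iter.rs, which also performs the ray/AABB tests:

```rust
const STACK_SIZE: usize = 64; // deep enough for any realistic BVH

// Toy node representation for this sketch only; the real BVH nodes also
// carry child AABBs, shape counts, and so on.
#[derive(Clone, Copy)]
enum Node {
    Branch { left: usize, right: usize },
    Leaf { shape: usize },
}

struct TraverseIterator<'a> {
    nodes: &'a [Node],
    // Fixed-size stack of node indices still to visit: no heap allocation.
    stack: [usize; STACK_SIZE],
    stack_len: usize,
}

impl<'a> TraverseIterator<'a> {
    fn new(nodes: &'a [Node]) -> Self {
        let mut it = TraverseIterator {
            nodes,
            stack: [0; STACK_SIZE],
            stack_len: 0,
        };
        if !nodes.is_empty() {
            it.push(0); // start at the root
        }
        it
    }

    fn push(&mut self, index: usize) {
        self.stack[self.stack_len] = index;
        self.stack_len += 1;
    }

    fn pop(&mut self) -> Option<usize> {
        if self.stack_len == 0 {
            return None;
        }
        self.stack_len -= 1;
        Some(self.stack[self.stack_len])
    }
}

impl<'a> Iterator for TraverseIterator<'a> {
    // Yields indices of (potentially) hit shapes.
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        while let Some(index) = self.pop() {
            match self.nodes[index] {
                Node::Branch { left, right } => {
                    // The real iterator tests the ray against each child's AABB
                    // here and only pushes children that are actually hit.
                    self.push(left);
                    self.push(right);
                }
                Node::Leaf { shape } => return Some(shape),
            }
        }
        None
    }
}

fn main() {
    // Tiny tree: a root branch with two leaves.
    let nodes = [
        Node::Branch { left: 1, right: 2 },
        Node::Leaf { shape: 0 },
        Node::Leaf { shape: 1 },
    ];
    let hits: Vec<usize> = TraverseIterator::new(&nodes).collect();
    println!("{:?}", hits); // prints [1, 0]; the order follows the stack traversal
}
```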

Benchmark results on a Ryzen 3600x of intersecting sponza with 128 rays per iteration:

test bvh::iter::bench::bench_intersect_128rays_sponza_vec                ... bench:     174,705 ns/iter (+/- 5,857)
test bvh::iter::bench::bench_intersect_128rays_sponza_iter               ... bench:     150,897 ns/iter (+/- 6,722)

The perf gain is even larger in my hobby project, where throughput goes from ~15 million rays/sec to ~20 million rays/sec. That scene is much simpler and aggressively multithreaded, so perhaps the allocation overhead hits harder there.

svenstaro commented 3 years ago

Oh, this is great! The implementation looks very clean at first glance. Let me have a deeper look.

mdesmedt commented 3 years ago

Thanks. I just noticed I hadn't run cargo fmt on the code, so I've now done that on my branch.

mdesmedt commented 3 years ago

Thanks for the suggestions by the way. I'll make those changes soon, and I'll attempt to make this "clippy" happy too :)

svenstaro commented 3 years ago

> Thanks for the suggestions by the way. I'll make those changes soon, and I'll attempt to make this "clippy" happy too :)

Clippy is actually pretty cool. I hope you'll come around to appreciating its nagging. :P

codecov-commenter commented 3 years ago

Codecov Report

Merging #61 (5049326) into master (389e531) will increase coverage by 0.44%. The diff coverage is 87.50%.


@@            Coverage Diff             @@
##           master      #61      +/-   ##
==========================================
+ Coverage   79.39%   79.84%   +0.44%     
==========================================
  Files           8        9       +1     
  Lines        1427     1523      +96     
==========================================
+ Hits         1133     1216      +83     
- Misses        294      307      +13     
Impacted Files        Coverage Δ
src/bvh/iter.rs       87.23% <87.23%> (ø)
src/bvh/bvh_impl.rs   61.99% <100.00%> (+0.20%)
src/ray.rs            92.91% <0.00%> (-0.79%)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.

svenstaro commented 3 years ago

I just ran the code on my machine and I'm barely seeing a difference in performance:

test bvh::iter::bench::bench_intersect_128rays_sponza_iter               ... bench:     154,647 ns/iter (+/- 2,044)
test bvh::iter::bench::bench_intersect_128rays_sponza_vec                ... bench:     156,826 ns/iter (+/- 5,299)

Same numbers even with RUSTFLAGS="-Ctarget-cpu=native" cargo bench --features bench.

I've got a Ryzen Threadripper 3970X so our CPUs are very similar in architecture. I'm wondering why I'm not seeing a greater speedup.

mdesmedt commented 3 years ago

That's very interesting! Are you on Linux, perhaps? I wonder if the OS allocator is just much faster on Linux compared to my two platforms? I've tested on three machines: a Windows 10 desktop with a Ryzen 3600X, a Windows 10 desktop with an i7-5820K, and a MacBook Air with a quad-core i5.

The results of cargo +nightly bench --features bench:

Win10 3600x:

test bvh::iter::bench::bench_intersect_128rays_sponza_vec                ... bench:     193,437 ns/iter (+/- 2,947)
test bvh::iter::bench::bench_intersect_128rays_sponza_iter               ... bench:     154,602 ns/iter (+/- 6,441)

Win10 i7-5820K:

test bvh::iter::bench::bench_intersect_128rays_sponza_vec                ... bench:     260,601 ns/iter (+/- 203,922)
test bvh::iter::bench::bench_intersect_128rays_sponza_iter               ... bench:     234,075 ns/iter (+/- 203,418)

MacBook Air quad-core i5:

test bvh::iter::bench::bench_intersect_128rays_sponza_vec                ... bench:     221,135 ns/iter (+/- 23,602)
test bvh::iter::bench::bench_intersect_128rays_sponza_iter               ... bench:     193,476 ns/iter (+/- 22,844)

If you are on Windows and not seeing the speedup, that would be interesting to root-cause. Also, bench here is single-threaded, which (depending on what underlies heap alloc/free in Rust) is the best case for allocation overhead. I will add multithreaded benchmarks to run this comparison too; the sketch below shows the kind of thing I have in mind.
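A rough sketch of such a parallel comparison using rayon (the generic bounds and function shapes are my own illustration, not code from this branch; they would be wired into whatever scene setup the bench already uses):

```rust
use bvh::aabb::Bounded;
use bvh::bounding_hierarchy::BHShape;
use bvh::bvh::BVH;
use bvh::ray::Ray;
use rayon::prelude::*;

// Traverse many rays in parallel with the Vec-returning API: every ray on
// every thread allocates and frees a Vec, so the allocator sees contention.
fn par_count_vec<S: Bounded + BHShape + Sync>(bvh: &BVH, rays: &[Ray], shapes: &[S]) -> usize {
    rays.par_iter()
        .map(|ray| bvh.traverse(ray, shapes).len())
        .sum()
}

// Same traversal with the iterator API: no per-ray heap allocation at all.
fn par_count_iter<S: Bounded + BHShape + Sync>(bvh: &BVH, rays: &[Ray], shapes: &[S]) -> usize {
    rays.par_iter()
        .map(|ray| bvh.traverse_iterator(ray, shapes).count())
        .sum()
}
```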

svenstaro commented 3 years ago

Yeah, I'm on Linux. I just kinda assumed you were, too. :)

bench is always single-threaded on purpose, so that benchmarks don't influence each other by stealing resources or by putting too much thermal stress on the system. So I'd not bother getting parallel benchmarks going.

svenstaro commented 3 years ago

FYI I'd even consider making you a contributor if you'd like to work on this in a more involved capacity. :)

mdesmedt commented 3 years ago

Great! I'm happy to contribute a bit more to both the iterator and bvh. My Rust knowledge is still quite limited, but this seems like a good way to grow it. I'm particularly interested in anything perf-related (if that wasn't obvious).

Regarding bench, I understand, but I'd still be curious if you could try traverse_vec/traverse_iter in parallel on your machine at some point. Code like this can of course behave very differently in a single-threaded versus a multithreaded environment, so I wonder whether you'd see a bigger perf difference on Linux once there's more contention around the heap. If Linux can pull this off without a bigger delta, I'd be really impressed 😄

svenstaro commented 3 years ago

You might wanna give Linux a try then and see how things perform for you. :)