Closed — mdesmedt closed this 3 years ago
Oh, this is great! The implementation looks very clean at a first glance. Let me have a deeper look.
Thanks. I just noticed I didn't run cargo fmt on the code, so I just did that on my branch.
Thanks for the suggestions by the way. I'll make those changes soon, and I'll attempt to make this "clippy" happy too :)
Clippy is actually pretty cool. I hope you'll come around to appreciate its nagging. :P
Merging #61 (5049326) into master (389e531) will increase coverage by 0.44%. The diff coverage is 87.50%.
@@            Coverage Diff             @@
##           master      #61      +/-   ##
==========================================
+ Coverage   79.39%   79.84%   +0.44%
==========================================
  Files           8        9       +1
  Lines        1427     1523      +96
==========================================
+ Hits         1133     1216      +83
- Misses        294      307      +13

Impacted Files      | Coverage Δ
--------------------|---------------------------
src/bvh/iter.rs     | 87.23% <87.23%> (ø)
src/bvh/bvh_impl.rs | 61.99% <100.00%> (+0.20%) ⬆️
src/ray.rs          | 92.91% <0.00%> (-0.79%) ⬇️

Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 389e531...5049326.
I just ran the code on my machine and I'm barely seeing a difference in performance:
test bvh::iter::bench::bench_intersect_128rays_sponza_iter ... bench: 154,647 ns/iter (+/- 2,044)
test bvh::iter::bench::bench_intersect_128rays_sponza_vec ... bench: 156,826 ns/iter (+/- 5,299)
Same numbers even with RUSTFLAGS="-Ctarget-cpu=native" cargo bench --features bench.
I've got a Ryzen Threadripper 3970X so our CPUs are very similar in architecture. I'm wondering why I'm not seeing a greater speedup.
That's very interesting! Are you on Linux perhaps? I wonder if the OS allocator is just way faster on Linux compared to my two platforms. I've tested on 3 machines I use: Windows 10 3600x, Windows 10 i7-5820K, and a MacBook Air Quad i5.
The results of cargo +nightly bench --features bench:
Win10 3600x:
test bvh::iter::bench::bench_intersect_128rays_sponza_vec ... bench: 193,437 ns/iter (+/- 2,947)
test bvh::iter::bench::bench_intersect_128rays_sponza_iter ... bench: 154,602 ns/iter (+/- 6,441)
Win10 i7-5820K:
test bvh::iter::bench::bench_intersect_128rays_sponza_vec ... bench: 260,601 ns/iter (+/- 203,922)
test bvh::iter::bench::bench_intersect_128rays_sponza_iter ... bench: 234,075 ns/iter (+/- 203,418)
MacBook Air Quad-Core i5:
test bvh::iter::bench::bench_intersect_128rays_sponza_vec ... bench: 221,135 ns/iter (+/- 23,602)
test bvh::iter::bench::bench_intersect_128rays_sponza_iter ... bench: 193,476 ns/iter (+/- 22,844)
If you are on Windows and you are not seeing the speedup, that would be interesting to root-cause... Also, bench here is single-threaded, which is (depending on what underlies heap alloc/free in Rust) the best case for allocation overhead. I will add multithreaded benchmarks to run this comparison too.
Yeah, I'm on Linux. I just kinda assumed you were, too. :)
bench is always single-threaded on purpose, so that benchmarks do not influence each other by stealing resources or thermally throttling the system, so I'd not bother getting parallel benchmarks going.
FYI I'd even consider making you a contributor if you'd like to work on this in a more involved capacity. :)
Great! I'm happy to contribute a bit more to both the iterator and bvh. My Rust knowledge is still quite limited, but this seems like a good way to grow it. I'm particularly interested in anything perf-related (if that wasn't obvious).
Regarding bench, I understand it, but I'd still be curious whether you could try traverse_vec/traverse_iter in parallel on your machine at some point. Code like this can of course behave very differently in a single- or multi-threaded environment, so I wonder if you'd see a bigger perf difference on Linux when there's more contention around the heap. If Linux can pull this off without a bigger delta, I'd be really impressed 😄
You might wanna give Linux a try then and see how things perform for you. :)
While profiling my simple Rust implementation of "Ray Tracing in One Weekend", I noticed that a considerable CPU hotspot was heap alloc/free around BVH.traverse(). This was because traverse() returns a new Vec for every call.
I implemented an alternative approach to traverse a BVH: traverse_iterator(), which returns an Iterator instead of a Vec, so the calling function can iterate over hit shapes without any heap allocations.
The code is straightforward, but I'm very new to Rust, so please critique it. I also couldn't figure out how to wrangle all the traits to make iterators part of BoundingHierarchy, so it's only impl BVH for now.
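Roughly, the idea behind such an iterator can be sketched like this (a simplified toy, not the crate's actual API: Node, the hit flag, and the fixed 64-entry stack stand in for the real AABB nodes and ray-box intersection test). The iterator owns an explicit stack, so next() yields each hit shape without touching the heap:

```rust
// Stand-in node: a leaf stores a shape index; an interior node stores
// child indices. `hit` fakes the real AABB-vs-ray test.
#[derive(Clone, Copy)]
struct Node {
    left: Option<usize>,
    right: Option<usize>,
    shape: Option<usize>,
    hit: bool,
}

struct TraverseIter<'a> {
    nodes: &'a [Node],
    stack: [usize; 64], // fixed-size stack held inline: no heap allocation
    len: usize,
}

impl<'a> TraverseIter<'a> {
    fn new(nodes: &'a [Node]) -> Self {
        let mut it = TraverseIter { nodes, stack: [0; 64], len: 0 };
        if !nodes.is_empty() {
            it.len = 1; // push the root (index 0)
        }
        it
    }
}

impl<'a> Iterator for TraverseIter<'a> {
    type Item = usize; // index of an intersected shape

    fn next(&mut self) -> Option<usize> {
        while self.len > 0 {
            self.len -= 1;
            let node = self.nodes[self.stack[self.len]];
            if !node.hit {
                continue; // ray misses this node's bounds: skip the subtree
            }
            if let Some(shape) = node.shape {
                return Some(shape); // leaf: yield the hit shape
            }
            // Interior node: push children to visit later.
            for child in [node.left, node.right].into_iter().flatten() {
                self.stack[self.len] = child;
                self.len += 1;
            }
        }
        None
    }
}
```

The caller then writes `for shape in TraverseIter::new(&nodes) { ... }` and no Vec is ever built, which is exactly the allocation the Vec-returning traversal pays on every call.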
Benchmark results on a Ryzen 3600x of intersecting Sponza with 128 rays per iteration:
The perf gain is even larger in my hobby project, with throughput going from ~15 million rays/sec to ~20 million rays/sec. It's a much simpler scene and aggressively multithreaded, so perhaps allocation overhead hits harder there.