Improve performance - Githubissues

jalberse commented 10 months ago

Working with the cornell.pbrt scene, we can compare where we're taking up more time relative to PBRT rendering the same scene.

The Triangle Intersection code in particular is hot, and takes up relatively more time than the PBRT implementation. The cross() and difference_of_products() calls in particular are expensive within that function.

I think that issues related to monomorphization/static dispatch should be a nonissue with my Tuple implementations - I would expect that to be inlined. I'd need to verify that though.

I am passing by reference to cross() and difference_of_products(). I think it may be better to pass by value instead; the memcpy should be cheap as tuples are small, and it may be more efficient as you don't need to follow the reference inside the tuple functions (cache efficiency?). PBRT doesn't pass by reference for vectors, at least, and I think it's a good idea.

I'll try switching to pass-by-value and do some benchmarking.

jalberse commented 10 months ago

There's an existing issue to pass tuples by value and not reference already, so if that's the issue we can resolve that too.

jalberse commented 10 months ago

Although this whole topic is getting extremely close to "I would expect the compiler to optimize this" anyways, so I might just be off base.

Switching to pass-by-value on the Tuple interfaces is also just more ergonomic though, so I think I'll still make the change and see.

jalberse commented 10 months ago

Another advantage of removing references is that many of the functions are generic; the caller can just implement the trait bounds for &T if they want to pass a reference to T.

jalberse commented 10 months ago

I did consider that it might be something with f32::mul_add() not being optimized correctly due to some build configuration; it should use the same LLVM intrinsic as PBRT's C++ implementation. But maybe I am not telling the compiler to compile for the architecture in the same way?

But replacing the difference_of_products() call with a naive implementation of ab - c d actually decreased the runtime (though, losing ETF properties), where I'd expect FMA to save time.

It's just an extremely hot piece of code so I think small optimization issues can pile up.

jalberse commented 10 months ago

Okay as I expected, taking by value and not reference doesn't really impact performance, compilers are smart. It's a better interface though.

jalberse commented 10 months ago

I'm leaning towards the issue being something with f32::mul_add() and the target architecture not taking advantage of it.

https://www.reddit.com/r/rust/comments/hvsrl5/scalar_code_is_faster_than_simd_in_rust/

This is discussing more in depth SIMD code than just calling fmaf but possibly related? Do we need to flag functions to allow inlining?

jalberse commented 10 months ago

Specifically, this:

While architectures like x86_64 or ARMv8 define the lowest-common denominator of instructions that all CPUs must support, many CPUs extend these with vector ([AVX](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions)), bitwise manipulation ([BMI](https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets)) and/or cryptographic ([AES](https://en.wikipedia.org/wiki/AES_instruction_set)) instruction sets

And the f32::mul_add() documentation: https://doc.rust-lang.org/std/primitive.f64.html#method.mul_add

Using mul_add may be more performant than an unfused multiply-add if the target architecture has a dedicated fma CPU instruction. However, this is not always true, and will be heavily dependant on designing algorithms with specific target hardware in mind.

jalberse commented 10 months ago

Performance is much worse relatively when working with killeroo scenes. Possibly in aggregate traversal? Do profiling

jalberse commented 9 months ago

Fixed, was an issue with BVH construction.

jalberse / shimmer

Improve performance #50