jbikker / tinybvh

Single-header BVH construction and traversal library.
MIT License
533 stars 25 forks source link

add a new AMD gpu name to tiny_ocl.h #25

Closed bekorn closed 1 week ago

bekorn commented 1 week ago

AMD Radeon 6700 XT returns "gfx1031" for CL_DEVICE_NAME.

I also tried CL_DEVICE_VENDOR and it returned "Advanced Micro Devices, Inc.", maybe this could be easier to check.

jbikker commented 1 week ago

Perhaps the CL_DEVICE_VENDOR string can be added as a safety net? Unless that is reported for all models.. Sadly I don't have a lot of devices to test on, so I'm grateful for your PR!

bekorn commented 1 week ago

Before fixing this, ISAMD was not defined on traverse.cl > SIMD_AABBTEST was not defined > AABB test used _native_fma > OCL compiler gave the error: "use of undeclared identifier '_native_fma'" > failed speedtest. _native_fma not being recognised seemed like something that shouldn't happen however, I have no experience with OCL unfortunately, just reporting

jbikker commented 1 week ago

So with your patch properly compiles on AMD? Question: could you try out which combination of 'SIMD_AABBTEST' and 'USE_VLOAD_VSTORE' is best for AMD? On Intel, performance is best when I don't use the vload/vstore commands but do use the 'SIMD' aabb test code; on NVIDIA it's the other way round. For AMD I am just guessing.

bekorn commented 1 week ago

If I don't use SIMD_AABBTEST, it doesn't compile because error: "use of undeclared identifier '_native_fma'". Full log:

C:\Users\Berk\AppData\Local\Temp\comgr-56503e\input\CompileSource:385:22: error: use of undeclared identifier '_native_fma'
  385 |                                                 float tminx0 = _native_fma( lox4.x, idirx, origx ), tminx1 = _native_fma( lox4.y, idirx, origx );
      |                                                                ^
C:\Users\Berk\AppData\Local\Temp\comgr-56503e\input\CompileSource:385:68: error: use of undeclared identifier '_native_fma'
  385 |                                                 float tminx0 = _native_fma( lox4.x, idirx, origx ), tminx1 = _native_fma( lox4.y, idirx, origx );
      |                                                                                                              ^
C:\Users\Berk\AppData\Local\Temp\comgr-56503e\input\CompileSource:386:22: error: use of undeclared identifier '_native_fma'
  386 |                                                 float tminy0 = _native_fma( loy4.x, idiry, origy ), tminy1 = _native_fma( loy4.y, idiry, origy );
      |                                                                ^
C:\Users\Berk\AppData\Local\Temp\comgr-56503e\input\CompileSource:386:68: error: use of undeclared identifier '_native_fma'
  386 |                                                 float tminy0 = _native_fma( loy4.x, idiry, origy ), tminy1 = _native_fma( loy4.y, idiry, origy );
      |                                                                                                              ^
C:\Users\Berk\AppData\Local\Temp\comgr-56503e\input\CompileSource:387:22: error: use of undeclared identifier '_native_fma'
  387 |                                                 float tminz0 = _native_fma( loz4.x, idirz, origz ), tminz1 = _native_fma( loz4.y, idirz, origz );
      |                                                                ^
C:\Users\Berk\AppData\Local\Temp\comgr-56503e\input\CompileSource:387:68: error: use of undeclared identifier '_native_fma'
  387 |                                                 float tminz0 = _nativ

With USE_VLOAD_VSTORE:

tiny_bvh version 0.9.5 performance statistics (MSVC 1941 build)
running on AMD Ryzen 5 7600 6-Core Processor
----------------------------------------------------------------
available OpenCL platforms:
#0: AMD Accelerated Parallel Processing
OpenCL device: AMD Accelerated Parallel Processing
Device # 0, (name: gfx1031) (platform: OpenCL 2.0 AMD-APP (3617.0)) (vendor: Advanced Micro Devices, Inc.)
Local memory size: 64KB
hardware detected: AMD.
----------------------------------------------------------------
Loading triangle data (262267 tris).
BVH construction speed
warming caches...
BVH traversal speed
- GPU, coherent,   alt 2-way layout,  ocl:      6.7ms for   7.68M rays => 1139.37MRay/s
- GPU, coherent,   alt 4-way layout,  ocl:      6.2ms for   7.68M rays => 1231.07MRay/s
- GPU, coherent,   CWBVH layout,      ocl:      7.3ms for   7.68M rays => 1047.86MRay/s
all done.

Without USE_VLOAD_VSTORE:

- GPU, coherent,   alt 2-way layout,  ocl:      6.8ms for   7.68M rays => 1133.41MRay/s
- GPU, coherent,   alt 4-way layout,  ocl:      6.2ms for   7.68M rays => 1231.23MRay/s
- GPU, coherent,   CWBVH layout,      ocl:      7.8ms for   7.68M rays => 979.40MRay/s
bekorn commented 1 week ago

With your new commits:

#define USE_VLOAD_VSTORE
//#define SIMD_AABBTEST
- GPU, coherent,   CWBVH layout,      ocl:      7.3ms for   7.68M rays => 1047.28MRay/s
//#define USE_VLOAD_VSTORE
//#define SIMD_AABBTEST
- GPU, coherent,   CWBVH layout,      ocl:      7.9ms for   7.68M rays => 976.71MRay/s
jbikker commented 1 week ago

Thank you for testing that. So the optimal setting is vload on, simd off for AMD, yielding about 1Gray/s.

On Wed, Nov 20, 2024 at 6:01 PM bekorn @.***> wrote:

With your new commits:

define USE_VLOAD_VSTORE

//#define SIMD_AABBTEST

  • GPU, coherent, CWBVH layout, ocl: 7.3ms for 7.68M rays => 1047.28MRay/s

//#define USE_VLOAD_VSTORE //#define SIMD_AABBTEST

  • GPU, coherent, CWBVH layout, ocl: 7.9ms for 7.68M rays => 976.71MRay/s

— Reply to this email directly, view it on GitHub https://github.com/jbikker/tinybvh/pull/25#issuecomment-2489120743, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFH3PEHM5XZYW73PIGSGCU32BS55BAVCNFSM6AAAAABSEWRMVWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIOBZGEZDANZUGM . You are receiving this because you modified the open/close state.Message ID: @.***>

bekorn commented 1 week ago

I thnik there is a typo here

inline float _native_fma( const float a, const float b, const float c )
{
#ifdef _ISNVIDIA <= should be ISNVIDIA
bekorn commented 1 week ago

Yes, seems like it. Radeon 6700 XT is equivalent to 4060 Ti, fyi

Thank you for testing that. So the optimal setting is vload on, simd off for AMD, yielding about 1Gray/s. On Wed, Nov 20, 2024 at 6:01 PM bekorn @.> wrote: With your new commits: #define USE_VLOAD_VSTORE //#define SIMD_AABBTEST - GPU, coherent, CWBVH layout, ocl: 7.3ms for 7.68M rays => 1047.28MRay/s //#define USE_VLOAD_VSTORE //#define SIMD_AABBTEST - GPU, coherent, CWBVH layout, ocl: 7.9ms for 7.68M rays => 976.71MRay/s — Reply to this email directly, view it on GitHub <#25 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFH3PEHM5XZYW73PIGSGCU32BS55BAVCNFSM6AAAAABSEWRMVWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIOBZGEZDANZUGM . You are receiving this because you modified the open/close state.Message ID: @.>

jbikker commented 1 week ago

Indeed, thanks.

On Wed, Nov 20, 2024 at 6:04 PM bekorn @.***> wrote:

I thnik there is a typo here

inline float _native_fma( const float a, const float b, const float c ) {

ifdef _ISNVIDIA <= should be ISNVIDIA

— Reply to this email directly, view it on GitHub https://github.com/jbikker/tinybvh/pull/25#issuecomment-2489128774, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFH3PEBNSXMK7C6JL6CGSPT2BS6K5AVCNFSM6AAAAABSEWRMVWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIOBZGEZDQNZXGQ . You are receiving this because you modified the open/close state.Message ID: @.***>