paboyle / Grid

Data parallel C++ mathematical object library
GNU General Public License v2.0

Changing thread block order and adding launch_bounds #384

Closed by jdmaia 2 years ago

jdmaia commented 2 years ago

This brings performance up to roughly 1.75 TF/s per GCD on the MI250X. Grid is a tricky code to compile, so specifying launch_bounds helps improve occupancy in LambdaApply for Benchmark_ITT. Running with --accelerator-threads <= 8 should invoke LambdaApply64; anything else invokes the regular one.
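As a rough illustration of the dispatch described above, the sketch below picks between a launch-bounded kernel and an unbounded one based on the total threads per block. It is not Grid's actual source: every name (apply_bounded64, apply_regular, choose_variant, blocks_needed) is hypothetical, and it assumes the block is laid out as SIMD lanes times accelerator threads with 8 lanes, so that 8 accelerator threads gives exactly 64 threads per block. The kernels only compile under hipcc; the host-side selection logic is plain C++.

```cpp
#include <cassert>

// Hypothetical sketch, not Grid's code. Kernels are only built under hipcc.
#if defined(__HIPCC__)
#include <hip/hip_runtime.h>

// __launch_bounds__(64) promises the compiler that this kernel is never
// launched with more than 64 threads per block, which lets it budget
// registers for that block size and can improve occupancy.
template <typename Lambda>
__global__ void __launch_bounds__(64)
apply_bounded64(unsigned long num1, unsigned long num2, Lambda lam) {
  unsigned long lane = threadIdx.x;                          // SIMD lane
  unsigned long x = threadIdx.y + blockDim.y * blockIdx.x;   // outer index
  unsigned long y = blockIdx.y;                              // second index
  if (x < num1 && y < num2) lam(x, y, lane);
}

// Same body without the bound, for larger blocks.
template <typename Lambda>
__global__ void apply_regular(unsigned long num1, unsigned long num2, Lambda lam) {
  unsigned long lane = threadIdx.x;
  unsigned long x = threadIdx.y + blockDim.y * blockIdx.x;
  unsigned long y = blockIdx.y;
  if (x < num1 && y < num2) lam(x, y, lane);
}
#endif

enum class KernelVariant { Bounded64, Regular };

// Assumption: threads per block = lanes * accelerator_threads.
// The bounded kernel is only legal when that product is at most 64.
inline KernelVariant choose_variant(unsigned lanes, unsigned accelerator_threads) {
  return (lanes * accelerator_threads <= 64u) ? KernelVariant::Bounded64
                                              : KernelVariant::Regular;
}

// Number of blocks needed to cover `num` outer iterations, rounding up.
inline unsigned blocks_needed(unsigned num, unsigned accelerator_threads) {
  assert(accelerator_threads > 0);
  return (num + accelerator_threads - 1) / accelerator_threads;
}
```

Under these assumptions, choose_variant(8, 8) selects the bounded kernel and choose_variant(8, 16) falls back to the regular one. Launching a kernel with more threads per block than its launch bound allows is an error, which is why the check has to happen on the host before the launch.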

Appreciate help testing this! :)

jdmaia commented 2 years ago

@paboyle looks like some tests failed even though all of the code changes should be inside the #ifdef GRID_HIP block. Do you have any suggestions?

paboyle commented 2 years ago

Looks good, similar to a transpose I did for Nvidia. Thanks!

paboyle commented 2 years ago

Hi Julio,

I'm getting a little lower on one GCD. What run flags and compile flags are you using?

jdmaia commented 2 years ago

@paboyle I'm configuring with the following flags:

../configure --enable-zmobius=no --enable-gparity=no --enable-fermion-reps=no --enable-unified=no --enable-accelerator=hip --enable-comms=none --enable-simd=GPU --enable-gen-simd-width=64 CXX=hipcc MPICXX=mpicxx

And running with:

./Benchmark_ITT --accelerator-threads 8