jdmaia closed 2 years ago
@paboyle looks like some tests failed even though all of the code changes should be inside the #ifdef GRID_HIP block. Do you have any suggestions?
Looks good, similar to a transpose I did for Nvidia. Thanks!
Hi Julio,
I'm getting a little lower on one GCD. What run flags and compile flags are you using?
@paboyle I'm configuring with the following flags:
../configure --enable-zmobius=no --enable-gparity=no --enable-fermion-reps=no --enable-unified=no --enable-accelerator=hip --enable-comms=none --enable-simd=GPU --enable-gen-simd-width=64 CXX=hipcc MPICXX=mpicxx
And running with:
./Benchmark_ITT --accelerator-threads 8
This brings performance up to roughly 1.75 TF/s per GCD on the MI250X. Grid is a tricky code to compile, so specifying launch_bounds helps improve occupancy on LambdaApply for Benchmark_ITT. Running with --accelerator-threads <= 8 should invoke LambdaApply64; anything else invokes the regular one.
Appreciate help testing this! :)
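
For anyone reading along, here is a rough sketch of the dispatch described above. It is not the code in this PR: `SketchApply`, `SketchApply64` and `select_kernel` are hypothetical names, and the bound of 64 threads per block is assumed from the kernel name `LambdaApply64`. The kernels only compile under hipcc; the selection helper is plain C++.

```cpp
#include <cassert>
#include <string>

#if defined(__HIPCC__)
#include <hip/hip_runtime.h>

// Generic variant: no launch-bounds hint, so the compiler has to budget
// registers for its default maximum block size.
template <typename Lambda>
__global__ void SketchApply(unsigned n, Lambda f) {
  unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) f(i);
}

// Bounded variant: __launch_bounds__(64) promises the compiler that the
// kernel is never launched with more than 64 threads per block, which
// gives it more freedom in register allocation (assumed bound).
template <typename Lambda>
__global__ void __launch_bounds__(64) SketchApply64(unsigned n, Lambda f) {
  unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) f(i);
}
#endif

// Hypothetical host-side helper mirroring the rule stated in the thread:
// --accelerator-threads <= 8 picks the bounded kernel, anything else the
// regular one. It returns the kernel name rather than launching it.
std::string select_kernel(int accelerator_threads) {
  assert(accelerator_threads > 0);
  return accelerator_threads <= 8 ? "LambdaApply64" : "LambdaApply";
}
```

With this rule, `select_kernel(8)` names the bounded kernel and `select_kernel(16)` the regular one. The bounded variant must never be launched with a block larger than its declared bound.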