According to the AMD developers who debugged the issue:
The test generates its kernels based on the device capabilities reported by the OpenCL (OCL) runtime. With the Hybrid stack (Orca), the OCL runtime reports 32 KB of local device memory; with the ROCm stack, it reports 64 KB.
The test uses half of the reported amount for a local array in the kernel, so the array is 16 KB under Orca but 32 KB under ROCm. ROCm therefore ends up with more LDS usage per work-group, hence lower wave occupancy and lower performance. The issue should be reported to devrel so that the test logic can be replaced.
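To make the arithmetic concrete, here is a rough sketch of how the local array size could limit occupancy. It is not the test's actual code: the function names are hypothetical, and the 64 KB of LDS per compute unit is an assumption about the hardware rather than a figure from the developers' comments. It also ignores other occupancy limits such as register usage.

```python
KIB = 1024

def local_array_bytes(reported_local_mem_bytes):
    """Size of the kernel's local array: half of what the runtime reports."""
    return reported_local_mem_bytes // 2

def workgroups_per_cu(lds_per_workgroup_bytes, lds_per_cu_bytes=64 * KIB):
    """Upper bound on resident work-groups per compute unit due to LDS alone.

    lds_per_cu_bytes defaults to 64 KB, which is an assumption, not a value
    taken from the original report.
    """
    if lds_per_workgroup_bytes <= 0:
        raise ValueError("LDS usage per work-group must be positive")
    return lds_per_cu_bytes // lds_per_workgroup_bytes

if __name__ == "__main__":
    # Local memory sizes reported by each stack, as stated in the comment above.
    for stack, reported in (("Hybrid (Orca)", 32 * KIB), ("ROCm", 64 * KIB)):
        lds = local_array_bytes(reported)
        print(f"{stack}: local array {lds // KIB} KB, "
              f"at most {workgroups_per_cu(lds)} work-groups per CU")
```

Under these assumptions, the Orca-sized kernel allows up to 4 work-groups per compute unit and the ROCm-sized kernel only 2, which is consistent with the lower occupancy described above.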
@vetter @Finomnis