intel / compute-runtime

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver
MIT License

ocl: incorrect atomics behavior on Celeron/Atom with HD Graphics #405

Open hfp opened 3 years ago

hfp commented 3 years ago

32-bit atomic updates to global memory behave non-atomically on Intel HD Graphics integrated into Celeron/Atom platforms. Specifically, this was observed on an Intel(R) Celeron(R) CPU J3455 @ 1.50GHz (lscpu) with Intel(R) Graphics [0x5a85] (clinfo). It is likely reproducible on similar Celeron/Atom based CPUs with integrated HD Graphics, and is perhaps caused by a misconfiguration or missing enablement of features in the driver stack (either the i915 KMD or further up the stack, i.e., the compute runtime).

How to reproduce:

cd ${HOME}
git clone https://github.com/hfp/libxsmm.git
cd libxsmm
git checkout 885830e65da003fc4c72113239080a7c069647b5
make -j

cd ${HOME}
git clone https://github.com/hfp/dbcsr.git
cd dbcsr
git checkout 0684ae7c14c43d842059f1cfb9b5646594fa9740

cd src/acc
echo "edit acc_bench_smm.c:22 and change 'double' to 'float'"
cd opencl
make

../acc_bench_smm

The console output of the following command looks like:

../acc_bench_smm 3 30000 23 23 23 1875 18750 18750
typename (id=1): float
copy-in: 67.8 ms 2.4 GB/s
transpose: 49.1 ms 13.8 GFLOPS/s
device: 44.7 ms 15.2 GFLOPS/s
host: 34.3 ms 19.8 GFLOPS/s
max.error: abs=849.74 rel=1

In the above output (max.error: abs=849.74 rel=1), the error appears to be due to data races or non-atomic updates. In general, GEN9 devices integrated into Core processors work just fine (the atomic flow behaves atomically). As on Core, the OpenCL platform on Celeron/Atom advertises the atomics extensions used by the reproducer, namely cl_khr_global_int32_base_atomics and cl_khr_global_int32_extended_atomics.
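For reference, checking that a device advertises these extensions can be sketched as follows. This is only an illustration in plain C: it assumes the extension list is the space-separated string that clGetDeviceInfo returns for CL_DEVICE_EXTENSIONS, and the helper name has_extension is hypothetical (it is not part of the reproducer or of OpenCL).

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: returns true if `name` occurs as a whole token in the
 * space-separated extension list `ext_list` (the format of the string
 * reported for CL_DEVICE_EXTENSIONS). Substring-only hits are rejected. */
static bool has_extension(const char *ext_list, const char *name) {
    size_t len = strlen(name);
    if (len == 0) return false;
    const char *p = ext_list;
    while ((p = strstr(p, name)) != NULL) {
        /* A match counts only if it is delimited by spaces or string ends. */
        bool starts = (p == ext_list) || (p[-1] == ' ');
        bool ends = (p[len] == '\0') || (p[len] == ' ');
        if (starts && ends) return true;
        p += 1; /* keep searching after this partial hit */
    }
    return false;
}
```

On the affected device, such a check passes for both extensions, which is why the reproducer takes the atomic code path there.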

The reproducer implements atomic FP32 updates using the usual flow based on cmpxchg or xchg. The implementation can be selected using OPENCL_LIBSMM_SMM_ATOMICS=cmpxchg (default on GEN9), OPENCL_LIBSMM_SMM_ATOMICS=xchg, or OPENCL_LIBSMM_SMM_ATOMICS=0. The last setting replaces the atomic flow with a plain FP32 add ("+=") and is meant for observing/studying performance differences. However, on Celeron/Atom based GEN9, the accumulated error due to data races is similar between the supposedly atomic flow and the non-atomic flow.
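To illustrate what "the usual flow" means, here is a minimal host-side sketch of both variants using C11 atomics. It is an analogue, not the reproducer's kernel code: the actual kernels are OpenCL C and operate on global memory with 32-bit integer atomics, and all function names below are hypothetical. Both variants emulate an FP32 add by reinterpreting the float as a 32-bit integer.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret float <-> 32-bit integer without violating aliasing rules. */
static uint32_t f32_to_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_to_f32(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* cmpxchg flow: read the old value, compute old + value, and publish the
 * result only if the location still holds the old bits; otherwise retry. */
static void atomic_add_f32_cmpxchg(_Atomic uint32_t *dst, float value) {
    uint32_t old_bits = atomic_load(dst);
    uint32_t new_bits;
    do {
        new_bits = f32_to_bits(bits_to_f32(old_bits) + value);
        /* On failure, old_bits is refreshed with the current contents. */
    } while (!atomic_compare_exchange_weak(dst, &old_bits, new_bits));
}

/* xchg flow: swap the accumulator with 0.0f, add the pending value, and swap
 * the sum back. If another update landed in between, the second swap returns
 * a non-zero value, which becomes the new pending contribution. */
static void atomic_add_f32_xchg(_Atomic uint32_t *dst, float value) {
    float pending = value;
    while (pending != 0.0f) {
        float taken = bits_to_f32(atomic_exchange(dst, f32_to_bits(0.0f)));
        pending = bits_to_f32(atomic_exchange(dst, f32_to_bits(taken + pending)));
    }
}
```

Either flow is only correct if the underlying 32-bit compare-exchange or exchange is truly atomic. The behavior reported above is what one would expect if that guarantee does not hold on the device.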

AdamCetnerowski commented 3 years ago

We have reproduced the issue and placed the bug in our debug queue, but do not have an ETA for a fix.

hfp commented 3 years ago

Thank you very much!

eero-t commented 2 years ago

> We have reproduced the issue and placed the bug in our debug queue, but do not have an ETA for a fix.

@AdamCetnerowski Over a year has passed. Any updates?