The transfer bandwidth test includes enqueueWriteBuffer and enqueueReadBuffer scenarios, and both use the blocking versions of the OpenCL calls. On Intel, it seems that some drivers might use the CPU instead of the GPU to perform a blocking transfer, which would make the measured bandwidth depend on how busy the CPU is.
A possible solution is to add non-blocking versions of enqueueWriteBuffer and enqueueReadBuffer to the transfer bandwidth test.
As verification, here is the output from a run on an idle machine:
clpeak.exe --transfer-bandwidth -d 1
Platform: Intel(R) OpenCL
Device: Intel(R) HD Graphics 520
Driver version : 26.20.100.7212 (Win64)
Compute units : 24
Clock frequency : 1000 MHz
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 4.65
enqueueReadBuffer : 4.62
enqueueWriteBuffer non-blocking : 4.85
enqueueReadBuffer non-blocking : 4.78
enqueueMapBuffer(for read) : 252677.97
memcpy from mapped ptr : 4.63
enqueueUnmap(after write) : 338588.47
memcpy to mapped ptr : 4.18
and here is the output from a run while an external app was fully utilizing the CPU:
clpeak.exe --transfer-bandwidth -d 1
Platform: Intel(R) OpenCL
Device: Intel(R) HD Graphics 520
Driver version : 26.20.100.7212 (Win64)
Compute units : 24
Clock frequency : 1000 MHz
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 3.03
enqueueReadBuffer : 3.10
enqueueWriteBuffer non-blocking : 4.80
enqueueReadBuffer non-blocking : 4.74
enqueueMapBuffer(for read) : 302311.13
memcpy from mapped ptr : 3.13
enqueueUnmap(after write) : 513012.84
memcpy to mapped ptr : 3.33
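As you can see, non-blocking transfers perform nearly the same in both runs (4.85 vs. 4.80 and 4.78 vs. 4.74 GBPS), while blocking transfers drop from about 4.6 GBPS to about 3.0-3.1 GBPS under CPU load.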