intel / compute-runtime

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver
MIT License

setting environment vars for oneapi makes gpu compute give incorrect output #691

Open Kuratius opened 10 months ago

Kuratius commented 10 months ago

I ran this example:

https://github.com/smistad/OpenCL-Getting-Started/

with environment variables set via

source /opt/intel/oneapi/setvars.sh (this is a script from Intel MKL; I ran it because I found a tutorial for testing OpenCL code that said it would help).

with the following modification: CL_DEVICE_TYPE_ALL in main.c changed to CL_DEVICE_TYPE_GPU

27 + 997 = 0
28 + 996 = 191839498
29 + 995 = 1703542117
30 + 994 = 1762132000
31 + 993 = 1818588270
32 + 992 = 1764635702
33 + 991 = 1845519459
34 + 990 = 26083801
35 + 989 = 0
36 + 988 = -1
37 + 987 = 2147483647

It seems that it always runs correctly for CL_DEVICE_TYPE_CPU, but it only runs correctly for CL_DEVICE_TYPE_GPU if these environment variables have not been set.

without the environment variables

clinfo -l
Platform #0: Intel(R) OpenCL Graphics
`-- Device #0: Intel(R) HD Graphics 520

and

27 + 997 = 1024
28 + 996 = 1024
29 + 995 = 1024
30 + 994 = 1024
31 + 993 = 1024
32 + 992 = 1024

with the environment variables:


clinfo -l
Platform #0: Intel(R) OpenCL
`-- Device #0: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
Platform #1: Intel(R) OpenCL Graphics
`-- Device #0: Intel(R) HD Graphics 520

Kuratius commented 10 months ago

There is an error-checking version of the same code. Its output seems to be related to the issue I am encountering:

ret_num_platforms=2
error: clGetDeviceIDs: CL_DEVICE_NOT_FOUND at 'main.c' line 63

JablonskiMateusz commented 10 months ago

Hi @Kuratius, could you please try to filter out the CPU device (e.g. at the ICD level) and check in an environment with the GPU only?

It works fine for me:


 $ clinfo -l
Platform #0: Intel(R) OpenCL Graphics
 `-- Device #0: Intel(R) HD Graphics 530

Kuratius commented 10 months ago

> Hi @Kuratius, could you please try to filter out the CPU device (e.g. at the ICD level) and check in an environment with the GPU only?

How am I supposed to do that? From what I understand setting CL_DEVICE_TYPE_GPU should already have this effect.

Is there some issue with having multiple OpenCL devices?

Do you need a more detailed clinfo log?

JablonskiMateusz commented 10 months ago

By default the app gets a single platform and then takes devices from that one platform based on type. In your case clinfo shows two platforms with one device each. You should therefore iterate over all platforms, select the first platform that has any GPU devices, and then take the first GPU device from that platform.

JablonskiMateusz commented 10 months ago

> > Hi @Kuratius, could you please try to filter out the CPU device (e.g. at the ICD level) and check in an environment with the GPU only?
>
> How am I supposed to do that? From what I understand setting CL_DEVICE_TYPE_GPU should already have this effect.

By default the OpenCL ICD loader looks in /etc/OpenCL/vendors/ for .icd files; each runtime places its file there to register itself. If you remove the CPU-related file from there, only the GPU platform will be visible in clinfo, and therefore also in the app.

Kuratius commented 10 months ago

/etc/OpenCL/vendors$ ls
intel64.icd intel.icd

I removed intel64.icd and set the environment variables, and it still shows the wrong behavior, and clinfo also still lists two devices. I will try removing the other one.

Now it only lists one device:

clinfo -l
Platform #0: Intel(R) OpenCL
`-- Device #0: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz

and it also still shows the wrong behavior.

Maybe the issue is that this code doesn't find a compatible GPU on the first platform it tries, then does nothing and just reads from uninitialized memory.

JablonskiMateusz commented 10 months ago

> /etc/OpenCL/vendors$ ls
> intel64.icd intel.icd
>
> I removed intel64.icd and set the environment variables, and it still shows the wrong behavior, and clinfo also still lists two devices. I will try removing the other one.
>
> Now it only lists one device:
>
> clinfo -l
> Platform #0: Intel(R) OpenCL
> `-- Device #0: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
>
> and it also still shows the wrong behavior.

Please check the second .icd file, so that only the GPU is exposed.

> Maybe the issue is that this code doesn't find a compatible GPU on the first platform it tries, then does nothing and just reads from uninitialized memory.

Please verify the return values from API calls, and don't move forward when the status does not indicate success.
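One way to follow this advice is a small status check after every call. The sketch below is illustrative only: `cl_status_name`, `cl_ok` and `CL_CHECK` are hypothetical names, not part of OpenCL or of the example. Only a handful of status codes are named, using their numeric values from CL/cl.h, so the snippet compiles without the OpenCL headers.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: names for a few common OpenCL status codes.
 * The numeric values are the ones defined in CL/cl.h. */
const char *cl_status_name(int status)
{
    switch (status) {
    case 0:   return "CL_SUCCESS";
    case -1:  return "CL_DEVICE_NOT_FOUND";
    case -2:  return "CL_DEVICE_NOT_AVAILABLE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -30: return "CL_INVALID_VALUE";
    case -32: return "CL_INVALID_PLATFORM";
    case -33: return "CL_INVALID_DEVICE";
    default:  return "unrecognized status";
    }
}

/* Hypothetical helper: returns 1 when the status is CL_SUCCESS (0);
 * otherwise prints a diagnostic to stderr and returns 0. */
int cl_ok(int status, const char *call, const char *file, int line)
{
    if (status == 0)
        return 1;
    fprintf(stderr, "error: %s: %s (%d) at '%s' line %d\n",
            call, cl_status_name(status), status, file, line);
    return 0;
}

/* Hypothetical macro: stop the program as soon as a call fails, so later
 * code never reads buffers that were never written. */
#define CL_CHECK(status, call)                                  \
    do {                                                        \
        if (!cl_ok((status), (call), __FILE__, __LINE__))       \
            exit(EXIT_FAILURE);                                 \
    } while (0)
```

In the example's main.c this might be used as `ret = clGetDeviceIDs(...); CL_CHECK(ret, "clGetDeviceIDs");`, repeated after each API call. The program would then exit at the first failure instead of printing garbage sums.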