intel / compute-runtime

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver
MIT License

CL_OUT_OF_RESOURCES when compiling a SPIR-V with a bunch of small kernels #584

Open pjaaskel opened 2 years ago

pjaaskel commented 2 years ago

Is there a (relatively low) size limit for built SPIR-V modules? I'm getting CL_OUT_OF_RESOURCES when trying to build (via the CHIP-SPV runtime) a unit test in rocPRIM which has a bunch of test kernels. Omitting some of the kernels makes the test pass (I can also enable the omitted ones in turn, and they pass if I disable some of the others). This reproduces via both OpenCL and Level Zero.

In this case it's not a question of a large monolithic kernel that might fill up an instruction memory, but of a dozen or so smaller kernels which are launched separately. A lazy kernel binary deployment strategy at launch time should therefore avoid an imem limit issue, if that's what is happening here.

The kernels use a bit of shared memory, but not much. Is there a way to dump more info from the driver about the reason for the out-of-resources error?

SPIR-Vs of the working and non-working cases: spvs.zip

JablonskiMateusz commented 2 years ago

@pjaaskel could you share more details about your NEO driver version?

pjaaskel commented 2 years ago

Seems I have quite an old version (1.0.0). I've been under the assumption that I'd get updates through the apt package `intel-oneapi-runtime-opencl`, but it seems that's only the CPU driver? Am I still supposed to upgrade the GPU OpenCL driver via the GitHub .debs? I'm confused. I'll try upgrading via the debs to see if the latest version fixes it.

JablonskiMateusz commented 2 years ago

Please run `clinfo` and check the Driver Version.
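For anyone who wants to script that check, here is a minimal sketch that pulls the "Device Name" and "Driver Version" lines out of `clinfo` output. It assumes the usual human-readable `clinfo` layout (label, a run of spaces, then the value); the function names are made up for illustration and are not part of any Intel tooling.

```python
import re
import shutil
import subprocess


def parse_driver_versions(clinfo_text):
    """Return (device name, driver version) pairs found in clinfo-style text."""
    pairs = []
    device = None
    for line in clinfo_text.splitlines():
        # Assumed layout: "  Device Name      <value>" (label, 2+ spaces, value).
        name = re.match(r"\s*Device Name\s{2,}(.+)", line)
        if name:
            device = name.group(1).strip()
            continue
        version = re.match(r"\s*Driver Version\s{2,}(.+)", line)
        if version:
            pairs.append((device, version.group(1).strip()))
    return pairs


def query_driver_versions():
    """Run clinfo if it is installed; return an empty list otherwise."""
    if shutil.which("clinfo") is None:
        return []
    result = subprocess.run(["clinfo"], capture_output=True, text=True, check=False)
    return parse_driver_versions(result.stdout)


if __name__ == "__main__":
    for device, version in query_driver_versions():
        print(f"{device}: {version}")
```

The parsing is kept separate from the subprocess call so it can also be applied to `clinfo` output pasted from another machine.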

pjaaskel commented 2 years ago

Seems I still get the same CL_OUT_OF_RESOURCES problem with Driver Version 22.43.24558. It works when I prune down the number of tests. Is there a way I can debug the actual reason (which resource it runs out of) somehow?
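The manual pruning described above could be automated to narrow down a small failing kernel set to attach to a report. The following is only a sketch: `build_fails` is a hypothetical callback that you would have to implement yourself (for example, by regenerating the SPIR-V with just the listed kernels and checking whether the build returns CL_OUT_OF_RESOURCES). It also assumes that removing kernels never turns a passing build into a failing one.

```python
def shrink_failing_set(kernels, build_fails):
    """Greedily drop kernels while the build still fails.

    `build_fails(subset)` is a caller-supplied (hypothetical) predicate that
    returns True when building only `subset` reproduces the error. Under the
    monotonicity assumption stated above, the returned list is a failing set
    in which removing any single kernel makes the build pass.
    """
    if not build_fails(list(kernels)):
        raise ValueError("the full kernel set builds fine; nothing to shrink")
    current = list(kernels)
    i = 0
    while i < len(current):
        candidate = current[:i] + current[i + 1:]
        if candidate and build_fails(candidate):
            current = candidate  # kernel i is not needed to trigger the failure
        else:
            i += 1               # kernel i is needed; keep it and move on
    return current


if __name__ == "__main__":
    # Stand-in for a real build: pretend each kernel has a made-up "cost"
    # and the build fails once the total exceeds an arbitrary limit of 10.
    fake_costs = {"k0": 3, "k1": 5, "k2": 2, "k3": 4, "k4": 6}

    def fake_build_fails(subset):
        return sum(fake_costs[k] for k in subset) > 10

    print(shrink_failing_set(list(fake_costs), fake_build_fails))
```

Each step costs one build, so the whole search needs roughly as many builds as there are kernels. If the failure really depends on some cumulative resource, the size of the reduced set gives a rough idea of where the limit sits.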

pujaltes commented 1 month ago

@pjaaskel, we are getting a similar error. Did you find out how to debug the issue?

Interestingly enough, we only get the issue on Intel GPUs (1550 Max); it runs perfectly on Intel CPUs and Nvidia GPUs.