llvm / llvm-project

[OpenMP][Offload] runtime fails to launch kernels with more than 32 arguments #56389

Open FabioLuporini opened 2 years ago

FabioLuporini commented 2 years ago

Description

The title is probably self-explanatory. "Fat" OpenMP parallel loops, that is, parallel loops with a relatively large number of symbols, fail to offload and produce the following error message:

...
Too many arguments in kmp_invoke_microtask, aborting execution.
Too many arguments in kmp_invoke_microtask, aborting execution.
CUDA error: unspecified launch failure
Libomptarget error: Call to targetDataEnd failed, abort target.
Libomptarget error: Failed to process data after launching the kernel.

This is actually no surprise... I tried to debug it (only quite simplistic things admittedly), and eventually ended up here. Both in-line comments and code make it clear that attempting to offload a kernel with more than 32 symbols will fail.
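To illustrate why a hard cap like this exists, here is a simplified sketch of fixed-arity dispatch. It is not the actual DeviceRTL source: `invoke_microtask`, `GenericFn`, `kMaxArgs` and `add_region` are hypothetical names, the cap is 3 rather than 32 to keep it short, and the thread-id parameters the real entry point takes are left out.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical untyped handle for an outlined parallel region.
using GenericFn = void (*)();

// Illustrative cap; the thread reports 32 for the real device runtime.
constexpr int kMaxArgs = 3;

// Calls fn with the first nargs entries of args.
// Returns false when nargs exceeds the number of hand-written cases.
bool invoke_microtask(GenericFn fn, void **args, int nargs) {
  assert(nargs >= 0 && "argument count must be non-negative");
  switch (nargs) {
  case 0:
    fn();
    return true;
  case 1:
    reinterpret_cast<void (*)(void *)>(fn)(args[0]);
    return true;
  case 2:
    reinterpret_cast<void (*)(void *, void *)>(fn)(args[0], args[1]);
    return true;
  case 3:
    reinterpret_cast<void (*)(void *, void *, void *)>(fn)(args[0], args[1],
                                                           args[2]);
    return true;
  default:
    // Without variadic calls, every arity needs its own case.
    std::fprintf(stderr, "Too many arguments (%d > %d)\n", nargs, kMaxArgs);
    return false;
  }
}

// Hypothetical region body capturing two variables by pointer.
void add_region(void *a, void *b) {
  *static_cast<int *>(a) += *static_cast<int *>(b);
}
```

In this sketch, a region with two captured variables dispatches through `case 2`, while anything past `kMaxArgs` falls into `default`, which is the kind of path that would print the "Too many arguments" message above.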

Minimal example

A minimal failing example is available here. As you'll see, that's really a dummy use case... the real-life examples stem from Devito, which generates solvers for partial differential equations from a symbolic specification, and as you may guess, very often we end up with kernels with tens of terms...

Comments

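I wonder:

- Why the implementation doesn't use variadic functions. What am I missing?
- This wasn't happening in clang 12 or clang 13 IIRC, I can see from the commit that ships the new DeviceRTL that something has changed. Feels weird I'm the first one bumping into this? I apologise if I missed a duplicate issue. I searched but couldn't find any.
- Any workaround while we wait for a patch?
- I'm not seeing this issue with clang's AMD plugin or with NVIDIA's nvc (which I thought to be based on llvm? might be wrong here...)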

Thanks for looking into this. Any help would be appreciated. If the patch is as simple as supporting up to say 128 arguments, I'm happy to help out :) but I doubt...

llvmbot commented 2 years ago

@llvm/issue-subscribers-openmp

jhuber6 commented 2 years ago

> Why the implementation doesn't use variadic functions. What am I missing?

Because we don't support them, see here.

> This wasn't happening in clang 12 or clang 13 IIRC, I can see from the commit that ships the new DeviceRTL that something has changed. Feels weird I'm the first one bumping into this? I apologise if I missed a duplicate issue. I searched but couldn't find any.

Those versions use a different runtime library. The last version of the old runtime, before it was deleted, used the same scheme, so I'm not sure why it worked there, @jdoerfert. As far as I know you're the first person to need that many arguments to a parallel region on the device.

> Any workaround while we wait for a patch?

You should be able to put these in a struct instead; that way it's a single argument containing multiple data fields.
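A minimal sketch of what that could look like, assuming the extra symbols are scalar coefficients that can be bundled together. `PolyCoeffs` and `apply_poly` are hypothetical names for illustration, and the `map` clauses are an assumption about how the data would be mapped, not something taken from the failing example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical bundle of scalars that would otherwise each be captured
// as a separate argument of the parallel region.
struct PolyCoeffs {
  double c0, c1, c2, c3;
};

// Evaluates c0 + c1*x + c2*x^2 + c3*x^3 for every x in in[0..n).
void apply_poly(double *out, const double *in, std::size_t n, PolyCoeffs k) {
  assert(n == 0 || (out != nullptr && in != nullptr));
  // The struct is mapped as one object, so the region captures k
  // instead of four separate coefficients.
#pragma omp target teams distribute parallel for \
    map(to: in[0:n], k) map(from: out[0:n])
  for (std::size_t i = 0; i < n; ++i) {
    const double x = in[i];
    out[i] = k.c0 + x * (k.c1 + x * (k.c2 + x * k.c3));
  }
}
```

Built without OpenMP support the pragma is ignored and the loop runs serially on the host, which makes it easy to compare results before and after offloading.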

> I'm not seeing this issue with clang's AMD plugin or with NVIDIA's nvc (which I thought to be based on llvm? might be wrong here...)

The vendor compilers use a different OpenMP device runtime library; I don't know what they do.

> Thanks for looking into this. Any help would be appreciated. If the patch is as simple as supporting up to say 128 arguments, I'm happy to help out :) but I doubt...

That should be all that's required if you want more arguments to the region.

FabioLuporini commented 2 years ago

> Because we don't support them, see here.

ahhh, OK

> You should be able to put these in a struct instead; that way it's a single argument containing multiple data fields.

yeah I had thought about that, but it's a bit ugly so we've decided we're not going there

> That should be all that's required if you want more arguments to the region.

cool, on it then. Will keep u posted later today

FabioLuporini commented 2 years ago

@jhuber6 : https://reviews.llvm.org/D129197