Open matthiasdiener opened 3 years ago
Only pass arguments that are actually used in the kernel.
IMO this is the thing that'll have to fix it here. For single-work-item, we generate one gigantic kernel, which obviously will take tons of arguments. (We may be able to reduce this some, but that's almost beside the point.) Single-work-item is not a good way to talk to a GPU anyway, so the fact that it dies because it has too many arguments is not super important.
As soon as we go to "parallelized" aka launching multiple work items, we will have to split the workload into multiple GPU kernels, which each require fewer arguments. As long as we can get Loopy to realize that (which may require a little bit of work), this should resolve itself almost automatically.
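To illustrate why splitting helps, here is a minimal pure-Python sketch. It is not the Loopy or pytato API: `split_by_arg_budget`, the `(name, arrays)` statement representation, and the hard limit of 500 are all hypothetical, with the limit taken from the approximate figure quoted in this issue. It greedily groups statements into kernels so that each kernel's argument set stays under the limit. It ignores the cost of materializing intermediates between kernels.

```python
# Hypothetical sketch, not the Loopy/pytato API.
PTX_ARG_LIMIT = 500  # approximate limit quoted in this issue (assumption)


def split_by_arg_budget(statements, limit=PTX_ARG_LIMIT):
    """Greedily group statements into kernels whose argument sets fit `limit`.

    `statements` is a list of (statement_name, set_of_array_names) pairs.
    Returns a list of (statement_names, argument_set) pairs, one per kernel.
    """
    kernels = []
    current_stmts, current_args = [], set()
    for name, args in statements:
        if len(args) > limit:
            raise ValueError(
                f"statement {name!r} alone needs {len(args)} arguments")
        merged = current_args | args
        if current_stmts and len(merged) > limit:
            # Adding this statement would exceed the budget: start a new kernel.
            kernels.append((current_stmts, current_args))
            current_stmts, current_args = [], set()
            merged = set(args)
        current_stmts.append(name)
        current_args = merged
    if current_stmts:
        kernels.append((current_stmts, current_args))
    return kernels


# Toy workload: 760 statements touching 761 distinct arrays in total.
statements = [(f"stmt{i}", {f"a{i}", f"a{i + 1}"}) for i in range(760)]
kernels = split_by_arg_budget(statements)
for stmts, args in kernels:
    print(len(stmts), "statements,", len(args), "arguments")
```

A single fused kernel for this toy workload would need all 761 arrays as arguments, while each of the split kernels stays within the budget.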
This seems to be at least partially fixed: with the current package versions (07/23), wave-lazy and pulse-lazy seem to run fine on Nvidia GPUs.
Edit: On the other hand, isolator does not work, failing with the same error as above. See https://gist.github.com/matthiasdiener/46e2b9b5afb08c6dccb9ebd55710ff78 for the kernel.
Although it works on CPUs, building the fused kernel (_pt_kernel) on Nvidia GPUs results in the following error message:
The reason for this error is that the function takes too many arguments (~760 in this case, with Nvidia PTX supporting a maximum of ~500):
parallel.bc.ptx
There seems to be no workaround for this on the PTX side, so this needs to be fixed on the lazy eval side. Potential fixes include:
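For the fix of passing only the arguments that are actually used in the kernel, the idea can be sketched in a few lines of plain Python. This is only an illustration under assumptions: `prune_unused_args` and the name lists are hypothetical, and a real implementation would have to derive the set of used names from the kernel's instructions.

```python
# Hypothetical sketch, not the Loopy/pytato API.
def prune_unused_args(declared_args, used_names):
    """Return the declared arguments that the kernel body references.

    The original argument order is preserved so call sites stay predictable.
    """
    used = set(used_names)
    return [arg for arg in declared_args if arg in used]


declared = ["a", "b", "c", "tmp0", "tmp1"]   # arguments in the signature
used_in_body = {"a", "c"}                    # names the body actually reads or writes
kept = prune_unused_args(declared, used_in_body)
print(kept)
```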
cc: @kaushikcfd @inducer