Complex lazy kernels do not run on Nvidia GPUs

matthiasdiener commented 3 years ago

Although it works on CPUs, building the fused kernel (_pt_kernel) on Nvidia GPUs results in the following error message:

$ POCL_CACHE_DIR=. POCL_LEAVE_KERNEL_COMPILER_TEMP_FILES=1 python examples/wave-lazy.py
Choose platform:
[0] <pyopencl.Platform 'Portable Computing Language' at 0x7fcfb2f53008>
Choice [0]:
Choose device(s):
[0] <pyopencl.Device 'pthread-Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz' on 'Portable Computing Language' at 0x561404339780>
[1] <pyopencl.Device 'GeForce GTX TITAN X' on 'Portable Computing Language' at 0x561404339ab8>
[2] <pyopencl.Device 'GeForce GTX TITAN X' on 'Portable Computing Language' at 0x561404339df0>
Choice, comma-separated [0]:1
Set the environment variable PYOPENCL_CTX=':1' to avoid being asked again.
450 elements
CUDA_ERROR_INVALID_PTX: a PTX JIT compilation failed
Aborted (core dumped)

The reason for this error is that the function takes too many arguments (~760 in this case, whereas the PTX parameter space of 0x1100 = 4352 bytes reported below holds only ~540 arguments of 8 bytes each):

$ ptxas ./JO/KEBDIDFLAEGPGHDIAHICHOCGMFPANCCKBADDJ/_pt_kernel/0-0-0/parallel.bc.ptx --verbose
ptxas ./JO/KEBDIDFLAEGPGHDIAHICHOCGMFPANCCKBADDJ/_pt_kernel/0-0-0/parallel.bc.ptx, line 15407; error   : Entry function '_pt_kernel' uses too much parameter space (0x17c4 bytes, 0x1100 max).
ptxas fatal   : Ptx assembly aborted due to errors

parallel.bc.ptx
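
To make the numbers in the ptxas message easier to read, here is a small sketch that converts the two hexadecimal sizes into bytes and approximate argument counts. It assumes every kernel argument is an 8-byte (64-bit) pointer, which the thread does not state explicitly; the variable names are chosen for illustration only.

```python
# Values copied from the ptxas error message above.
used_bytes = 0x17c4   # parameter space requested by _pt_kernel
max_bytes = 0x1100    # limit reported by ptxas

# Assumption: every kernel argument is an 8-byte (64-bit) pointer.
POINTER_SIZE = 8

print(f"used:  {used_bytes} bytes, about {used_bytes // POINTER_SIZE} pointer arguments")
print(f"limit: {max_bytes} bytes, about {max_bytes // POINTER_SIZE} pointer arguments")
print(f"over the limit by {used_bytes - max_bytes} bytes")
```

Under that assumption, the kernel requests roughly 760 pointer arguments against a budget of roughly 544.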

There seems to be no workaround for this on the PTX side, so this needs to be fixed on the lazy eval side. Potential fixes include:

- Only pass arguments that are actually used in the kernel.
- Pack the arguments into (1 or more) structs.
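
As a rough illustration of the first option, the sketch below filters a kernel's argument list down to the names that actually appear in its body. It is not how the lazy evaluation code or Loopy does this; `prune_unused_args`, the argument names, and the kernel body are all hypothetical, and the text scan is only a stand-in for the dependency information a real code generator would already have.

```python
import re


def prune_unused_args(arg_names, kernel_body):
    """Return only the argument names that appear as identifiers in kernel_body."""
    # Collect every identifier-like token in the body (a crude textual scan).
    used = set(re.findall(r"[A-Za-z_]\w*", kernel_body))
    # Preserve the original argument order.
    return [name for name in arg_names if name in used]


# Hypothetical kernel: three arguments are declared, but only two are referenced.
args = ["out", "in0", "in1"]
body = "out[i] = 2.0 * in0[i];"
kept = prune_unused_args(args, body)
print(kept)
```

Here `in1` is never referenced, so it would be dropped from the argument list before the kernel is built.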

cc: @kaushikcfd @inducer

inducer commented 3 years ago

> Only pass arguments that are actually used in the kernel.

IMO this is the thing that'll have to fix it here. For single-work-item, we generate one gigantic kernel, which obviously will take tons of arguments. (We may be able to reduce this some, but that's almost beside the point.) Single-work-item is not a good way to talk to a GPU anyway, so the fact that it dies because it has too many arguments is not super important.

As soon as we go to "parallelized" aka launching multiple work items, we will have to split the workload into multiple GPU kernels, which each require fewer arguments. As long as we can get Loopy to realize that (which may require a little bit of work), this should resolve itself almost automatically.
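
The idea of splitting the workload so that each kernel needs fewer arguments can be sketched as a greedy grouping. This is only an illustration, not Loopy's algorithm: `split_into_kernels` and the workload are hypothetical, every argument is assumed to be an 8-byte pointer, and data dependencies between statements (which would force intermediate results through global memory) are ignored.

```python
MAX_PARAM_BYTES = 0x1100  # limit reported by ptxas in this thread
POINTER_SIZE = 8          # assumption: each argument is a 64-bit pointer


def split_into_kernels(statements, max_bytes=MAX_PARAM_BYTES):
    """Greedily group statements so each group's distinct arguments fit in max_bytes.

    `statements` is a list of sets of argument names, one set per statement.
    Returns a list of (statement indices, argument set) pairs, one per kernel.
    """
    kernels = []
    current_stmts, current_args = [], set()
    for idx, stmt_args in enumerate(statements):
        if len(stmt_args) * POINTER_SIZE > max_bytes:
            raise ValueError(f"statement {idx} alone exceeds the limit")
        merged = current_args | stmt_args
        if current_stmts and len(merged) * POINTER_SIZE > max_bytes:
            # Adding this statement would overflow: close the current kernel.
            kernels.append((current_stmts, current_args))
            current_stmts, current_args = [], set()
            merged = set(stmt_args)
        current_stmts.append(idx)
        current_args = merged
    if current_stmts:
        kernels.append((current_stmts, current_args))
    return kernels


# Hypothetical workload: 400 statements, each reading two arrays and writing one.
statements = [{f"in{2 * i}", f"in{2 * i + 1}", f"out{i}"} for i in range(400)]
kernels = split_into_kernels(statements)
print(len(kernels), [len(args) for _, args in kernels])
```

Each resulting group stays within the parameter-space budget, at the cost of more kernel launches.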

matthiasdiener commented 2 years ago

This seems to be at least partially fixed: with the current package versions (07/23), wave-lazy and pulse-lazy seem to run fine on Nvidia GPUs.

Edit: On the other hand, isolator does not work, failing with the same error as above. See https://gist.github.com/matthiasdiener/46e2b9b5afb08c6dccb9ebd55710ff78 for the kernel.

illinois-ceesd / mirgecom, issue #279