Open abagusetty opened 1 year ago
@npmiller @hdelan Any suggestions here? I was a bit blocked by this error on Frontier and was wondering if you have any pointers. Similar posts: https://github.com/intel/llvm/issues/9018, https://github.com/intel/llvm/issues/7511
All I had were these flags for the build:
-fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx90a
So for each SYCL kernel the compiler emits two versions of the same kernel: the original, and a second one with an extra hidden parameter that can be used for the SYCL global offset (the second one has the same kernel name but with _with_offset appended). See here: https://github.com/intel/llvm/blob/sycl/sycl/plugins/unified_runtime/ur/adapters/hip/kernel.cpp#L26
This failure occurs because the HIP adapter can't find the kernel with the extra offset argument. Perhaps the extra kernel with the offset argument is no longer being generated by the compiler. I'm not sure what might have changed here, but we can investigate.
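To make the two-variant lookup concrete, here is a minimal sketch in plain C++. It is not the adapter's code: MockModule and lookup_kernel are hypothetical names, and a set of strings stands in for the symbols of a loaded device binary. It only shows how a lookup can succeed for the base name and still fail for the _with_offset variant.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for the symbols found in a loaded device module.
using MockModule = std::set<std::string>;

// Result of looking up both variants of one kernel.
struct KernelLookup {
    bool found_base;         // kernel without the hidden offset argument
    bool found_with_offset;  // variant taking the hidden global-offset argument
};

// Look up the kernel under its own name and under name + "_with_offset".
// Assumption: the suffix is simply appended to the kernel name, as described above.
KernelLookup lookup_kernel(const MockModule& symbols, const std::string& name) {
    KernelLookup result;
    result.found_base = symbols.count(name) > 0;
    result.found_with_offset = symbols.count(name + "_with_offset") > 0;
    return result;
}
```

If the compiler stopped emitting the second variant, found_base would be true while found_with_offset is false, which matches the symptom described here.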
Thanks, that makes sense. I was also looking at the MLIR updates from the last couple of weeks to see where things could have changed. Just curious: why hasn't the CI been seeing these issues?
I've tried to run the same app on a CUDA machine, and that too resulted in a similar error:
[PI] piKernelCreate(
program: 0x78f7b50
kernel_name: _ZTSZN5GauXC13load_balancer5dpcpp19collision_detectionEmmmPKdS3_S3_S3_PiS4_PN4sycl3_V15queueEE34collision_detection_gpu_syclkernel
ret_kernel: 0x7ffcb283c298
)
* [CU] cuCtxGetCurrent(
* pctx: <non-printable>
* ) ---> CUDA_SUCCESS
*
* [CU] cuModuleGetFunction(
* hfunc: <non-printable>
* hmod: <non-printable>
* name: _ZTSZN5GauXC13load_balancer5dpcpp19collision_detectionEmmmPKdS3_S3_S3_PiS4_PN4sycl3_V15queueEE34collision_detection_gpu_syclkernel
* ) ---> CUDA_ERROR_NOT_FOUND
*
---> PI_ERROR_INVALID_KERNEL_NAME
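The kernel name in the trace is an Itanium-ABI mangled symbol (the _ZTS prefix marks a typeinfo name), so demangling it makes it easier to see which kernel the runtime failed to find. A small sketch, assuming a GCC or Clang toolchain that provides abi::__cxa_demangle in <cxxabi.h>; the helper name demangle is my own:

```cpp
#include <cxxabi.h>
#include <cstdlib>
#include <string>

// Demangle an Itanium-ABI symbol name.
// Returns the input unchanged if the name cannot be demangled.
std::string demangle(const std::string& mangled) {
    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result with malloc.
    char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);  // the caller owns the malloc'ed buffer
    return result;
}
```

Passing the kernel_name string from the trace to demangle should give a readable name that includes the enclosing function GauXC::load_balancer::dpcpp::collision_detection and the kernel type collision_detection_gpu_syclkernel. The c++filt command-line tool does the same job.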
That's interesting, thanks @abagusetty. Could you possibly post some instructions on how to reproduce this behaviour? Thanks
Also, to answer your initial question, the HIP/CUDA plugins now work through Unified Runtime (UR). The old design was
SYCL RT <-> libpi_hip.so
We are in the transition phase to using UR in the SYCL RT, but for the moment libpi_hip.so is actually just the HIP UR adapter with a small shim layer that resolves the pi calls to ur calls. This should change soon: the SYCL RT will no longer call pi symbols, but will use ur symbols instead. This is the reason you are getting UR ERROR etc.
See here for the sources going into libpi_cuda.so, for instance: https://github.com/intel/llvm/blob/sycl/sycl/plugins/cuda/CMakeLists.txt#L49
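As a rough illustration of what such a shim layer does, here is a self-contained sketch. Every name in it (MockPiResult, MockUrResult, mock_pi_kernel_create, mock_ur_kernel_create) is hypothetical and deliberately prefixed with "mock"; the real PI and UR entry points have different signatures and much larger result enums. The idea is only that the legacy entry point forwards the call and translates the result code.

```cpp
#include <string>

// Hypothetical result codes for the two interfaces.
enum class MockUrResult { Success, InvalidKernelName };
enum class MockPiResult { Success, InvalidKernelName };

// Stand-in for the adapter-side entry point that does the real work.
// Here an empty name plays the role of "kernel not found in the module".
MockUrResult mock_ur_kernel_create(const std::string& kernel_name) {
    return kernel_name.empty() ? MockUrResult::InvalidKernelName
                               : MockUrResult::Success;
}

// Translate an adapter result into the code the legacy interface expects.
MockPiResult to_pi_result(MockUrResult result) {
    switch (result) {
        case MockUrResult::Success:
            return MockPiResult::Success;
        case MockUrResult::InvalidKernelName:
            return MockPiResult::InvalidKernelName;
    }
    return MockPiResult::InvalidKernelName;  // unreachable for the values above
}

// Shim: the legacy entry point forwards to the adapter and maps the result.
MockPiResult mock_pi_kernel_create(const std::string& kernel_name) {
    return to_pi_result(mock_ur_kernel_create(kernel_name));
}
```

With a layer like this in place, an error raised inside the adapter surfaces through the legacy call, which is why UR-flavoured errors can appear even when the application only sees the plugin library.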
@hdelan Thanks for the above info. Currently the source code is in a private repo, and creating a minimal reproducer has not been successful yet. Also, the app needs dependencies like oneMKL and oneDPL to be built.
Is there some other way I can provide more info on this (maybe with save-temps, etc.)? As a last resort, I can provide access to the repo with convenient build instructions to reproduce.
save-temps might be useful. If you can provide all the PTX files that the compilation produces, followed by the error, that might help.
It is also worth asking - are these kernels unnamed or are they named? If you name all the kernels in the global namespace then these missing symbol errors will be a little bit easier to parse.
Most are named and only a couple are unnamed. The unnamed ones are the ones being called in a loop, so when I tried to name them I got errors because of conflicting names. The error reported above is from a named kernel. I have also tried changing the troubled kernel to unnamed; it didn't make a difference.
Will generate the PTX and attach it here.
I was trying to use the HIP plugin (via ONEAPI_DEVICE_SELECTOR=ext_oneapi_hip:gpu) and got an error from the UR plugin though. Does SYCL for HIP now default to the UR plugin, or is there a way to switch and use the HIP plugin instead?
Commit: 0e499482ac57d0a1cec950afac8a7c8d0f15a48e