[SYCL][UR][HIP] Error from UR plugin though HIP plugin was intended to use

intel / llvm

[SYCL][UR][HIP] Error from UR plugin though HIP plugin was intended to use #10846

Open abagusetty opened 1 year ago

abagusetty commented 1 year ago

I was trying to use the HIP plugin (via ONEAPI_DEVICE_SELECTOR=ext_oneapi_hip:gpu) but got an error from the UR plugin instead. Does SYCL for HIP now default to the UR plugin, or is there a way to switch back to the HIP plugin?

UR HIP ERROR:
    Value:           500
    Name:            hipErrorNotFound
    Description:     named symbol not found
    Function:        urKernelCreate
    Source Location: /lustre/orion/gen243/scratch/abagusetty/llvm/sycl/plugins/unified_runtime/ur/adapters/hip/kernel.cpp:23

terminate called after throwing an instance of 'std::runtime_error'
  what():  Native API failed. Native API returns: -999 (Unknown PI error)

Commit: 0e499482ac57d0a1cec950afac8a7c8d0f15a48e

abagusetty commented 1 year ago

@npmiller @hdelan Any suggestions here? I am a bit blocked by this error on Frontier and was wondering if you have any pointers. Similar posts: https://github.com/intel/llvm/issues/9018, https://github.com/intel/llvm/issues/7511

The only flags I used for the build are:

-fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx90a

hdelan commented 1 year ago

For each SYCL kernel, the compiler emits two versions of the same kernel. The second one takes an extra hidden parameter that carries the SYCL global offset, and has the same kernel name with _with_offset appended. See here: https://github.com/intel/llvm/blob/sycl/sycl/plugins/unified_runtime/ur/adapters/hip/kernel.cpp#L26

This failure is because the HIP adapter can't find the kernel with the extra offset arg. Perhaps the extra kernel with the offset arg is no longer being generated by the compiler. I'm not sure what might have changed here, but we can investigate.
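To make the lookup described above concrete, here is a minimal sketch in plain C++. It is not the adapter's code: `DeviceModule`, `KernelPair` and `find_kernel_pair` are hypothetical names, and a set of strings stands in for the symbols of a compiled device module. The only detail taken from the thread is that the second kernel carries the `_with_offset` suffix.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for the symbol table of a compiled device module.
using DeviceModule = std::set<std::string>;

// Names of the two variants emitted for one SYCL kernel.
struct KernelPair {
    std::string plain;        // kernel without the hidden offset argument
    std::string with_offset;  // variant that takes the global offset
};

// Look up both variants of `name`. Returns false if either symbol is
// missing, which models the "named symbol not found" outcome above.
// (Assumption: the real adapter may treat a missing variant differently.)
bool find_kernel_pair(const DeviceModule& module, const std::string& name,
                      KernelPair& out) {
    const std::string offset_name = name + "_with_offset";
    if (module.count(name) == 0 || module.count(offset_name) == 0) {
        return false;
    }
    out.plain = name;
    out.with_offset = offset_name;
    return true;
}
```

With a module that contains only `my_kernel`, the lookup for `my_kernel` fails in this sketch because `my_kernel_with_offset` is absent.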

abagusetty commented 1 year ago

Thanks, that makes sense. I was also looking at the MLIR updates from the last couple of weeks to see where things could have changed. Just curious: why hasn't the CI been seeing these issues?

abagusetty commented 1 year ago

I've tried running the same app on a CUDA machine, and that too resulted in a similar error:

[PI] piKernelCreate(
    program: 0x78f7b50
    kernel_name: _ZTSZN5GauXC13load_balancer5dpcpp19collision_detectionEmmmPKdS3_S3_S3_PiS4_PN4sycl3_V15queueEE34collision_detection_gpu_syclkernel
    ret_kernel: 0x7ffcb283c298
)*  [CU] cuCtxGetCurrent(
*     pctx: <non-printable>
*  ) ---> CUDA_SUCCESS
*  
*  [CU] cuModuleGetFunction(
*     hfunc: <non-printable>
*     hmod: <non-printable>
*     name: _ZTSZN5GauXC13load_balancer5dpcpp19collision_detectionEmmmPKdS3_S3_S3_PiS4_PN4sycl3_V15queueEE34collision_detection_gpu_syclkernel
*  ) ---> CUDA_ERROR_NOT_FOUND
*  
 ---> PI_ERROR_INVALID_KERNEL_NAME

hdelan commented 1 year ago

That's interesting, thanks @abagusetty. Could you post some instructions for reproducing this behaviour? Thanks

hdelan commented 1 year ago

Also, to answer your initial question: the HIP/CUDA plugins now work through Unified Runtime (UR). The old design was

SYCL RT  <-> libpi_hip.so

We are in the transition phase to using UR in the SYCL RT, but for the moment libpi_hip.so is actually just the HIP UR adapter with a small shim layer that resolves the pi calls to ur calls. This should change soon: the SYCL RT will no longer call pi symbols, but will use ur symbols instead. This is the reason you are getting UR ERROR etc.

See here for the sources going in to libpi_cuda.so, for instance

https://github.com/intel/llvm/blob/sycl/sycl/plugins/cuda/CMakeLists.txt#L49
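As a rough illustration of what such a shim layer does, the sketch below forwards a pi-style entry point to a ur-style one and translates the result code. Every name here (`UrResult`, `PiResult`, `ur_kernel_create_sketch`, `pi_kernel_create_sketch`, `to_pi`) is hypothetical and chosen for illustration; the real PI and UR interfaces have different signatures and far more result codes.

```cpp
#include <cassert>
#include <string>

// Hypothetical result codes for the two layers.
enum class UrResult { Success, InvalidKernelName, AdapterSpecific };
enum class PiResult { Success, InvalidKernelName, Unknown };

// Stand-in for an adapter entry point on the ur side.
// It rejects an empty kernel name and accepts anything else.
UrResult ur_kernel_create_sketch(const std::string& kernel_name) {
    return kernel_name.empty() ? UrResult::InvalidKernelName
                               : UrResult::Success;
}

// Translate a ur-side result into the pi-side code the runtime expects.
// Codes without a direct counterpart collapse to Unknown in this sketch.
PiResult to_pi(UrResult result) {
    switch (result) {
    case UrResult::Success:
        return PiResult::Success;
    case UrResult::InvalidKernelName:
        return PiResult::InvalidKernelName;
    default:
        return PiResult::Unknown;
    }
}

// The pi-side symbol the runtime still calls: it only forwards to the
// ur-side function and converts the result.
PiResult pi_kernel_create_sketch(const std::string& kernel_name) {
    return to_pi(ur_kernel_create_sketch(kernel_name));
}
```

The point of the design is that the runtime keeps calling pi symbols unchanged while all the actual work, and therefore the error reporting, happens in the UR adapter.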

abagusetty commented 1 year ago

@hdelan Thanks for the above info. Currently the source code is in a private repo, and my attempts to create a minimal reproducer have not succeeded yet. The app also needs dependencies like oneMKL and oneDPL to be built.

Is there some other way I can provide more info on this (maybe with save-temps, etc.)? As a last resort, I can provide access to the repo along with build instructions to reproduce.

hdelan commented 1 year ago

save-temps might be useful. If you can provide all the PTX files that the compilation produces followed by the error that might help.

It is also worth asking - are these kernels unnamed or are they named? If you name all the kernels in the global namespace then these missing symbol errors will be a little bit easier to parse.
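One way to see where a kernel name is declared is to demangle the symbol from the trace. The sketch below uses `abi::__cxa_demangle` from `<cxxabi.h>`, which assumes a GCC or Clang toolchain with the Itanium C++ ABI; the `demangle` helper itself is a hypothetical wrapper, not part of any SYCL tooling.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Demangle an Itanium-ABI symbol name.
// Returns an empty string if the name cannot be demangled.
std::string demangle(const std::string& mangled) {
    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result
    // with malloc, so it must be released with free.
    char* raw = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || raw == nullptr) {
        std::free(raw);
        return std::string();
    }
    std::string result(raw);
    std::free(raw);
    return result;
}
```

For example, `demangle("_ZTS8MyKernel")` yields `typeinfo name for MyKernel`, a kernel name class in the global namespace. Feeding it the `kernel_name` from the CUDA trace earlier in this thread should instead show the name nested inside the enclosing `GauXC::load_balancer::dpcpp::collision_detection` function, which is what makes such symbols longer and harder to read.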

abagusetty commented 1 year ago

Most are named and only a couple are unnamed. The unnamed ones are called in a loop, so when I tried to name them I got errors because of conflicting names. The errors reported above come from a named kernel. I also tried changing the troubled kernel to unnamed, but it didn't make a difference.

Will generate the ptx and attach here.