Allow for stateless addressing flags for >4GB allocations for devices to be passed through SYCL

simonlui commented 10 months ago

According to https://github.com/intel/compute-runtime/blob/master/programmers-guide/ALLOCATIONS_GREATER_THAN_4GB.md, there are ways to make allocations larger than 4GB on devices that currently follow the standard Intel stateful addressing model. To do so, you must be able to pass CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL or ze_relaxed_allocation_limits_exp_desc_t through OpenCL or Level Zero, respectively. Unfortunately, there doesn't seem to be a way to do this through SYCL right now. This applies to anything in the SYCL backend that would use zeMemAllocDevice, zeMemAllocShared, and zeMemAllocHost for Level Zero, and clCreateBuffer, clCreateBufferWithProperties, clCreateBufferWithPropertiesINTEL, clSVMAlloc, clSharedMemAllocINTEL, clDeviceMemAllocINTEL, and clHostMemAllocINTEL for OpenCL.
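For illustration, here is a minimal sketch of what passing the Level Zero descriptor looks like when calling zeMemAllocDevice directly, outside of SYCL. The helper names (kFourGiB, needs_relaxed_limits, alloc_device_relaxed) are hypothetical and not part of any library, and treating "4GB" as 4 GiB is an assumption. The Level Zero part is only compiled when the header is available (this relies on C++17 `__has_include`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: the ">4GB" boundary discussed here is 4 GiB (binary gigabytes).
constexpr std::uint64_t kFourGiB = 4ull << 30;

// Hypothetical helper: true when a request is above the 4GB boundary and
// therefore needs the relaxed-limits descriptor.
constexpr bool needs_relaxed_limits(std::uint64_t bytes) {
  return bytes > kFourGiB;
}

#if __has_include(<level_zero/ze_api.h>)
#include <level_zero/ze_api.h>

// Hypothetical wrapper: chains ze_relaxed_allocation_limits_exp_desc_t into
// the device allocation descriptor through pNext when the size requires it.
inline ze_result_t alloc_device_relaxed(ze_context_handle_t ctx,
                                        ze_device_handle_t dev,
                                        std::size_t bytes, void **out) {
  assert(out != nullptr);

  ze_relaxed_allocation_limits_exp_desc_t relaxed = {};
  relaxed.stype = ZE_STRUCTURE_TYPE_RELAXED_ALLOCATION_LIMITS_EXP_DESC;
  relaxed.flags = ZE_RELAXED_ALLOCATION_LIMITS_EXP_FLAG_MAX_SIZE;

  ze_device_mem_alloc_desc_t desc = {};
  desc.stype = ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC;
  desc.pNext = needs_relaxed_limits(bytes) ? &relaxed : nullptr;

  // Alignment of 1 byte places no extra constraint on the allocation.
  return zeMemAllocDevice(ctx, &desc, bytes, /*alignment=*/1, dev, out);
}
#endif
```

The point of the sketch is that the descriptor has to be attached at the allocation call itself, which is the step SYCL currently gives no way to influence.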

Since the compiler here is essentially what takes in SYCL and emits Level Zero or OpenCL code for various Intel projects, I think this is the right place to discuss this. Unfortunately, I'm not sure what it would take to make this happen. Would this become a non-standard vendor extension to SYCL, or would it need to be standardized? I am opening this issue because the limitation seems to affect downstream packages such as oneDNN and Intel Extension for PyTorch (IPEX), which use SYCL to make their allocations. IPEX is choosing to cap allocations at 4GB and disallow anything larger, which I don't think is a good solution, given that there are valid use cases for needing more than 4GB even if it involves a performance penalty. I hope this can be considered and a path forward can be found. Thank you.

abagusetty commented 10 months ago

By any chance, have you already tried this? export SYCL_PROGRAM_COMPILE_OPTIONS=" -ze-opt-greater-than-4GB-buffer-required"

simonlui commented 10 months ago

I don't doubt that this would let you pass the required compile flags for >4GB allocations. But according to the document I linked, it doesn't solve the problem of passing the allocation flags I mentioned, which are needed for the allocation to work correctly. I also don't personally have an application that would use this. It is more or less a gap I identified after frequently running into this 4GB memory limit while using Intel Extension for PyTorch. That is why I submitted this report.

intel / llvm

Allow for stateless addressing flags for >4GB allocations for devices to be passed through SYCL #10946