Open pbalcer opened 1 year ago
ping @jandres742 @smaslov-intel @bmyates @igchor @alycm
thanks @pbalcer .
Maybe we don't need anything other than UR_ADAPTERS_FORCE_LOAD? The good thing about UR_ADAPTERS_FORCE_LOAD is that users can select L0 one time and CUDA the next, w/o needing to force a specific change in the code. Having a flag in urInit might then translate into another env var in SYCL, probably, to know which flag to pass there.
If the loader sees UR_ADAPTERS_FORCE_LOAD, then the loader would just pass through directly to the selected adapter, right?
Are there any disadvantages or limitations to using UR_ADAPTERS_FORCE_LOAD?
Yes, if there's only one adapter specified in `UR_ADAPTERS_FORCE_LOAD` (it supports a comma-separated list), then the direct code path is used.
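To make the rule above concrete, here is a minimal sketch of how a loader could decide between the direct and the indirect code path from the value of `UR_ADAPTERS_FORCE_LOAD`. The helper names (`splitAdapterList`, `shouldUseDirectPath`, `forcedAdaptersFromEnv`) are hypothetical and are not taken from the UR loader sources; only the variable name and the comma-separated format come from this thread.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Split a comma-separated value into its non-empty entries.
std::vector<std::string> splitAdapterList(const std::string &value) {
    std::vector<std::string> entries;
    std::stringstream stream(value);
    std::string entry;
    while (std::getline(stream, entry, ',')) {
        if (!entry.empty()) {
            entries.push_back(entry);
        }
    }
    return entries;
}

// Hypothetical decision helper: the direct (passthrough) path is only
// possible when exactly one adapter has been forced.
bool shouldUseDirectPath(const std::vector<std::string> &forcedAdapters) {
    return forcedAdapters.size() == 1;
}

// Read the list from the environment; an unset variable yields an empty list.
std::vector<std::string> forcedAdaptersFromEnv() {
    const char *raw = std::getenv("UR_ADAPTERS_FORCE_LOAD");
    return raw ? splitAdapterList(raw) : std::vector<std::string>{};
}
```

With this sketch, a value naming a single library would select the direct path, while a value such as `a.so,b.so` (placeholder names) or an unset variable would not.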
The only problem with this approach I can think of is that it requires the user to know the exact full path to the adapter (or just the exact library name if the adapter resides in a path that `dlopen` can find automatically). That might not be known if the adapter is installed automatically with some package in a custom location.
Maybe we should have a conf file/dir in `/etc/ur.d/` (and something equivalent on Windows) that the adapters register themselves in, and the user can then just change the config file to pick an adapter from the ones listed? It would be much more work, but we probably need something like this for Windows anyway to address #128.
thanks @pbalcer . Ah, so UR_ADAPTERS_FORCE_LOAD takes a full path? I thought it only needed the name of the adapter. Then maybe we need another env var? Something like UR_ADAPTER_LIST=
We could either use the same format as ONEAPI_DEVICE_SELECTOR, https://intel.github.io/llvm-docs/EnvironmentVariables.html, or even better, just read ONEAPI_DEVICE_SELECTOR in the UR loader and if only one backend selected, then pass-through.
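A rough sketch of the "only one backend selected" check could look like the following. It assumes the documented `backend:devices` term syntax with terms separated by `;`, and it simplifies a lot: discard filters (terms starting with `!`) are skipped, names are compared case-sensitively, and a wildcard backend is treated as "more than one". The function names are hypothetical.

```cpp
#include <set>
#include <sstream>
#include <string>

// Collect the backend names from a selector such as "level_zero:gpu;cuda:*".
// Simplification: discard filters ('!' prefix) are ignored here.
std::set<std::string> selectedBackends(const std::string &selector) {
    std::set<std::string> backends;
    std::stringstream stream(selector);
    std::string term;
    while (std::getline(stream, term, ';')) {
        if (term.empty() || term[0] == '!') {
            continue;
        }
        // Everything before the first ':' is the backend name.
        backends.insert(term.substr(0, term.find(':')));
    }
    return backends;
}

// Hypothetical check: passthrough is only considered when exactly one
// concrete (non-wildcard) backend is named.
bool selectsSingleBackend(const std::string &selector) {
    const std::set<std::string> backends = selectedBackends(selector);
    return backends.size() == 1 && backends.count("*") == 0;
}
```

A real implementation would also have to map the backend name to an adapter library, which this sketch leaves out.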
Yes, `UR_ADAPTERS_FORCE_LOAD` takes a path to dynamic libraries to load. As for `ONEAPI_DEVICE_SELECTOR`, the plan (#220) right now is to implement it once UR becomes the default path in SYCL, and then seamlessly switch over to filtering only in UR.
QMCPack uses MPI + SYCL + OpenMP. All three SW components can and do offload tasks to the available devices, possibly via different backends. All three could/will become clients of UR in the near future.
The UR_ADAPTERS_FORCE_LOAD option works, even in this situation, because it restricts all clients of UR equitably (all get passthrough or none do).
The ONEAPI_DEVICE_SELECTOR option works, even in this situation, because it restricts all clients of UR equitably (all get passthrough or none do).
The per-call `urInit` option runs into issues around multiple disjoint instances vs. a single shared instance.
OpenMP requires only one adapter (because of a requirement for homogeneity of devices, allegedly) -- it will always want to call `urInit` with exactly one platform, even if other clients of UR concurrently call `urInit` with multiple platforms or without restricting platforms (i.e. de facto multiple). Does OpenMP get its own instance of UR that uses the passthrough fast-path, or does it get a shared instance of UR that uses indirection because some other client asked for that?
The current UR loader, to support multiple adapters, has an indirection layer that creates and maintains wrappers around UR entities (or function class types, i.e., `platform`, `device` and so on) that store a pointer to adapter functions. If there's only one adapter, this layer is unused, and the loader calls the adapter functions directly.

This indirection adds an extraneous overhead for applications that use only one adapter but have more available in the system. This issue is to devise a way to allow applications to load and use only the desired adapter implementation, thus avoiding the overhead.
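For readers unfamiliar with the layer described above, the following toy sketch shows the general shape of such an indirection. All types and functions here (`AdapterDdi`, `PlatformWrapper`, `loaderDeviceGetCount`, the fake adapters) are made up for illustration and do not reflect the actual loader code.

```cpp
// Hypothetical function table exposed by an adapter.
struct AdapterDdi {
    int (*deviceGetCount)(void *nativePlatform);
};

// Loader-side wrapper: pairs the adapter's native handle with a pointer
// to the function table of the adapter that owns it.
struct PlatformWrapper {
    void *nativeHandle;
    const AdapterDdi *ddi;
};

// Indirect path: unwrap the handle, look up the owning adapter's table,
// then forward the call. This is the extra hop the issue wants to avoid.
int loaderDeviceGetCount(PlatformWrapper *platform) {
    return platform->ddi->deviceGetCount(platform->nativeHandle);
}

// Two toy adapters standing in for real backends.
int fakeL0DeviceGetCount(void *) { return 2; }
int fakeCudaDeviceGetCount(void *) { return 1; }
```

With a single adapter loaded, the application could call the equivalent of `fakeL0DeviceGetCount` directly, skipping both the wrapper allocation and the table lookup.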
Possible solutions:
- The `UR_ADAPTERS_FORCE_LOAD` environment variable can be set to a desired adapter, forcing the loader to use it. This is already possible.
- A `platform_flags` argument to `urInit`, with a way of selecting a single adapter.
  - `UR_PLATFORM_USE_FIRST`. But this might be hard to use since the order of adapters is unspecified.
  - `UR_PLATFORM_L0`, `UR_PLATFORM_CUDA`, `UR_PLATFORM_HIP`, and then this could be used like this: `urInit(0, UR_PLATFORM_L0 | UR_PLATFORM_CUDA);`. This would only work with predefined platforms.
- A filter callback passed to `urInit`, for example: `urInit(0, [](struct platform_descriptor *d) -> bool { return strcmp(d->name, "ur_adapter_level_zero") == 0; })`. This might be clunky to use from C, but I think it is the most universal.
- The application calls `urPlatformUnload` on the ones it doesn't intend to use or `urPlatformUseOnlyThis` (can't think of a name right now :-)) on the one it does. This fits into the existing API, but might be error-prone and tricky to implement safely.
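As an illustration of the callback variant, here is a small sketch of how a loader might apply such a filter to the adapters it discovers. `platform_descriptor` is borrowed from the example in this issue, but its layout (a single `name` field), the `platform_filter` alias, the `selectAdapters` helper, and the adapter name `ur_adapter_cuda` are assumptions made for this example only.

```cpp
#include <cstring>
#include <functional>
#include <string>
#include <vector>

// Assumed layout: only the `name` field from the issue's example is modeled.
struct platform_descriptor {
    const char *name;
};

// Hypothetical filter type: return true to keep the adapter.
using platform_filter = std::function<bool(const platform_descriptor *)>;

// Return the names of the adapters accepted by the filter.
// An empty filter keeps everything (today's default behavior).
std::vector<std::string>
selectAdapters(const std::vector<platform_descriptor> &available,
               const platform_filter &filter) {
    std::vector<std::string> kept;
    for (const platform_descriptor &descriptor : available) {
        if (!filter || filter(&descriptor)) {
            kept.emplace_back(descriptor.name);
        }
    }
    return kept;
}
```

If exactly one name comes back, the loader could take the direct path; otherwise it would fall back to the indirection layer. A C caller would need a plain function pointer plus a user-data argument instead of `std::function`, which is the clunkiness mentioned above.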