oneapi-src / unified-runtime

https://oneapi-src.github.io/unified-runtime/

Provide a way for applications to use a single adapter #355

Open pbalcer opened 1 year ago

pbalcer commented 1 year ago

To support multiple adapters, the current UR loader has an indirection layer that creates and maintains wrappers around UR entities (or function class types, i.e., platform, device, and so on), each storing a pointer to the adapter functions. If there's only one adapter, this layer is unused and the loader calls the adapter functions directly.

This indirection adds unnecessary overhead for applications that use only one adapter but have more available in the system. The goal of this issue is to devise a way for applications to load and use only the desired adapter implementation, thus avoiding the overhead.
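To make the two code paths concrete, here is a minimal sketch of the idea, not the actual loader code. All names (`AdapterDdiTable`, `PlatformWrapper`, `Loader`, the fake adapters) are hypothetical and exist only for illustration; the real loader's data structures may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for one adapter's function table.
struct AdapterDdiTable {
    int (*deviceGetCount)(void *nativePlatform);
};

// Wrapper handed out in multi-adapter mode: the adapter's own handle
// plus a pointer to the function table of the adapter that owns it.
struct PlatformWrapper {
    void *nativeHandle;
    const AdapterDdiTable *ddi;
};

class Loader {
public:
    explicit Loader(std::vector<const AdapterDdiTable *> adapters)
        : adapters_(std::move(adapters)) {
        assert(!adapters_.empty());
    }

    // With exactly one adapter the wrapper layer is skipped entirely.
    bool passthrough() const { return adapters_.size() == 1; }

    // Returns the handle the application sees for an adapter's platform.
    void *wrapPlatform(std::size_t adapterIndex, void *nativePlatform) {
        if (passthrough()) return nativePlatform;  // raw adapter handle
        wrappers_.push_back(std::make_unique<PlatformWrapper>(
            PlatformWrapper{nativePlatform, adapters_[adapterIndex]}));
        return wrappers_.back().get();
    }

    int deviceGetCount(void *platform) const {
        if (passthrough())  // direct call, no extra pointer chase
            return adapters_[0]->deviceGetCount(platform);
        auto *w = static_cast<PlatformWrapper *>(platform);
        return w->ddi->deviceGetCount(w->nativeHandle);  // indirection
    }

private:
    std::vector<const AdapterDdiTable *> adapters_;
    std::vector<std::unique_ptr<PlatformWrapper>> wrappers_;
};

// Two fake adapters (illustrative only) reporting fixed device counts.
inline int fakeCountA(void *) { return 2; }
inline int fakeCountB(void *) { return 5; }
const AdapterDdiTable kFakeAdapterA{&fakeCountA};
const AdapterDdiTable kFakeAdapterB{&fakeCountB};
```

In this sketch the cost being discussed is the extra wrapper allocation plus one additional pointer dereference per call whenever more than one adapter is loaded, even if the application only ever talks to one of them.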

Possible solutions:

pbalcer commented 1 year ago

ping @jandres742 @smaslov-intel @bmyates @igchor @alycm

jandres742 commented 1 year ago

Thanks @pbalcer.

Maybe we don't need anything other than UR_ADAPTERS_FORCE_LOAD? The good thing about UR_ADAPTERS_FORCE_LOAD is that users can select L0 one time and CUDA the next, without needing a code change. A flag in urInit would then probably translate into another env var in SYCL, to know which flag to pass there.

If the loader sees UR_ADAPTERS_FORCE_LOAD, it would just pass through directly to the selected adapter, right?

Are there any disadvantages or limitations to using UR_ADAPTERS_FORCE_LOAD?

pbalcer commented 1 year ago

Yes. UR_ADAPTERS_FORCE_LOAD supports a comma-separated list, and if only one adapter is specified in it, the direct code path is used.
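As a rough illustration of that rule, the sketch below splits the variable's value on commas and treats a single entry as the direct-path case. The helper names (`splitAdapterList`, `forcedAdapters`, `usesDirectPath`) are made up for this example, and the real loader's parsing may treat whitespace, quoting, or empty entries differently.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Splits a comma-separated list and drops empty entries.
// (Assumption: empty entries are ignored; the real loader may differ.)
std::vector<std::string> splitAdapterList(const std::string &value) {
    std::vector<std::string> out;
    std::stringstream ss(value);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (!item.empty()) out.push_back(item);
    }
    return out;
}

// Reads UR_ADAPTERS_FORCE_LOAD from the environment; empty if unset.
std::vector<std::string> forcedAdapters() {
    const char *raw = std::getenv("UR_ADAPTERS_FORCE_LOAD");
    return raw ? splitAdapterList(raw) : std::vector<std::string>{};
}

// Exactly one forced adapter means the loader can skip the wrapper layer.
bool usesDirectPath(const std::vector<std::string> &adapters) {
    return adapters.size() == 1;
}
```

With this sketch, a value naming a single library would give `usesDirectPath(forcedAdapters()) == true`, while a value listing two libraries would not.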

The only problem with this approach I can think of is that it requires the user to know the exact full path to the adapter (or just the exact library name if the adapter resides in a path that dlopen can find automatically). The user might not know this if the adapter was installed automatically by some package in a custom location.

Maybe we should have a conf file/dir in /etc/ur.d/ (and something equivalent on Windows) that the adapters register themselves in, so the user can just change the config file to pick an adapter from the ones listed? It would be much more work, but we probably need something like this for Windows anyway to address #128.

jandres742 commented 1 year ago

Thanks @pbalcer. Ah, so UR_ADAPTERS_FORCE_LOAD takes a full path? I thought it only needed the name of the adapter. Then maybe we need another env var? Something like UR_ADAPTER_LIST=,,, which takes a comma-separated list of adapters to use. If only one is passed, the passthrough in the loader is used.

We could either use the same format as ONEAPI_DEVICE_SELECTOR (https://intel.github.io/llvm-docs/EnvironmentVariables.html) or, even better, just read ONEAPI_DEVICE_SELECTOR in the UR loader and, if only one backend is selected, pass through.
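A rough sketch of what "only one backend selected" could mean in code is below. It assumes a simplified view of the selector syntax (terms of the form `backend:devices` separated by `;`, with a leading `!` marking a discard filter) and does not implement the full grammar from the linked documentation. The function names are hypothetical.

```cpp
#include <set>
#include <sstream>
#include <string>

// Collects the backend names of all accept terms in a selector string.
// Discard filters (leading '!') are skipped in this simplified sketch.
std::set<std::string> acceptedBackends(const std::string &selector) {
    std::set<std::string> backends;
    std::stringstream ss(selector);
    std::string term;
    while (std::getline(ss, term, ';')) {
        if (term.empty() || term[0] == '!') continue;
        // Everything before the first ':' is the backend name.
        backends.insert(term.substr(0, term.find(':')));
    }
    return backends;
}

// True only when exactly one concrete backend is named. A wildcard
// backend, or a selector with no accept terms, is treated conservatively
// as "possibly many backends", so no passthrough.
bool selectsSingleBackend(const std::string &selector) {
    const std::set<std::string> backends = acceptedBackends(selector);
    return backends.size() == 1 && backends.count("*") == 0;
}
```

Under these assumptions, `level_zero:gpu` would qualify for passthrough, while `level_zero:gpu;opencl:cpu` and `*:gpu` would not.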

pbalcer commented 1 year ago

Yes, UR_ADAPTERS_FORCE_LOAD takes paths to the dynamic libraries to load. As for ONEAPI_DEVICE_SELECTOR, the plan (#220) right now is to implement it once UR becomes the default path in SYCL, and then seamlessly switch over to filtering only in UR.

Wee-Free-Scot commented 1 year ago

QMCPack uses MPI + SYCL + OpenMP. All three SW components can and do offload tasks to the available devices, possibly via different backends. All three could/will become clients of UR in the near future.

The UR_ADAPTERS_FORCE_LOAD and ONEAPI_DEVICE_SELECTOR options both work, even in this situation, because each restricts all clients of UR equitably (all get passthrough or none do). The per-call urInit option runs into issues around multiple disjoint instances vs. a single shared instance.

OpenMP requires only one adapter (because of a requirement for homogeneity of devices, allegedly), so it will always want to call urInit with exactly one platform, even if other clients of UR concurrently call urInit with multiple platforms or without restricting platforms (i.e. de facto multiple). Does OpenMP get its own instance of UR that uses the passthrough fast path, or does it get a shared instance of UR that uses indirection because some other client asked for that?
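The shared-instance half of that question can be sketched as follows. This is purely a hypothetical model, not a proposal for the actual urInit signature: `SharedLoaderState`, `clientInit`, and `clientTearDown` are invented names, and the model assumes a single process-wide loader whose mode depends on the union of what all clients requested.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical process-wide loader state shared by every UR client.
class SharedLoaderState {
public:
    // Records which adapters a client asked for at init time.
    void clientInit(const std::string &client,
                    const std::set<std::string> &adapters) {
        requests_[client] = adapters;
    }

    void clientTearDown(const std::string &client) {
        requests_.erase(client);
    }

    // The shared instance must load everything any client requested.
    std::set<std::string> loadedAdapters() const {
        std::set<std::string> all;
        for (const auto &entry : requests_)
            all.insert(entry.second.begin(), entry.second.end());
        return all;
    }

    // Passthrough is only possible while a single adapter is loaded.
    bool passthrough() const { return loadedAdapters().size() == 1; }

private:
    std::map<std::string, std::set<std::string>> requests_;
};
```

In this model, an OpenMP client that asks for one adapter gets passthrough only until another client (say SYCL) asks for a second one. A real loader would also have to deal with handles already handed out under the old mode, which this sketch ignores; that is part of why the per-call option is harder than the environment-variable ones.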