jswaro closed this 1 year ago
Below is a proof of concept which was used to collect the measured difference between patched and unpatched versions of libfabric 1.18. The proof of concept does not suggest that this is the only way -- or even the best way -- to implement the proposed feature. However, this is what was used to collect the data and prove the concept.
diff --git a/src/fabric.c b/src/fabric.c
index 50ba75e93..45683213e 100644
--- a/src/fabric.c
+++ b/src/fabric.c
@@ -81,6 +81,7 @@ int ofi_init = 0;
extern struct ofi_common_locks common_locks;
static struct ofi_filter prov_filter;
+static char *provider_filter = NULL;
static struct ofi_prov *
@@ -663,6 +664,11 @@ static void ofi_find_prov_libs(void)
if (!prov->prov_name)
continue;
+ if (provider_filter) {
+ if (!strstr(provider_filter, prov->prov_name))
+ continue;
+ }
+
if (ofi_has_util_prefix(prov->prov_name)) {
short_prov_name = prov->prov_name + strlen(OFI_UTIL_PREFIX);
} else if (ofi_has_offload_prefix(prov->prov_name)) {
@@ -823,6 +829,10 @@ void fi_ini(void)
fi_param_get_str(NULL, "provider", &param_val);
ofi_create_filter(&prov_filter, param_val);
+ fi_param_define(NULL, "provider_filter", FI_PARAM_STRING,
+ "Only search for the specified provider (default: all available)");
+ fi_param_get_str(NULL, "provider_filter", &provider_filter);
+
fi_param_define(NULL, "fork_unsafe", FI_PARAM_BOOL,
"Whether use of fork() may be unsafe for some providers "
"(default: no). Setting this to yes could improve "
diff --git a/src/hmem.c b/src/hmem.c
index 08a4a0fe0..4614fa610 100644
--- a/src/hmem.c
+++ b/src/hmem.c
@@ -397,8 +397,36 @@ void ofi_hmem_init(void)
{
int iface, ret;
int disable_p2p = 0;
+ char *param_val = NULL;
+
+ fi_param_define(NULL, "hmem_filter", FI_PARAM_STRING,
+ "filters HMEM providers");
+ fi_param_get_str(NULL, "hmem_filter", &param_val);
for (iface = 0; iface < ARRAY_SIZE(hmem_ops); iface++) {
+ if (param_val) {
+ switch (iface) {
+ case FI_HMEM_CUDA:
+ if (!strstr(param_val, "cuda"))
+ continue;
+ break;
+ case FI_HMEM_ROCR:
+ if (!strstr(param_val, "rocr"))
+ continue;
+ break;
+ case FI_HMEM_ZE:
+ if (!strstr(param_val, "ze"))
+ continue;
+ break;
+ case FI_HMEM_NEURON:
+ if (!strstr(param_val, "neuron"))
+ continue;
+ break;
+ default:
+ break;
+ }
+ }
+
ret = hmem_ops[iface].init();
if (ret != FI_SUCCESS) {
if (ret == -FI_ENOSYS)
@shefty ^^ just as an FYI.
@iziemba FYI
There is an FI_PROVIDER_PATH variable, which can be used to check a specific path for provider libraries. This option overrides checking the system library paths and may be an option in your use case. That could be expanded to accept something like /dev/null or some specific string to disable searching for DL providers.
There's no equivalent for HMEM, but I would like to first try to align HMEM with the provider filtering.
What I want to do:
I do not want to disable searching for DL providers. This would break the ability for a vendor to ship a vendor implementation of a provider separate of the libfabric core -- and not incur the discovery penalty that we are discussing in this feature proposal.
I would be open to modifying the back-end behavior of fi_info to filter the dlopen calls, but that doesn't solve the HMEM filtering problem.
There is an FI_PROVIDER_PATH variable, which can be used to check a specific path for provider libraries. This option overrides checking the system library paths and may be an option in your use case.
In this specific case, it would be far more preferable to not try to load the other providers at all. A customer could use the FI_PROVIDER_PATH to specify a local directory and send the dlopen calls to a local directory -- making it faster, but still failing to find anything.
That could be expanded to accept something like /dev/null or some specific string to disable searching for DL providers.
It's an option, but not one that I want to pursue. Consider that I would like to keep DL providers as an option, but only search for providers that I am interested in finding. The current implementation of fi_info assumes that the application writer is interested in all possible options (providers, capabilities, etc.) that fit the parameters of the query. However, there is no option in fi_info today that allows the function to skip unnecessary queries if the provider in question isn't what is desired.
Now we could alter the behavior of fi_info to filter provider discovery based on input parameters -- but that fundamentally changes the way that the function behaves internally today. That type of change is something I wanted to talk over with you first before proposing it here.
Another option would be to fix fi_info to filter dlopen calls based on the input parameters from the application writer, and extend the discovery aspect of fi_info to the hints provided by the application writer. By this, I mean that provider and hmem monitors would need to be part of the hints. It would solve the problem going forward, but would then require a new ABI for libfabric to support it. This means that current applications would need to align to a newer libfabric, and this could take some time, especially for ISVs. However, one of the biggest downsides then becomes that application writers need to be device-aware at compile or run time. Neither of these is ideal, and this would likely cause a significant number of issues, at least at the beginning. I believe this is one of the cases where an environment variable would outshine a solution in libfabric's API.
This could be done via environment variables today, leading into an API solution tomorrow. However, I would like to see a solution that could be shipped without recompiling applications to support the filtering needed.
One of the reasons this is the suggested path forward is that it provides some tuning options for WLMs such as SLURM, to help tune the application launch to the target network types available. For example, in the SLURM configuration, an administrator could specify that one of the resources on the nodes is a 'verbs' NIC. In that case, a SLURM plugin or script could export FI_PROVIDER_FILTER=verbs,tcp,etc... to limit the calls on behalf of the user or application writer and facilitate discovery in a way that doesn't force integration with the application.
Another option would be to fix fi_info to filter dlopen calls based on the input parameters from the application writer, and extend the discovery aspect of fi_info to the hints provided by the application writer. By this, I mean that provider and hmem monitors would need to be part of the hints. It would solve the problem going forward, but would then require a new ABI for libfabric to support it. This means that current applications would need to align to a newer libfabric, and this could take some time, especially for ISVs. However, one of the biggest downsides then becomes that application writers need to be device-aware at compile or run time. Neither of these is ideal, and this would likely cause a significant number of issues, at least at the beginning. I believe this is one of the cases where an environment variable would outshine a solution in libfabric's API.
A problem with the suggestion above is that I don't think fi_info/hints are designed to accept multiple options. Maybe that has changed recently, but I don't think it would work for an application writer to provide 'verbs,tcp,udp' and receive all available options for those three core providers today.
I'm asking to solve the provider problem first, then see if an HMEM solution can align to the same model. The provider problem seems to be that searching for and trying to load DL providers is slow. Assuming that's the issue:
Environment variables are for administrators. The API is for application writers. It's useful to distinguish between these. Provider selection/filtering is usually an administrative ask, not an application one. If it's coming through the application, it's usually because the admin set some application environment.
Applications can, but should not, modify environment variables. Apps can call fi_getinfo() multiple times, with different hints, to obtain a complex list of results. Libraries should never modify the environment. That breaks apps that link in multiple libraries, each of which uses libfabric underneath.
There is no API way to restrict loading providers, and I don't think we want that option. One of the goals of libfabric is to provide the same API over all providers, so that the app doesn't need to code for only one provider.
There is not a defined relationship between the name of a DL library and the provider name reported by that library. It's frequently the same or close (in the case of util providers), but not mandated. I was previously looking at having the same DL report as multiple providers, so I'd rather not mandate it.
I don't know how you do both "it would be far more preferable to not try to load the other providers at all" mixed with "I would like to keep DL providers as an option, but only search for providers that I am interested in finding". Either we're loading DL providers or not...
FI_PROVIDER_PATH can either reference a directory where 0 or more providers reside, reference a specific DL file to open (new option), or reference some keyword that indicates don't search anywhere (new option). We have an environment variable that acts as a provider filter (FI_PROVIDER). We have a separate variable that changes DL open (FI_PROVIDER_PATH). It's simpler for users if we extend those existing mechanisms and avoid variables that may attempt to drive the state in different directions.
Short version: I think we agree on almost all (>90%) points.
My goal is to reduce the amount of metadata fetches as a result of dlopen calls without removing the dlopen functionality. Any way that we achieve that without hamstringing application writers and administrators is probably fine by me.
I'm asking to solve the provider problem first, then see if an HMEM solution can align to the same model.
Understood. Let's focus on that.
Environment variables are for administrators. The API is for application writers. It's useful to distinguish between these. Provider selection/filtering is usually an administrative ask, not an application one. If it's coming through the application, it's usually because the admin set some application environment.
Not necessarily. Historically, some system maintainers will provide a network-{fabric_type} module or pre-built binary of certain libraries, such as MPI. These modules or pre-built libraries come with some assumptions built in, either as environment variables or baked into the pre-built binary via the API. In the specific case above, users might do module load network-verbs to pull up an environment that is tuned toward using verbs. Not necessarily administrator driven.
Applications can, but should not, modify environment variables. Apps can call fi_getinfo() multiple times, with different hints, to obtain a complex list of results. Libraries should never modify the environment. That breaks apps that link in multiple libraries, each of which uses libfabric underneath.
Agreed. You and I have both seen this happen multiple times with libibverbs.
There is no API way to restrict loading providers, and I don't think we want that option. One of the goals of libfabric is to provide the same API over all providers, so that the app doesn't need to code for only one provider.
Correct, there is no way to restrict loading providers today. I agree that the goal is to provide the same API over all providers. However, it is the case that sometimes an application knows ahead of time what provider it wants to use, and some aspects of discovery are not required. I also agree that we don't want to restrict loading providers from the API.
There is not a defined relationship between the name of a DL library and the provider name reported by that library. It's frequently the same or close (in the case of util providers), but not mandated. I was previously looking at having the same DL report as multiple providers, so I'd rather not mandate it.
I see where you are going with this. There is a one-to-one mapping today at least. How would you see that changing in the future, especially as it relates to DL provider support?
Would it be possible to mandate some mapping here?
#define PROV_X_LIBNAME A
#define PROV_Y_LIBNAME A
#define PROV_Z_LIBNAME C
where X and Y providers come from the same library, but Z does not.
I don't know how you do both "it would be far more preferable to not try to load the other providers at all" mixed with "I would like to keep DL providers as an option, but only search for providers that I am interested in finding". Either we're loading DL providers or not...
I believe this follows from the prior paragraph. If we know there is a mapping of providers to library names, then we should be able to read the list of providers that could/should be available, and only try to DL load those libraries. The proof of concept achieves this by skipping the DL load for any provider that wasn't selected.
FI_PROVIDER_PATH can either reference a directory where 0 or more providers reside, reference a specific DL file to open (new option), or reference some keyword that indicates don't search anywhere (new option). We have an environment variable that acts as a provider filter (FI_PROVIDER). We have a separate variable that changes DL open (FI_PROVIDER_PATH). It's simpler for users if we extend those existing mechanisms and avoid variables that may attempt to drive the state in different directions.
I can see that. However, will FI_PROVIDER_PATH accept more than one path? Perhaps part of this discussion goes into how we could make FI_PROVIDER_PATH fit this particular use case.
The only reason I do not like FI_PROVIDER as a filter is how it behaves today -- though maybe I'm not recalling it well.
FI_PROVIDER allows specifying a list of providers to report and includes an option to negate the filter (any provider not in the list). The filter is applied prior to calling the providers and not on output. It is a provider filter and applies prior to any filter which the app may have specified (such as setting the prov_name in fi_getinfo).
I have already made use of the fact that the DL library name is independent of the prov_name reported by that library. In my case, libnet-fi reported itself as the 'mlx' provider to work around an app which tried to use the provider name to select code paths, rather than capability bits. I also had a patch that allowed that DL to report itself as multiple providers, again to work around a ULP which used the provider name to select code paths.
FI_PROVIDER_PATH allows only trying to open providers found in a specific location. If the admin only copies the DL providers that it wants available into that path, then those will be the only ones that libfabric will attempt to load. If no DL providers should be opened, the path can point to an empty directory, though there's the cost that the directory itself will be checked. The variable supports specifying multiple directories.
FI_PROVIDER allows specifying a list of providers to report and includes an option to negate the filter (any provider not in the list). The filter is applied prior to calling the providers and not on output. It is a provider filter and applies prior to any filter which the app may have specified (such as setting the prov_name in fi_getinfo).
I don't think this is entirely true. Here is the example that I wrote up. Feel free to tell me if I'm interpreting the result incorrectly.

Options:
./configure --enable-only --enable-tcp
make -j12
minor patch because FI_LOG_LEVEL=debug was not printing the dlopen debug messages in the function
afabdb59f35a:/tmp/workspace # git diff
diff --git a/src/fabric.c b/src/fabric.c
index 45683213e..97bfe3093 100644
--- a/src/fabric.c
+++ b/src/fabric.c
@@ -605,6 +605,8 @@ static void ofi_reg_dl_prov(const char *lib)
struct fi_provider* (*inif)(void);
FI_DBG(&core_prov, FI_LOG_CORE, "opening provider lib %s\n", lib);
+ printf("opening provider lib %s\n", lib);
+
dlhandle = dlopen(lib, RTLD_NOW);
if (dlhandle == NULL) {
Example
./util/fi_info
afabdb59f35a:/tmp/workspace # ./util/fi_info
opening provider lib libefa-fi.so
opening provider lib libpsm2-fi.so
opening provider lib libopx-fi.so
opening provider lib libpsm-fi.so
opening provider lib libusnic-fi.so
opening provider lib libgni-fi.so
opening provider lib libbgq-fi.so
opening provider lib libverbs-fi.so
opening provider lib libnetdir-fi.so
opening provider lib libpsm3-fi.so
opening provider lib libucx-fi.so
opening provider lib librxm-fi.so
opening provider lib librxd-fi.so
opening provider lib libshm-fi.so
opening provider lib libudp-fi.so
opening provider lib libtcp-fi.so
opening provider lib libsockets-fi.so
opening provider lib libnet-fi.so
opening provider lib libhook_perf-fi.so
opening provider lib libhook_trace-fi.so
opening provider lib libhook_debug-fi.so
opening provider lib libhook_noop-fi.so
opening provider lib libhook_hmem-fi.so
opening provider lib libhook_dmabuf_peer_mem-fi.so
opening provider lib libcoll-fi.so
provider: tcp
fabric: 172.17.0.0/16
domain: eth0
version: 118.20
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
...
provider: tcp
fabric: ::1/128
domain: lo
version: 118.20
type: FI_EP_RDM
protocol: FI_PROTO_XNET
# then with FI_PROVIDER=udp -- something that couldn't possibly exist...
afabdb59f35a:/tmp/workspace # FI_PROVIDER=udp ./util/fi_info
opening provider lib libefa-fi.so
opening provider lib libpsm2-fi.so
opening provider lib libopx-fi.so
opening provider lib libpsm-fi.so
opening provider lib libusnic-fi.so
opening provider lib libgni-fi.so
opening provider lib libbgq-fi.so
opening provider lib libverbs-fi.so
opening provider lib libnetdir-fi.so
opening provider lib libpsm3-fi.so
opening provider lib libucx-fi.so
opening provider lib librxm-fi.so
opening provider lib librxd-fi.so
opening provider lib libshm-fi.so
opening provider lib libudp-fi.so
opening provider lib libtcp-fi.so
opening provider lib libsockets-fi.so
opening provider lib libnet-fi.so
opening provider lib libhook_perf-fi.so
opening provider lib libhook_trace-fi.so
opening provider lib libhook_debug-fi.so
opening provider lib libhook_noop-fi.so
opening provider lib libhook_hmem-fi.so
opening provider lib libhook_dmabuf_peer_mem-fi.so
opening provider lib libcoll-fi.so
fi_getinfo: -61
What I'm seeing is that FI_PROVIDER may be a filter on the results of fi_getinfo but it does not prevent discovery of other providers. That act of discovery incurs a penalty that I've described above. For what it's worth, I ran that test with the 1.18.x release branch.
As it is written, it functions as a filter on the output of the data from fi_getinfo's discovery, and not as a filter on what fi_getinfo attempts to discover.
I have already made use of the fact that the DL library name is independent of the prov_name reported by that library. In my case, libnet-fi reported itself as the 'mlx' provider to work around an app which tried to use the provider name to select code paths, rather than capability bits. I also had a patch that allowed that DL to report itself as multiple providers, again to work around a ULP which used the provider name to select code paths.
👍 Sounds good.
FI_PROVIDER_PATH allows only trying to open providers found in a specific location. If the admin only copies the DL providers that it wants available into that path, then those will be the only ones that libfabric will attempt to load. If no DL providers should be opened, the path can point to an empty directory, though there's the cost that the directory itself will be checked. The variable supports specifying multiple directories.
According to a brief test, it doesn't seem to attempt to open DL providers in the same way as the previous examples.
3129e4a13da9:/tmp/workspace # FI_PROVIDER_PATH=/tmp/workspace FI_PROVIDER=udp ./util/fi_info
fi_getinfo: -61
There aren't any printed messages because no libraries exist in that folder. However, it still attempts to open the directory and directory entries for the filter function. Adding an empty file for libefa-fi.so does show that it attempts to open it.
The combination of FI_PROVIDER and FI_PROVIDER_PATH could work, but it doesn't limit discovery. It will attempt to load any providers found in the directory.
Yes, FI_PROVIDER and FI_PROVIDER_PATH could work -- if the user/admin sets those values and only the desired provider exists in the FI_PROVIDER_PATH entries.
I am not a fan of the fact that the library will attempt to open any DL provider library in that directory, regardless of whether the user or admin wants it. If the user or admin is going through the trouble to set FI_PROVIDER, then it should limit the discovery to what was asked for.
For what it's worth, FI_PROVIDER could fill the role of the proposed FI_PROVIDER_FILTER if it limits discovery to only what was asked for. Right now, that is the gap.
You're mixing filtering providers with discovery. FI_PROVIDER is a filter applied during discovery. FI_PROVIDER_PATH changes what is discovered. These are separate items, with separate controls. There is not an enforced relationship between the library name and the prov_name that's reported.
In a simple example: librxm-fi.so, does NOT export the "rxm" provider. It exports "ofi_rxm". A vendor could choose to name their DL library something like libprov-v2.34-fi.so, in order to make multiple versions of their provider available. This is supported today and can matter when dealing with different wire protocols coming from the same provider.
A provider such as EFA (libefa-fi.so) requires the 'shm' provider (libshm-fi.so). Setting FI_PROVIDER=efa and using that to impact discovery would break that provider.
Sure. I can understand that.
One of the reasons behind a new environment variable (FI_PROVIDER_FILTER) was to separate the existing functionality and use cases around it from the new functionality identified here. FI_PROVIDER_FILTER could serve the purpose, as it is written to use a comma-delimited list of core providers which could be loaded. If any of those matches, then of course the DL provider library it comes from could be loaded.
I agree that we don't want to break discovery. An option here is FI_PROVIDER_FILTER=efa,shm, which allows discovery of those specific core providers. It could be extended to the utility providers.
I did experiments with setting FI_PROVIDER=cxi on Frontier and it did NOT reduce the number of dlopen() calls emanating from libfabric. So, at least as it is implemented now, the dlopen() part of discovery isn't being curtailed by the contents of FI_PROVIDER. However, with @jswaro 's proof-of-concept patched libfabric from above, setting FI_PROVIDER_FILTER=cxi did dramatically reduce the number of dlopen() calls being issued by libfabric. This curtailed dlopen() traffic reduced job launch time significantly at scale. I can supply the LD_DEBUG output, timing data, and methodology if you need proof.
I don't care how this gets resolved, but I care that there is some way for the sysadmins and/or users to be able to curtail the libfabric dlopen() calls for .so files that won't be found. Having each job launch do this dynamic discovery from every single rank on a multi-thousand-node machine is not only wasteful, it can, and did, cause instability on the machine's file servers.
Currently sysadmins are putting in fake .so files via symbolic links located in the early search path(s) in order to stop the search by dlopen() for these non-existent .so files. That sysadmin hack has already broken some users' attempts to use a Julia package, since it saw the fake .so files for CUDA and then failed to find the symbols for CUDA (in this case, the HMEM stuff).
If you want to limit provider visibility, the mechanism is FI_PROVIDER. If you want to limit DL searching, the mechanism is FI_PROVIDER_PATH. The latter option may currently require using mkdir and cp/mv of the desired providers into the search path, but is usable today. FI_PROVIDER_PATH expects a list of directories, but could be modified to also accept a list of files.
The amount of file system metadata traffic grows astoundingly large while looking for .so files that are not there (every single time): number_of_nodes * ranks_per_node * number_of_dlopen_calls_that_will_fail * (number_of_dirs_in_LD_LIBRARY_PATH + number_of_dirs_in_RPATH + number_of_dirs_in_RUNPATH)
And although the problem is worse for a very large job, since the dlopen() calls are roughly synchronized across the machine, it is still a problem for lots of small jobs that fill a multi-thousand-node machine. They may not startup synchronously, but they are all still flooding the file servers with this needless traffic.
The mitigations that have been deployed are not ideal, nor simple:
1) Make sure every user is educated about the problem, modifies their batch scripts to pre-stage the used .so files in node-local storage, AND shrinks the LD_LIBRARY_PATH (and RPATH/RUNPATH) to only point to the node-local .so directories. Be especially careful using things built with spack, which will put in RUNPATH/RPATH entries for each dependency, since that is how spack fulfills one of its design goals of letting executables that utilize different library versions/configs co-exist in the combinatoric explosion of modern HPC software stacks.
2) Have the sysadmins hack /usr/lib* to have dummy .so files to short-circuit the dlopen() search.
I could go on about this, but I hope I've made my point that dynamic discovery of "what .so files are available?" at scale is not desirable if it can't be turned off or filtered in some easy way.
@shefty : I understand your point about FI_PROVIDER and FI_PROVIDER_PATH. Yes, FI_PROVIDER_PATH could work in some cases.
There are two significant gaps in using FI_PROVIDER and FI_PROVIDER_PATH.
Yes, modifying FI_PROVIDER_PATH to use full file paths could work. However, it forces admins and users to know exactly where the libraries should exist or disable the functionality to avoid significant overhead costs. I have some concerns about portability, but I haven't fully explored that thought.
It isn't the job of libfabric to solve the side effects of decisions made when constructing system images. However, there are three options without additional work in libfabric:
The libraries need to be moved, and the change needs to be exported to the environment for everyone to pick up. It isn't trivial, and it increases the burden to users and customers.
You're mixing filtering providers with discovery. FI_PROVIDER is a filter applied during discovery.
FI_PROVIDER only filters the output to the user in terms of available providers, after discovery is complete. The filter occurs at the very end of fi_getinfo. https://github.com/ofiwg/libfabric/blob/main/src/fabric.c#L1322
It doesn't have an impact on the problem that was described above. It does not affect dlopen in any way that I can observe.
FI_PROVIDER_PATH changes what is discovered. These are separate items, with separate controls. There is not an enforced relationship between the library name and the prov_name that's reported.
FI_PROVIDER_PATH affects this problem by stating which directories to look in for the DL providers. Any provider DL library that can be loaded will be loaded. It only affects where dlopen will search. If the directory is a network mount, then the problem as described above will occur -- to a lesser degree.
@shefty , @timattox : I've been giving this some thought.
For most use cases, FI_PROVIDER_PATH should be good enough. However, for cases where a customer does want DL provider support, and wants the default behavior, then it is not.
So I have thought of an alternative.
A configure option, --enable-restricted-dl. By default, libfabric behaves the same as it does today. When enabled, libfabric will not attempt to load DL providers for any library that wasn't requested at compile time.
diff --git a/configure.ac b/configure.ac
index 234564485..30cb1e223 100644
--- a/configure.ac
+++ b/configure.ac
@@ -539,6 +539,18 @@ AS_IF([test $have_uffd -eq 1],
AC_DEFINE_UNQUOTED([HAVE_UFFD_THREAD_ID], [$have_uffd_thread_id],
[Define to 1 if platform supports userfault fd thread id])
+dnl restricted DL open
+restricted_dl=0
+AC_ARG_ENABLE([restricted_dl],
+ [AC_HELP_STRING([--enable-restricted-dl],
+ [Restricts dlopen to providers which are enabled in the base library.])],
+ [restricted_dl=1],
+ [])
+AC_DEFINE_UNQUOTED([HAVE_RESTRICTED_DL], [$restricted_dl],
+ [Define to 1 to limit the dlopen activity to providers enabled in the base library.])
+
+
+
dnl Check kdreg2 support
kdreg2_enabled=1
have_kdreg2=0
The population of the prov_head variable is affected by the new option. If the new option is enabled and the provider was not available at compilation, then libfabric won't populate it into the list. As a result, libfabric won't attempt to load the provider via dlopen using the standard search (which traverses all of LD_LIBRARY_PATH and /etc/ld.so.cache).
Note, this type of change does not prevent other libraries from being loaded with the alternative scanning option which searches for lib*-fi.so libraries in LD_LIBRARY_PATH or FI_PROVIDER_PATH. However, it does prevent the more expensive searching mechanism which occurs when FI_PROVIDER_PATH is not defined.
This is a solution, but not necessarily the correct one. I want to keep this conversation active.
The primary change occurs within ofi_ordered_provs_init
num_provs = sizeof(ordered_load_list) / sizeof(ordered_load_list[0]);
for (i = 0; i < num_provs; i++) {
if (HAVE_RESTRICTED_DL && !ordered_load_list[i].load)
continue;
prov = ofi_alloc_prov(ordered_load_list[i].provider_name);
if (prov)
ofi_insert_prov(prov);
}
@j-xiong : I'll move this to a pull request and I'd like to iterate over the design there. Is that preferable for you or would you like to continue to discuss it here prior to a PR?
@jswaro I am fine with either way. One thing I want to mention is that the purpose of ofi_ordered_provs_init() is to keep a list of known providers in a certain order. Providers not in this list are still going to be discovered and initialized later.
Is your feature request related to a problem? Please describe. In Libfabric deployments built with dlopen support, it is the case that Libfabric will probe and attempt to initialize each provider and HMEM monitor that has support built into Libfabric. In order to support the widest variety of customers, vendors may compile all HMEM solutions into Libfabric and use dlopen to relax restrictions on deployment instead of directly compiling the library into Libfabric. In this case, if a customer only has a single library for a single vendor (CUDA, as an example), Libfabric will successfully initialize the CUDA HMEM monitor, but probe and fail on the other monitors.
Additionally, if vendors do not compile other providers into the Libfabric deployment, then Libfabric will probe the file system to discover any possible DL provider solutions that exist as part of provider discovery.
In large-scale deployments, the cost of traversing all possible library locations and failing at each one can be substantial in terms of time and network operations for monitors that the customer knows do not exist. At a certain scale, the number of calls to the network file system can exceed the capacity of the metadata servers, causing an exponential increase in start-up time relative to the size of the job.
Describe the solution you’d like I think it would make sense to have a filter exposed via an environment variable to allow users to define which HMEM monitors and providers can be discovered.
Example 1:
FI_PROVIDER_FILTER=verbs FI_HMEM_FILTER=cuda,rocm ./my_application
The above invocation would limit Libfabric to initialize only the cuda and rocm monitor implementations, and only attempt to discover the verbs core provider. Any other implementation would not be initialized. This would result in fewer dlopen calls, and reduce the overall amount of time spent in start-up and provider discovery.
Example 2: ./my_application In this case, the environment variable isn’t set. This means that all providers and monitors will be probed, and initialized if present.
Describe alternatives you’ve considered
Alternative 1: source rebuild by customer/user In this case, it would be possible to provide the user with libfabric and the options that the vendor would like them to use when compiling libfabric. Customers/users could then selectively enable the hmem providers without dlopen support, and omit all other implementations. This gives customers the ability to compile for the target environment, but creates a new requirement on customer/user deployments, and doesn’t solve the problem for heterogeneous environments where some vendor GPUs are present on a portion of the hosts in a system, and other vendor GPUs are present on other hosts. In this case, probe/init failures are unavoidable.
Additionally, this does not address the DL provider probe cost.
Alternative 2: Device discovery In this case, it would require changes to libfabric to look for supported devices without opening the libraries. This could be accomplished by searching for NIC/GPU devices on the host, and only calling the provider or HMEM monitor initialization for the appropriate library if the appropriate device was discovered.
This would be a poor solution. If the logic for detection were too complicated, or prone to error, then customers would observe initialization issues that couldn’t be fixed without a patch or new libfabric package from the vendor.
Additional context Using LAMMPS, we have observed a significant decrease in start-up times between an unmodified version of libfabric, and a patched version of libfabric which contains the suggested changes. The measured difference between unpatched and patched libfabric varied from 5% to 40% decreased start-up time, ranging from seconds to 10s of seconds at 4000 nodes, 8 ranks per node.