Open janpfeifer opened 5 months ago
So, a manual hack to get it to link statically is to change the "nccl" rule to:
```bzl
cc_library(
    name = "nccl",
    srcs = [],
    # srcs = ["libnccl.so.%{nccl_version}"],
    hdrs = ["nccl.h"],
    include_prefix = "third_party/nccl",
    visibility = ["//visibility:public"],
    deps = [
        "@local_config_cuda//cuda:cuda_headers",
    ],
    linkopts = cuda_rpath_flags("nvidia/nccl/lib") + ["-lnccl_static"],
)
```
(`linkopts` was added and `srcs` was set to empty)
But ideally this would be controlled by the user choice of compiling it statically.
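One way to expose that choice could be a boolean build flag plus a `select()` on the rule. This is only a sketch: the `//third_party/nccl:static` flag and the `link_static` setting are hypothetical names, not anything in the actual XLA build files:

```bzl
# Hypothetical user-facing flag; requires bazel_skylib.
load("@bazel_skylib//rules:common_settings.bzl", "bool_flag")

bool_flag(
    name = "static",
    build_setting_default = False,
)

config_setting(
    name = "link_static",
    flag_values = {":static": "True"},
)

cc_library(
    name = "nccl",
    # Drop the .so stub and link the static archive when --//third_party/nccl:static=True.
    srcs = select({
        ":link_static": [],
        "//conditions:default": ["libnccl.so.%{nccl_version}"],
    }),
    hdrs = ["nccl.h"],
    include_prefix = "third_party/nccl",
    visibility = ["//visibility:public"],
    deps = ["@local_config_cuda//cuda:cuda_headers"],
    linkopts = select({
        ":link_static": ["-lnccl_static"],
        "//conditions:default": cuda_rpath_flags("nvidia/nccl/lib"),
    }),
)
```

The user would then opt in with `--//third_party/nccl:static=True` on the command line or in a bazelrc.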
While looking at the code I also found the `TF_NCCL_USE_STUB` variable: if it is set to "1" (or anything other than "0" or unset), it triggers the rule I mentioned above; otherwise, it takes another path that builds NCCL(?).
I tested adding `build --action_env TF_USE_STUB=0` to xla_configure.bazelrc, but it didn't help: NCCL was still linked dynamically. I'm not sure whether my value was overwritten, though...
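For what it's worth, the variable named earlier is `TF_NCCL_USE_STUB`, so a bazelrc line matching that spelling would look like the following. Whether this actually affects static vs. dynamic linking here is an assumption on my part, not something I've verified:

```bzl
# Hypothetical xla_configure.bazelrc line; the variable name is taken from
# the TF_NCCL_USE_STUB code path mentioned above, its effect is unverified.
build --action_env=TF_NCCL_USE_STUB=0
```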
@ddunl @PatriosTheGreat WDYT?
I'm an author of an ML Framework using XLA.
Per issue #11596, after a recent refresh of my build, the XLA build fails if I don't include NCCL. The easy fix would be to include NCCL in my build -- which is also good for other reasons -- but my default (and only) distribution of my ML framework works for both CPU and GPU. I achieve this by linking statically -- which is also simpler for the end user.
The issue is that NCCL (as opposed to the other CUDA libraries) doesn't have a static linking rule, even though NVidia distributes the
libnccl_static.a
file. Relevant bazel code. I assume this is a simple fix for someone with the right "bazel-fu" skills... but I'm not sure, and there may be other considerations I'm not aware of. Any help would be most appreciated!
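Since the static archive already ships with the NVidia distribution, one minimal shape for such a rule could be a `cc_import` of `libnccl_static.a` behind the existing headers target. This is a sketch under the assumption that the archive sits next to the existing stub in the repository rule's directory; the `nccl_static_archive` name is made up:

```bzl
# Hypothetical static variant of the existing "nccl" rule; not actual XLA code.
cc_import(
    name = "nccl_static_archive",
    # The .a file NVidia distributes alongside libnccl.so; exact path is an assumption.
    static_library = "libnccl_static.a",
)

cc_library(
    name = "nccl_static",
    hdrs = ["nccl.h"],
    include_prefix = "third_party/nccl",
    visibility = ["//visibility:public"],
    deps = [
        ":nccl_static_archive",
        "@local_config_cuda//cuda:cuda_headers",
    ],
)
```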