Open janpfeifer opened 5 months ago
So, a manual hack to get it to link statically is to change the "nccl" rule to:
```bzl
cc_library(
    name = "nccl",
    srcs = [],
    # srcs = ["libnccl.so.%{nccl_version}"],
    hdrs = ["nccl.h"],
    include_prefix = "third_party/nccl",
    visibility = ["//visibility:public"],
    deps = [
        "@local_config_cuda//cuda:cuda_headers",
    ],
    linkopts = cuda_rpath_flags("nvidia/nccl/lib") + ["-lnccl_static"],
)
```
(`linkopts` was added and `srcs` was set to empty)
But ideally this would be controlled by the user choice of compiling it statically.
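One way to expose that choice could be a boolean build flag plus a `select()` on the rule. This is only a sketch: the `//third_party/nccl:static` flag and the `link_static` setting are hypothetical names, not anything in the actual XLA build files:

```bzl
# Hypothetical user-facing flag; requires bazel_skylib.
load("@bazel_skylib//rules:common_settings.bzl", "bool_flag")

bool_flag(
    name = "static",
    build_setting_default = False,
)

config_setting(
    name = "link_static",
    flag_values = {":static": "True"},
)

cc_library(
    name = "nccl",
    # Drop the .so stub and link the static archive when --//third_party/nccl:static=True.
    srcs = select({
        ":link_static": [],
        "//conditions:default": ["libnccl.so.%{nccl_version}"],
    }),
    hdrs = ["nccl.h"],
    include_prefix = "third_party/nccl",
    visibility = ["//visibility:public"],
    deps = ["@local_config_cuda//cuda:cuda_headers"],
    linkopts = select({
        ":link_static": ["-lnccl_static"],
        "//conditions:default": cuda_rpath_flags("nvidia/nccl/lib"),
    }),
)
```

The user would then opt in with `--//third_party/nccl:static=True` on the command line or in a bazelrc.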
While looking at the code I also found the `TF_NCCL_USE_STUB` variable: if it is set to "1" (or anything other than "0" or unset), it triggers the rule I mentioned above; otherwise, it takes another path that builds NCCL(?).
I tested adding `build --action_env TF_USE_STUB=0` to xla_configure.bazelrc, but it didn't help: NCCL was still linked dynamically. I'm not sure whether my value was overwritten, though...
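For what it's worth, the variable named earlier is `TF_NCCL_USE_STUB`, so a bazelrc line matching that spelling would look like the following. Whether this actually affects static vs. dynamic linking here is an assumption on my part, not something I've verified:

```bzl
# Hypothetical xla_configure.bazelrc line; the variable name is taken from
# the TF_NCCL_USE_STUB code path mentioned above, its effect is unverified.
build --action_env=TF_NCCL_USE_STUB=0
```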
@ddunl @PatriosTheGreat WDYT?
I'm an author of an ML Framework using XLA.
Per issue #11596, after a recent refresh of my build, the XLA build fails if I don't include NCCL. The easy fix would be to include NCCL in my build -- which is also good for other reasons -- but my default (and only) distribution of my ML framework works for both CPU and GPU. I achieve this by linking statically -- which is also simpler for the end user.
The issue is that NCCL (as opposed to the other CUDA libraries) doesn't have a static linking rule, even though NVidia distributes the
libnccl_static.a
file. Relevant bazel code. I assume this is a simple fix for someone with the right "bazel-fu" skills... but I'm not sure, and there may be other considerations I'm not aware of. Any help would be most appreciated!
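Since the static archive already ships with the NVidia distribution, one minimal shape for such a rule could be a `cc_import` of `libnccl_static.a` behind the existing headers target. This is a sketch under the assumption that the archive sits next to the existing stub in the repository rule's directory; the `nccl_static_archive` name is made up:

```bzl
# Hypothetical static variant of the existing "nccl" rule; not actual XLA code.
cc_import(
    name = "nccl_static_archive",
    # The .a file NVidia distributes alongside libnccl.so; exact path is an assumption.
    static_library = "libnccl_static.a",
)

cc_library(
    name = "nccl_static",
    hdrs = ["nccl.h"],
    include_prefix = "third_party/nccl",
    visibility = ["//visibility:public"],
    deps = [
        ":nccl_static_archive",
        "@local_config_cuda//cuda:cuda_headers",
    ],
)
```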