google / nccl-fastsocket

NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.

Build doesn't work #1

Closed yselivonchyk closed 2 years ago

yselivonchyk commented 2 years ago

With a working installation of CUDA:

NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2

The build doesn't get far:

bazel build :all
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: SHA256 (https://github.com/bazelbuild/rules_pkg/archive/main.zip) = 4c9d7c26c8f1969f6518e5d7d52e947668107eb537c73c724b5a4b2f61646a08
DEBUG: Rule 'rules_pkg' indicated that a canonical reproducible form can be obtained by modifying arguments sha256 = "4c9d7c26c8f1969f6518e5d7d52e947668107eb537c73c724b5a4b2f61646a08"
DEBUG: Repository rules_pkg instantiated at:
  /home/e/nccl-fastsocket/WORKSPACE.bazel:25:13: in <toplevel>
Repository rule http_archive defined at:
  /home/e/.cache/bazel/_bazel_e/f1e3dc12e04fc3258fc15fe372504351/external/bazel_tools/tools/build_defs/repo/http.bzl:336:31: in <toplevel>
ERROR: error loading package '': cannot load '@rules_pkg//toolchains:rpmbuild_configure.bzl': no such file
INFO: Elapsed time: 4.119s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)

changlan commented 2 years ago

Thanks for reporting. Should be fixed in https://github.com/google/nccl-fastsocket/commit/284e8279afd46370401348101ed8e5f7168c9e6d. Can you verify?
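
For readers hitting the same error: the WORKSPACE fetches rules_pkg from archive/main.zip, and the SHA256 that Bazel reports for that URL differs between the logs in this thread, so the archive contents are moving. One common way to guard against that kind of drift (not necessarily what the commit above does) is to pin a fixed archive and pass its hash as sha256 to http_archive, as the DEBUG line suggests. A minimal sketch for computing that hash from a downloaded file; archive_sha256 is a hypothetical helper name, and the demo hashes a small local file rather than the real download:

```shell
#!/usr/bin/env bash
# Print the SHA-256 of a file as bare hex, the form used in
# http_archive(sha256 = "...").
# archive_sha256 is a hypothetical helper, not part of Bazel or this repo.
archive_sha256() {
  # sha256sum prints "<hash>  <filename>"; keep only the hash.
  sha256sum "$1" | cut -d' ' -f1
}

# Self-contained demo on a throwaway local file.
demo_file="$(mktemp)"
printf 'abc' > "$demo_file"
archive_sha256 "$demo_file"
rm -f "$demo_file"
```

For a real archive you would download it first (for example with wget) and pass the saved path to the function.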

yselivonchyk commented 2 years ago

Hi, I am still facing issues building the package.

(base) root@50cc2b63176d:/nccl-fastsocket# bazel build :all
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: SHA256 (https://github.com/bazelbuild/rules_pkg/archive/main.zip) = e50157bd9b1fae89a629bfe0c1d2712ff96d07e9b68ec9c9da3a0b8ce9d1983f
DEBUG: Rule 'rules_pkg' indicated that a canonical reproducible form can be obtained by modifying arguments sha256 = "e50157bd9b1fae89a629bfe0c1d2712ff96d07e9b68ec9c9da3a0b8ce9d1983f"
DEBUG: Repository rules_pkg instantiated at:
  /nccl-fastsocket/WORKSPACE.bazel:25:13: in <toplevel>
Repository rule http_archive defined at:
  /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/bazel_tools/tools/build_defs/repo/http.bzl:336:31: in <toplevel>
ERROR: /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/rules_pkg/pkg/private/pkg_files.bzl:377:12: name 'json' is not defined
ERROR: Skipping ':all': while parsing ':all': error loading package '': in /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/rules_pkg/pkg.bzl: in /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/rules_pkg/pkg/zip.bzl: in /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/rules_pkg/pkg/private/zip/zip.bzl: Extension 'pkg/private/pkg_files.bzl' has errors
WARNING: Target pattern parsing failed.
ERROR: while parsing ':all': error loading package '': in /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/rules_pkg/pkg.bzl: in /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/rules_pkg/pkg/zip.bzl: in /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/rules_pkg/pkg/private/zip/zip.bzl: Extension 'pkg/private/pkg_files.bzl' has errors
INFO: Elapsed time: 14.797s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (1 packages loaded)
(base) root@50cc2b63176d:/nccl-fastsocket#

The easiest way to reproduce this is in Docker:

docker pull nvidia/cuda
docker run -it nvidia/cuda /bin/bash

Then try the build inside the container:

apt-get update; apt-get install wget -y
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
chmod +x Miniconda3-latest-Linux-x86_64.sh && \
./Miniconda3-latest-Linux-x86_64.sh -b -p /root/miniconda3 && \
source /root/miniconda3/bin/activate && \
python -m pip install --upgrade pip

conda install git bazel -y && \
git clone https://github.com/google/nccl-fastsocket && \
cd nccl-fastsocket && \
bazel build :all
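
The "name 'json' is not defined" error is raised from rules_pkg's own .bzl files, which suggests the Bazel installed from conda is too old to provide Starlark's json module. A rough sketch of a version check to run before building. The 4.0.0 minimum is an assumption (roughly where the json module is thought to have appeared), version_at_least is a hypothetical helper, and very old Bazel releases may not accept --version at all:

```shell
#!/usr/bin/env bash
# Succeed if dotted version $1 is at least version $2.
# version_at_least is a hypothetical helper, not a Bazel command.
version_at_least() {
  # sort -V orders version strings; if the minimum sorts first
  # (or both are equal), the installed version is new enough.
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# ASSUMPTION: 4.0.0 is a guess, not a documented rules_pkg requirement.
min_bazel="4.0.0"

if command -v bazel >/dev/null 2>&1; then
  # `bazel --version` prints a line such as "bazel 4.2.1".
  installed="$(bazel --version | awk '{print $2}')"
  if version_at_least "$installed" "$min_bazel"; then
    echo "bazel $installed looks new enough"
  else
    echo "bazel $installed is older than $min_bazel" >&2
  fi
else
  echo "bazel not found on PATH" >&2
fi
```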

yselivonchyk commented 2 years ago

With an updated Bazel (installed from the official apt repository instead of conda), the build gets further and fails at a different step.

Installation:

apt-get update; apt-get install wget -y
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
chmod +x Miniconda3-latest-Linux-x86_64.sh && \
./Miniconda3-latest-Linux-x86_64.sh -b -p /root/miniconda3 && \
source /root/miniconda3/bin/activate && \
python -m pip install --upgrade pip && \
conda install git -y && \
apt install apt-transport-https curl gnupg -y && \
curl -fsSL https://bazel.build/bazel-release.pub.gpg | gpg --dearmor > bazel.gpg && \
mv bazel.gpg /etc/apt/trusted.gpg.d/ && \
echo "deb [arch=amd64] https://storage.googleapis.com/bazel-apt stable jdk1.8" | tee /etc/apt/sources.list.d/bazel.list && \
apt update && apt install bazel -y && \
git clone https://github.com/google/nccl-fastsocket && \
cd nccl-fastsocket && \
bazel build :all

Result:

(base) root@3e1dbf935b8d:/# git clone https://github.com/google/nccl-fastsocket && \
> cd nccl-fastsocket && \
> bazel build :all
Cloning into 'nccl-fastsocket'...
remote: Enumerating objects: 46, done.
remote: Counting objects: 100% (46/46), done.
remote: Compressing objects: 100% (31/31), done.
remote: Total 46 (delta 22), reused 37 (delta 13), pack-reused 0
Receiving objects: 100% (46/46), 30.26 KiB | 1.16 MiB/s, done.
Resolving deltas: 100% (22/22), done.
Starting local Bazel server and connecting to it...
DEBUG: Rule 'rules_pkg' indicated that a canonical reproducible form can be obtained by modifying arguments sha256 = "e50157bd9b1fae89a629bfe0c1d2712ff96d07e9b68ec9c9da3a0b8ce9d1983f"
DEBUG: Repository rules_pkg instantiated at:
  /nccl-fastsocket/WORKSPACE.bazel:25:13: in <toplevel>
Repository rule http_archive defined at:
  /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/bazel_tools/tools/build_defs/repo/http.bzl:364:31: in <toplevel>
DEBUG: Rule 'nccl' indicated that a canonical reproducible form can be obtained by modifying arguments commit = "7e515921295adaab72adf56ea71a0fafb0ecb5f3", shallow_since = "1625779814 -0700" and dropping ["tag"]
DEBUG: Repository nccl instantiated at:
  /nccl-fastsocket/WORKSPACE.bazel:8:6: in <toplevel>
  /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/bazel_tools/tools/build_defs/repo/utils.bzl:233:18: in maybe
Repository rule new_git_repository defined at:
  /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/bazel_tools/tools/build_defs/repo/git.bzl:186:37: in <toplevel>
INFO: Analyzed 8 targets (46 packages loaded, 1819 targets configured).
INFO: Found 8 targets...
INFO: From Compiling utilities.cc:
In file included from utilities.cc:1:
utilities.h:659:21: warning: 'ncclResult_t socketRecv(int, socketAddress*, void*, int)' defined but not used [-Wunused-function]
  659 | static ncclResult_t socketRecv(int fd, union socketAddress* addr, void* ptr,
      |                     ^~~~~~~~~~
utilities.h:648:21: warning: 'ncclResult_t socketSend(int, socketAddress*, void*, int)' defined but not used [-Wunused-function]
  648 | static ncclResult_t socketSend(int fd, union socketAddress* addr, void* ptr,
      |                     ^~~~~~~~~~
utilities.h:637:21: warning: 'ncclResult_t socketWait(int, int, socketAddress*, void*, int, int*)' defined but not used [-Wunused-function]
  637 | static ncclResult_t socketWait(int op, int fd, union socketAddress* addr,
      |                     ^~~~~~~~~~
utilities.h:625:21: warning: 'ncclResult_t socketProgress(int, int, socketAddress*, void*, int, int*)' defined but not used [-Wunused-function]
  625 | static ncclResult_t socketProgress(int op, int fd, union socketAddress* addr,
      |                     ^~~~~~~~~~~~~~
utilities.h:551:21: warning: 'ncclResult_t connectAddress(int*, socketAddress*)' defined but not used [-Wunused-function]
  551 | static ncclResult_t connectAddress(int* fd, union socketAddress* remoteAddr) {
      |                     ^~~~~~~~~~~~~~
utilities.h:511:21: warning: 'ncclResult_t createListenSocket(int*, socketAddress*)' defined but not used [-Wunused-function]
  511 | static ncclResult_t createListenSocket(int *fd, union socketAddress *localAddr) {
      |                     ^~~~~~~~~~~~~~~~~~
utilities.h:475:12: warning: 'int findInterfaces(char*, socketAddress*, int, int)' defined but not used [-Wunused-function]
  475 | static int findInterfaces(char* ifNames, union socketAddress *ifAddrs, int ifNameMaxSize, int maxIfs) {
      |            ^~~~~~~~~~~~~~
INFO: From Compiling net_fastsocket.cc:
net_fastsocket.cc: In function 'void* persistentSocketThread(void*)':
net_fastsocket.cc:1184:19: warning: comparison of integer expressions of different signedness: 'int' and 'std::__atomic_base<unsigned int>::__int_type' {aka 'unsigned int'} [-Wsign-compare]
 1184 |       while (mark == resource->next && *state != stop) {  // no new tasks, wait
      |              ~~~~~^~~~~~~~~~~~~~~~~
net_fastsocket.cc: In instantiation of 'ncclResult_t ncclBufferedSendSocket<BUF_SIZE>::send(void*, int) [with unsigned int BUF_SIZE = 128]':
net_fastsocket.cc:1277:3:   required from here
net_fastsocket.cc:329:11: warning: comparison of integer expressions of different signedness: 'int' and 'unsigned int' [-Wsign-compare]
  329 |     if (s > BUF_SIZE) return ncclInternalError;
      |         ~~^~~~~~~~~~
net_fastsocket.cc:330:17: warning: comparison of integer expressions of different signedness: 'int' and 'unsigned int' [-Wsign-compare]
  330 |     if (cur + s > BUF_SIZE) NCCLCHECK(sync());
      |         ~~~~~~~~^~~~~~~~~~
In file included from net_fastsocket.cc:36:
utilities.h: At global scope:
utilities.h:659:21: warning: 'ncclResult_t socketRecv(int, socketAddress*, void*, int)' defined but not used [-Wunused-function]
  659 | static ncclResult_t socketRecv(int fd, union socketAddress* addr, void* ptr,
      |                     ^~~~~~~~~~
utilities.h:648:21: warning: 'ncclResult_t socketSend(int, socketAddress*, void*, int)' defined but not used [-Wunused-function]
  648 | static ncclResult_t socketSend(int fd, union socketAddress* addr, void* ptr,
      |                     ^~~~~~~~~~
utilities.h:637:21: warning: 'ncclResult_t socketWait(int, int, socketAddress*, void*, int, int*)' defined but not used [-Wunused-function]
  637 | static ncclResult_t socketWait(int op, int fd, union socketAddress* addr,
      |                     ^~~~~~~~~~
utilities.h:625:21: warning: 'ncclResult_t socketProgress(int, int, socketAddress*, void*, int, int*)' defined but not used [-Wunused-function]
  625 | static ncclResult_t socketProgress(int op, int fd, union socketAddress* addr,
      |                     ^~~~~~~~~~~~~~
ERROR: /nccl-fastsocket/BUILD:63:8: MakeDeb google-fast-socket_0.0.5_amd64.deb failed: (Exit 127): make_deb failed: error executing command bazel-out/k8-opt-exec-2B5CBBC6/bin/external/rules_pkg/pkg/private/deb/make_deb '--output=bazel-out/k8-fastbuild/bin/google-fast-socket_0.0.5_amd64.deb' ... (remaining 11 arguments skipped)

Use --sandbox_debug to see verbose messages from the sandbox
/usr/bin/env: 'python3': No such file or directory
INFO: Elapsed time: 18.909s, Critical Path: 1.08s
INFO: 34 processes: 19 internal, 15 processwrapper-sandbox.
FAILED: Build did NOT complete successfully
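
The "Exit 127" above is the usual "command not found" status: make_deb is started through /usr/bin/env python3, and env could not find a python3. That is probably why the conda Python did not help, assuming the sandboxed action runs with a PATH that does not include /root/miniconda3/bin. A small sketch that reproduces the status code with a deliberately nonexistent command name:

```shell
#!/usr/bin/env bash
# env returns status 127 when it cannot find the command it was asked to
# run, the same code Bazel reported for make_deb.
# "no-such-interpreter-xyz" is a made-up name used only for this demo.
status=0
env no-such-interpreter-xyz 2>/dev/null || status=$?
echo "exit status: $status"
```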

Rerunning with the suggested --sandbox_debug option gives:

(base) root@3e1dbf935b8d:/nccl-fastsocket# bazel build :all --sandbox_debug
DEBUG: Rule 'rules_pkg' indicated that a canonical reproducible form can be obtained by modifying arguments sha256 = "e50157bd9b1fae89a629bfe0c1d2712ff96d07e9b68ec9c9da3a0b8ce9d1983f"
DEBUG: Repository rules_pkg instantiated at:
  /nccl-fastsocket/WORKSPACE.bazel:25:13: in <toplevel>
Repository rule http_archive defined at:
  /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/bazel_tools/tools/build_defs/repo/http.bzl:364:31: in <toplevel>
DEBUG: Rule 'nccl' indicated that a canonical reproducible form can be obtained by modifying arguments commit = "7e515921295adaab72adf56ea71a0fafb0ecb5f3", shallow_since = "1625779814 -0700" and dropping ["tag"]
DEBUG: Repository nccl instantiated at:
  /nccl-fastsocket/WORKSPACE.bazel:8:6: in <toplevel>
  /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/bazel_tools/tools/build_defs/repo/utils.bzl:233:18: in maybe
Repository rule new_git_repository defined at:
  /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/bazel_tools/tools/build_defs/repo/git.bzl:186:37: in <toplevel>
INFO: Analyzed 8 targets (0 packages loaded, 0 targets configured).
INFO: Found 8 targets...
ERROR: /nccl-fastsocket/BUILD:63:8: MakeDeb google-fast-socket_0.0.5_amd64.deb failed: (Exit 127): process-wrapper failed: error executing command
  (cd /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/sandbox/processwrapper-sandbox/18/execroot/fastsocket && \
  exec env - \
    LANG=en_US.UTF-8 \
    LC_CTYPE=UTF-8 \
    PYTHONIOENCODING=UTF-8 \
    PYTHONUTF8=1 \
    TMPDIR=/tmp \
  /root/.cache/bazel/_bazel_root/install/c87283ec3a7822eea44f4cecb6db792e/process-wrapper '--timeout=0' '--kill_delay=15' bazel-out/k8-opt-exec-2B5CBBC6/bin/external/rules_pkg/pkg/private/deb/make_deb '--output=bazel-out/k8-fastbuild/bin/google-fast-socket_0.0.5_amd64.deb' '--changes=bazel-out/k8-fastbuild/bin/google-fast-socket_0.0.5_amd64.changes' '--data=bazel-out/k8-fastbuild/bin/tarball.tar.gz' '--package=google-fast-socket' '--maintainer=Chang Lan <changlan@google.com>' '--architecture=amd64' '--triggers=@bazel-out/k8-fastbuild/bin/triggers' '--version=0.0.5' '--description=Fast Socket for NCCL 2' '--distribution=unstable' '--urgency=medium' '--recommends=libnccl2')
/usr/bin/env: 'python3': No such file or directory
INFO: Elapsed time: 0.290s, Critical Path: 0.01s
INFO: 2 processes: 2 internal.
FAILED: Build did NOT complete successfully

yselivonchyk commented 2 years ago

OK, installing a system Python solved the issue: apt-get install python3 -y. Having to do that is a minor inconvenience.
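
To catch this before starting a build, a preflight check along these lines may help. It is only a sketch: missing_tools is a hypothetical helper, and the search path below is an assumed approximation of what the sandboxed action sees, not something Bazel documents for this repository.

```shell
#!/usr/bin/env bash
# Print each named tool that cannot be found on the given search path,
# one per line; return non-zero if any tool is missing.
# missing_tools is a hypothetical helper, not part of Bazel or this repo.
missing_tools() {
  local search_path="$1"; shift
  local rc=0 tool
  for tool in "$@"; do
    # Look the tool up in a subshell so the caller's PATH is untouched.
    if ! ( PATH="$search_path"; command -v "$tool" ) >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# ASSUMPTION: a system-only PATH without the conda bin directory.
sandbox_like_path="/usr/local/bin:/usr/bin:/bin"

if missing="$(missing_tools "$sandbox_like_path" python3)"; then
  echo "python3 is visible outside conda"
else
  echo "missing: $missing (try: apt-get install python3 -y)" >&2
fi
```

Checking against a system-only path rather than the current shell's PATH is the point of the design: with conda activated, a plain `command -v python3` would succeed even though the build step cannot see that interpreter.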