Closed: yselivonchyk closed this 2 years ago
Thanks for reporting. Should be fixed in https://github.com/google/nccl-fastsocket/commit/284e8279afd46370401348101ed8e5f7168c9e6d. Can you verify?
Hi, I am still facing issues building the package.
(base) root@50cc2b63176d:/nccl-fastsocket# bazel build :all
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: SHA256 (https://github.com/bazelbuild/rules_pkg/archive/main.zip) = e50157bd9b1fae89a629bfe0c1d2712ff96d07e9b68ec9c9da3a0b8ce9d1983f
DEBUG: Rule 'rules_pkg' indicated that a canonical reproducible form can be obtained by modifying arguments sha256 = "e50157bd9b1fae89a629bfe0c1d2712ff96d07e9b68ec9c9da3a0b8ce9d1983f"
DEBUG: Repository rules_pkg instantiated at:
/nccl-fastsocket/WORKSPACE.bazel:25:13: in <toplevel>
Repository rule http_archive defined at:
/root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/bazel_tools/tools/build_defs/repo/http.bzl:336:31: in <toplevel>
ERROR: /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/rules_pkg/pkg/private/pkg_files.bzl:377:12: name 'json' is not defined
ERROR: Skipping ':all': while parsing ':all': error loading package '': in /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/rules_pkg/pkg.bzl: in /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/rules_pkg/pkg/zip.bzl: in /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/rules_pkg/pkg/private/zip/zip.bzl: Extension 'pkg/private/pkg_files.bzl' has errors
WARNING: Target pattern parsing failed.
ERROR: while parsing ':all': error loading package '': in /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/rules_pkg/pkg.bzl: in /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/rules_pkg/pkg/zip.bzl: in /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/rules_pkg/pkg/private/zip/zip.bzl: Extension 'pkg/private/pkg_files.bzl' has errors
INFO: Elapsed time: 14.797s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (1 packages loaded)
(base) root@50cc2b63176d:/nccl-fastsocket#
The easiest way to reproduce this is to run Docker:
docker pull nvidia/cuda
docker run -it nvidia/cuda /bin/bash
Then try the build:
apt-get update; apt-get install wget -y
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
chmod +x Miniconda3-latest-Linux-x86_64.sh && \
./Miniconda3-latest-Linux-x86_64.sh -b -p /root/miniconda3 && \
source /root/miniconda3/bin/activate && \
python -m pip install --upgrade pip && \
conda install git bazel -y && \
git clone https://github.com/google/nccl-fastsocket && \
cd nccl-fastsocket && \
bazel build :all
With an updated Bazel, the failure changed.
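The earlier name 'json' is not defined error suggests that the conda-provided Bazel was too old for the rules_pkg archive pulled from main, which appears to rely on Starlark's built-in json module. A small guard like the sketch below could catch that before the build starts. The 4.0.0 threshold and the names version_ge, check_bazel_version and MIN_BAZEL are assumptions for illustration, not something the project states.

```shell
#!/usr/bin/env bash
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort from GNU coreutils).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Assumed minimum version; adjust to whatever the build actually needs.
MIN_BAZEL="${MIN_BAZEL:-4.0.0}"

check_bazel_version() {
  local current
  # `bazel --version` prints a line such as "bazel 5.0.0".
  current="$(bazel --version 2>/dev/null | awk '{print $2}')"
  if [ -z "$current" ]; then
    echo "bazel not found on PATH" >&2
    return 1
  fi
  if version_ge "$current" "$MIN_BAZEL"; then
    echo "bazel $current is new enough (>= $MIN_BAZEL)"
  else
    echo "bazel $current is older than $MIN_BAZEL" >&2
    return 1
  fi
}
```

Running check_bazel_version && bazel build :all would then skip the build when the installed Bazel is older than the assumed minimum.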
Installation:
apt-get update; apt-get install wget -y
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
chmod +x Miniconda3-latest-Linux-x86_64.sh && \
./Miniconda3-latest-Linux-x86_64.sh -b -p /root/miniconda3 && \
source /root/miniconda3/bin/activate && \
python -m pip install --upgrade pip && \
conda install git -y && \
apt install apt-transport-https curl gnupg -y && \
curl -fsSL https://bazel.build/bazel-release.pub.gpg | gpg --dearmor > bazel.gpg && \
mv bazel.gpg /etc/apt/trusted.gpg.d/ && \
echo "deb [arch=amd64] https://storage.googleapis.com/bazel-apt stable jdk1.8" | tee /etc/apt/sources.list.d/bazel.list && \
apt update && apt install bazel -y && \
git clone https://github.com/google/nccl-fastsocket && \
cd nccl-fastsocket && \
bazel build :all
Result:
(base) root@3e1dbf935b8d:/# git clone https://github.com/google/nccl-fastsocket && \
> cd nccl-fastsocket && \
> bazel build :all
Cloning into 'nccl-fastsocket'...
remote: Enumerating objects: 46, done.
remote: Counting objects: 100% (46/46), done.
remote: Compressing objects: 100% (31/31), done.
remote: Total 46 (delta 22), reused 37 (delta 13), pack-reused 0
Receiving objects: 100% (46/46), 30.26 KiB | 1.16 MiB/s, done.
Resolving deltas: 100% (22/22), done.
Starting local Bazel server and connecting to it...
DEBUG: Rule 'rules_pkg' indicated that a canonical reproducible form can be obtained by modifying arguments sha256 = "e50157bd9b1fae89a629bfe0c1d2712ff96d07e9b68ec9c9da3a0b8ce9d1983f"
DEBUG: Repository rules_pkg instantiated at:
/nccl-fastsocket/WORKSPACE.bazel:25:13: in <toplevel>
Repository rule http_archive defined at:
/root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/bazel_tools/tools/build_defs/repo/http.bzl:364:31: in <toplevel>
DEBUG: Rule 'nccl' indicated that a canonical reproducible form can be obtained by modifying arguments commit = "7e515921295adaab72adf56ea71a0fafb0ecb5f3", shallow_since = "1625779814 -0700" and dropping ["tag"]
DEBUG: Repository nccl instantiated at:
/nccl-fastsocket/WORKSPACE.bazel:8:6: in <toplevel>
/root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/bazel_tools/tools/build_defs/repo/utils.bzl:233:18: in maybe
Repository rule new_git_repository defined at:
/root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/bazel_tools/tools/build_defs/repo/git.bzl:186:37: in <toplevel>
INFO: Analyzed 8 targets (46 packages loaded, 1819 targets configured).
INFO: Found 8 targets...
INFO: From Compiling utilities.cc:
In file included from utilities.cc:1:
utilities.h:659:21: warning: 'ncclResult_t socketRecv(int, socketAddress*, void*, int)' defined but not used [-Wunused-function]
659 | static ncclResult_t socketRecv(int fd, union socketAddress* addr, void* ptr,
| ^~~~~~~~~~
utilities.h:648:21: warning: 'ncclResult_t socketSend(int, socketAddress*, void*, int)' defined but not used [-Wunused-function]
648 | static ncclResult_t socketSend(int fd, union socketAddress* addr, void* ptr,
| ^~~~~~~~~~
utilities.h:637:21: warning: 'ncclResult_t socketWait(int, int, socketAddress*, void*, int, int*)' defined but not used [-Wunused-function]
637 | static ncclResult_t socketWait(int op, int fd, union socketAddress* addr,
| ^~~~~~~~~~
utilities.h:625:21: warning: 'ncclResult_t socketProgress(int, int, socketAddress*, void*, int, int*)' defined but not used [-Wunused-function]
625 | static ncclResult_t socketProgress(int op, int fd, union socketAddress* addr,
| ^~~~~~~~~~~~~~
utilities.h:551:21: warning: 'ncclResult_t connectAddress(int*, socketAddress*)' defined but not used [-Wunused-function]
551 | static ncclResult_t connectAddress(int* fd, union socketAddress* remoteAddr) {
| ^~~~~~~~~~~~~~
utilities.h:511:21: warning: 'ncclResult_t createListenSocket(int*, socketAddress*)' defined but not used [-Wunused-function]
511 | static ncclResult_t createListenSocket(int *fd, union socketAddress *localAddr) {
| ^~~~~~~~~~~~~~~~~~
utilities.h:475:12: warning: 'int findInterfaces(char*, socketAddress*, int, int)' defined but not used [-Wunused-function]
475 | static int findInterfaces(char* ifNames, union socketAddress *ifAddrs, int ifNameMaxSize, int maxIfs) {
| ^~~~~~~~~~~~~~
INFO: From Compiling net_fastsocket.cc:
net_fastsocket.cc: In function 'void* persistentSocketThread(void*)':
net_fastsocket.cc:1184:19: warning: comparison of integer expressions of different signedness: 'int' and 'std::__atomic_base<unsigned int>::__int_type' {aka 'unsigned int'} [-Wsign-compare]
1184 | while (mark == resource->next && *state != stop) { // no new tasks, wait
| ~~~~~^~~~~~~~~~~~~~~~~
net_fastsocket.cc: In instantiation of 'ncclResult_t ncclBufferedSendSocket<BUF_SIZE>::send(void*, int) [with unsigned int BUF_SIZE = 128]':
net_fastsocket.cc:1277:3: required from here
net_fastsocket.cc:329:11: warning: comparison of integer expressions of different signedness: 'int' and 'unsigned int' [-Wsign-compare]
329 | if (s > BUF_SIZE) return ncclInternalError;
| ~~^~~~~~~~~~
net_fastsocket.cc:330:17: warning: comparison of integer expressions of different signedness: 'int' and 'unsigned int' [-Wsign-compare]
330 | if (cur + s > BUF_SIZE) NCCLCHECK(sync());
| ~~~~~~~~^~~~~~~~~~
In file included from net_fastsocket.cc:36:
utilities.h: At global scope:
utilities.h:659:21: warning: 'ncclResult_t socketRecv(int, socketAddress*, void*, int)' defined but not used [-Wunused-function]
659 | static ncclResult_t socketRecv(int fd, union socketAddress* addr, void* ptr,
| ^~~~~~~~~~
utilities.h:648:21: warning: 'ncclResult_t socketSend(int, socketAddress*, void*, int)' defined but not used [-Wunused-function]
648 | static ncclResult_t socketSend(int fd, union socketAddress* addr, void* ptr,
| ^~~~~~~~~~
utilities.h:637:21: warning: 'ncclResult_t socketWait(int, int, socketAddress*, void*, int, int*)' defined but not used [-Wunused-function]
637 | static ncclResult_t socketWait(int op, int fd, union socketAddress* addr,
| ^~~~~~~~~~
utilities.h:625:21: warning: 'ncclResult_t socketProgress(int, int, socketAddress*, void*, int, int*)' defined but not used [-Wunused-function]
625 | static ncclResult_t socketProgress(int op, int fd, union socketAddress* addr,
| ^~~~~~~~~~~~~~
ERROR: /nccl-fastsocket/BUILD:63:8: MakeDeb google-fast-socket_0.0.5_amd64.deb failed: (Exit 127): make_deb failed: error executing command bazel-out/k8-opt-exec-2B5CBBC6/bin/external/rules_pkg/pkg/private/deb/make_deb '--output=bazel-out/k8-fastbuild/bin/google-fast-socket_0.0.5_amd64.deb' ... (remaining 11 arguments skipped)
Use --sandbox_debug to see verbose messages from the sandbox
/usr/bin/env: 'python3': No such file or directory
INFO: Elapsed time: 18.909s, Critical Path: 1.08s
INFO: 34 processes: 19 internal, 15 processwrapper-sandbox.
FAILED: Build did NOT complete successfully
With the recommended --sandbox_debug option, the output became:
(base) root@3e1dbf935b8d:/nccl-fastsocket# bazel build :all --sandbox_debug
DEBUG: Rule 'rules_pkg' indicated that a canonical reproducible form can be obtained by modifying arguments sha256 = "e50157bd9b1fae89a629bfe0c1d2712ff96d07e9b68ec9c9da3a0b8ce9d1983f"
DEBUG: Repository rules_pkg instantiated at:
/nccl-fastsocket/WORKSPACE.bazel:25:13: in <toplevel>
Repository rule http_archive defined at:
/root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/bazel_tools/tools/build_defs/repo/http.bzl:364:31: in <toplevel>
DEBUG: Rule 'nccl' indicated that a canonical reproducible form can be obtained by modifying arguments commit = "7e515921295adaab72adf56ea71a0fafb0ecb5f3", shallow_since = "1625779814 -0700" and dropping ["tag"]
DEBUG: Repository nccl instantiated at:
/nccl-fastsocket/WORKSPACE.bazel:8:6: in <toplevel>
/root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/bazel_tools/tools/build_defs/repo/utils.bzl:233:18: in maybe
Repository rule new_git_repository defined at:
/root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/external/bazel_tools/tools/build_defs/repo/git.bzl:186:37: in <toplevel>
INFO: Analyzed 8 targets (0 packages loaded, 0 targets configured).
INFO: Found 8 targets...
ERROR: /nccl-fastsocket/BUILD:63:8: MakeDeb google-fast-socket_0.0.5_amd64.deb failed: (Exit 127): process-wrapper failed: error executing command
(cd /root/.cache/bazel/_bazel_root/002cb6c7007d2c87f04621a3d83dea3b/sandbox/processwrapper-sandbox/18/execroot/fastsocket && \
exec env - \
LANG=en_US.UTF-8 \
LC_CTYPE=UTF-8 \
PYTHONIOENCODING=UTF-8 \
PYTHONUTF8=1 \
TMPDIR=/tmp \
/root/.cache/bazel/_bazel_root/install/c87283ec3a7822eea44f4cecb6db792e/process-wrapper '--timeout=0' '--kill_delay=15' bazel-out/k8-opt-exec-2B5CBBC6/bin/external/rules_pkg/pkg/private/deb/make_deb '--output=bazel-out/k8-fastbuild/bin/google-fast-socket_0.0.5_amd64.deb' '--changes=bazel-out/k8-fastbuild/bin/google-fast-socket_0.0.5_amd64.changes' '--data=bazel-out/k8-fastbuild/bin/tarball.tar.gz' '--package=google-fast-socket' '--maintainer=Chang Lan <changlan@google.com>' '--architecture=amd64' '--triggers=@bazel-out/k8-fastbuild/bin/triggers' '--version=0.0.5' '--description=Fast Socket for NCCL 2' '--distribution=unstable' '--urgency=medium' '--recommends=libnccl2')
/usr/bin/env: 'python3': No such file or directory
INFO: Elapsed time: 0.290s, Critical Path: 0.01s
INFO: 2 processes: 2 internal.
FAILED: Build did NOT complete successfully
OK, installing a system Python solved the issue: apt-get install python3 -y. This is a minor inconvenience.
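The --sandbox_debug output shows the make_deb action being started through exec env -, that is, with a cleared environment. That would explain why the python3 from the activated conda environment was not found while a system python3 is. The sketch below approximates that situation by looking tools up in a shell started with env -i. The function names are hypothetical, and this only approximates what Bazel's sandbox does.

```shell
#!/usr/bin/env bash
# Succeeds when the tool resolves in a shell started with an empty
# environment (env -i), so directories added to PATH by conda
# activation are not considered.
visible_without_env() {
  env -i /bin/sh -c 'command -v "$1" >/dev/null 2>&1' sh "$1"
}

# Prints each listed tool that is not visible that way and
# returns non-zero if at least one is missing.
missing_without_env() {
  local rc=0 tool
  for tool in "$@"; do
    if ! visible_without_env "$tool"; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}
```

For example, missing_without_env python3 || apt-get install python3 -y would install the system interpreter only when it is actually missing.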
With a working installation of CUDA:
The build doesn't get very far: