Closed Luodian closed 2 years ago
running install
/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing tutel.egg-info/PKG-INFO
writing dependency_links to tutel.egg-info/dependency_links.txt
writing requirements to tutel.egg-info/requires.txt
writing top-level names to tutel.egg-info/top_level.txt
reading manifest file 'tutel.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'tutel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib.linux-x86_64-3.9
creating build/lib.linux-x86_64-3.9/tutel
copying tutel/moe.py -> build/lib.linux-x86_64-3.9/tutel
copying tutel/net.py -> build/lib.linux-x86_64-3.9/tutel
copying tutel/__init__.py -> build/lib.linux-x86_64-3.9/tutel
copying tutel/jit.py -> build/lib.linux-x86_64-3.9/tutel
copying tutel/system.py -> build/lib.linux-x86_64-3.9/tutel
creating build/lib.linux-x86_64-3.9/tutel/jit_kernels
copying tutel/jit_kernels/gating.py -> build/lib.linux-x86_64-3.9/tutel/jit_kernels
copying tutel/jit_kernels/__init__.py -> build/lib.linux-x86_64-3.9/tutel/jit_kernels
copying tutel/jit_kernels/sparse.py -> build/lib.linux-x86_64-3.9/tutel/jit_kernels
creating build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/helloworld_ddp.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/moe_mnist.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/helloworld.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/helloworld_ddp_tutel.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/moe_cifar10.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/__init__.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/helloworld_amp.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/helloworld_from_scratch.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/helloworld_deepspeed.py -> build/lib.linux-x86_64-3.9/tutel/examples
creating build/lib.linux-x86_64-3.9/tutel/launcher
copying tutel/launcher/execl.py -> build/lib.linux-x86_64-3.9/tutel/launcher
copying tutel/launcher/__init__.py -> build/lib.linux-x86_64-3.9/tutel/launcher
copying tutel/launcher/run.py -> build/lib.linux-x86_64-3.9/tutel/launcher
creating build/lib.linux-x86_64-3.9/tutel/gates
copying tutel/gates/top.py -> build/lib.linux-x86_64-3.9/tutel/gates
copying tutel/gates/__init__.py -> build/lib.linux-x86_64-3.9/tutel/gates
copying tutel/gates/cosine_top.py -> build/lib.linux-x86_64-3.9/tutel/gates
creating build/lib.linux-x86_64-3.9/tutel/custom
copying tutel/custom/__init__.py -> build/lib.linux-x86_64-3.9/tutel/custom
creating build/lib.linux-x86_64-3.9/tutel/experts
copying tutel/experts/ffn.py -> build/lib.linux-x86_64-3.9/tutel/experts
copying tutel/experts/__init__.py -> build/lib.linux-x86_64-3.9/tutel/experts
creating build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/fast_dispatch.py -> build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/moe_layer.py -> build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/losses.py -> build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/jit_compiler.py -> build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/overlap.py -> build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/__init__.py -> build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/communicate.py -> build/lib.linux-x86_64-3.9/tutel/impls
creating build/lib.linux-x86_64-3.9/tutel/parted
copying tutel/parted/patterns.py -> build/lib.linux-x86_64-3.9/tutel/parted
copying tutel/parted/__init__.py -> build/lib.linux-x86_64-3.9/tutel/parted
copying tutel/parted/spmdx.py -> build/lib.linux-x86_64-3.9/tutel/parted
copying tutel/parted/solver.py -> build/lib.linux-x86_64-3.9/tutel/parted
creating build/lib.linux-x86_64-3.9/tutel/parted/backend
copying tutel/parted/backend/__init__.py -> build/lib.linux-x86_64-3.9/tutel/parted/backend
creating build/lib.linux-x86_64-3.9/tutel/parted/backend/torch
copying tutel/parted/backend/torch/executor.py -> build/lib.linux-x86_64-3.9/tutel/parted/backend/torch
copying tutel/parted/backend/torch/__init__.py -> build/lib.linux-x86_64-3.9/tutel/parted/backend/torch
copying tutel/parted/backend/torch/config.py -> build/lib.linux-x86_64-3.9/tutel/parted/backend/torch
running build_ext
building 'tutel_custom_kernel' extension
creating /mnt/lustre/tutel/build/temp.linux-x86_64-3.9
creating /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel
creating /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel/custom
Emitting ninja build file /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] c++ -MMD -MF /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o.d -pthread -B /mnt/lustre/anaconda3/envs/scale/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /mnt/lustre/anaconda3/envs/scale/include -I/mnt/lustre/anaconda3/envs/scale/include -fPIC -O2 -isystem /mnt/lustre/anaconda3/envs/scale/include -fPIC -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/TH -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/mnt/lustre/anaconda3/envs/scale/include/python3.9 -c -c /mnt/lustre/tutel/tutel/custom/custom_kernel.cpp -o /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -Wno-terminate -Wno-unused-function -Wno-strict-aliasing -DUSE_GPU -DUSE_NCCL -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
FAILED: /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o
c++ -MMD -MF /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o.d -pthread -B /mnt/lustre/anaconda3/envs/scale/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /mnt/lustre/anaconda3/envs/scale/include -I/mnt/lustre/anaconda3/envs/scale/include -fPIC -O2 -isystem /mnt/lustre/anaconda3/envs/scale/include -fPIC -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/TH -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/mnt/lustre/anaconda3/envs/scale/include/python3.9 -c -c /mnt/lustre/tutel/tutel/custom/custom_kernel.cpp -o /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -Wno-terminate -Wno-unused-function -Wno-strict-aliasing -DUSE_GPU -DUSE_NCCL -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
/mnt/lustre/tutel/tutel/custom/custom_kernel.cpp:19:10: fatal error: nccl.h: No such file or directory
   19 | #include <nccl.h>
      |          ^~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Try installing without NCCL extension..
running install
/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing tutel.egg-info/PKG-INFO
writing dependency_links to tutel.egg-info/dependency_links.txt
writing requirements to tutel.egg-info/requires.txt
writing top-level names to tutel.egg-info/top_level.txt
reading manifest file 'tutel.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'tutel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
copying tutel/moe.py -> build/lib.linux-x86_64-3.9/tutel
Try installing without CUDA extension..
running install
/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing tutel.egg-info/PKG-INFO
writing dependency_links to tutel.egg-info/dependency_links.txt
writing requirements to tutel.egg-info/requires.txt
writing top-level names to tutel.egg-info/top_level.txt
reading manifest file 'tutel.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'tutel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
copying tutel/moe.py -> build/lib.linux-x86_64-3.9/tutel
error: could not create 'build/lib.linux-x86_64-3.9/tutel/moe.py': No such file or directory
I fixed this issue by installing NCCL, which provides the missing nccl.h header:
sudo apt install libnccl2 libnccl-dev
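
For anyone hitting the same `nccl.h: No such file or directory` error, here is a minimal sketch for checking whether the header is visible before rebuilding. The helper name `find_header` and the list of candidate directories are my own assumptions, not part of tutel; only `/usr/local/cuda/include` is taken from the `-I` flags in the log above, and your system may keep the header elsewhere.

```python
import os
from pathlib import Path


def find_header(name, search_dirs):
    """Return every existing path to `name` directly under the given directories."""
    return [str(Path(d) / name) for d in search_dirs if (Path(d) / name).is_file()]


if __name__ == "__main__":
    # Assumed candidate include directories; adjust for your machine.
    candidates = ["/usr/include", "/usr/local/include", "/usr/local/cuda/include"]
    # gcc also searches the directories listed in the CPATH environment variable.
    candidates += [d for d in os.environ.get("CPATH", "").split(os.pathsep) if d]

    hits = find_header("nccl.h", candidates)
    if hits:
        print("found:", ", ".join(hits))
    else:
        print("nccl.h not found in the checked directories")
```

If nothing is found, installing the NCCL development package as above (or pointing the compiler at wherever NCCL lives) should be the first thing to try.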