microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation
MIT License

[installation errors] fatal error: nccl.h: No such file or directory #175

Closed by Luodian 2 years ago

Luodian commented 2 years ago
running install
/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing tutel.egg-info/PKG-INFO
writing dependency_links to tutel.egg-info/dependency_links.txt
writing requirements to tutel.egg-info/requires.txt
writing top-level names to tutel.egg-info/top_level.txt
reading manifest file 'tutel.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'tutel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib.linux-x86_64-3.9
creating build/lib.linux-x86_64-3.9/tutel
copying tutel/moe.py -> build/lib.linux-x86_64-3.9/tutel
copying tutel/net.py -> build/lib.linux-x86_64-3.9/tutel
copying tutel/__init__.py -> build/lib.linux-x86_64-3.9/tutel
copying tutel/jit.py -> build/lib.linux-x86_64-3.9/tutel
copying tutel/system.py -> build/lib.linux-x86_64-3.9/tutel
creating build/lib.linux-x86_64-3.9/tutel/jit_kernels
copying tutel/jit_kernels/gating.py -> build/lib.linux-x86_64-3.9/tutel/jit_kernels
copying tutel/jit_kernels/__init__.py -> build/lib.linux-x86_64-3.9/tutel/jit_kernels
copying tutel/jit_kernels/sparse.py -> build/lib.linux-x86_64-3.9/tutel/jit_kernels
creating build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/helloworld_ddp.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/moe_mnist.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/helloworld.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/helloworld_ddp_tutel.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/moe_cifar10.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/__init__.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/helloworld_amp.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/helloworld_from_scratch.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/helloworld_deepspeed.py -> build/lib.linux-x86_64-3.9/tutel/examples
creating build/lib.linux-x86_64-3.9/tutel/launcher
copying tutel/launcher/execl.py -> build/lib.linux-x86_64-3.9/tutel/launcher
copying tutel/launcher/__init__.py -> build/lib.linux-x86_64-3.9/tutel/launcher
copying tutel/launcher/run.py -> build/lib.linux-x86_64-3.9/tutel/launcher
creating build/lib.linux-x86_64-3.9/tutel/gates
copying tutel/gates/top.py -> build/lib.linux-x86_64-3.9/tutel/gates
copying tutel/gates/__init__.py -> build/lib.linux-x86_64-3.9/tutel/gates
copying tutel/gates/cosine_top.py -> build/lib.linux-x86_64-3.9/tutel/gates
creating build/lib.linux-x86_64-3.9/tutel/custom
copying tutel/custom/__init__.py -> build/lib.linux-x86_64-3.9/tutel/custom
creating build/lib.linux-x86_64-3.9/tutel/experts
copying tutel/experts/ffn.py -> build/lib.linux-x86_64-3.9/tutel/experts
copying tutel/experts/__init__.py -> build/lib.linux-x86_64-3.9/tutel/experts
creating build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/fast_dispatch.py -> build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/moe_layer.py -> build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/losses.py -> build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/jit_compiler.py -> build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/overlap.py -> build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/__init__.py -> build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/communicate.py -> build/lib.linux-x86_64-3.9/tutel/impls
creating build/lib.linux-x86_64-3.9/tutel/parted
copying tutel/parted/patterns.py -> build/lib.linux-x86_64-3.9/tutel/parted
copying tutel/parted/__init__.py -> build/lib.linux-x86_64-3.9/tutel/parted
copying tutel/parted/spmdx.py -> build/lib.linux-x86_64-3.9/tutel/parted
copying tutel/parted/solver.py -> build/lib.linux-x86_64-3.9/tutel/parted
creating build/lib.linux-x86_64-3.9/tutel/parted/backend
copying tutel/parted/backend/__init__.py -> build/lib.linux-x86_64-3.9/tutel/parted/backend
creating build/lib.linux-x86_64-3.9/tutel/parted/backend/torch
copying tutel/parted/backend/torch/executor.py -> build/lib.linux-x86_64-3.9/tutel/parted/backend/torch
copying tutel/parted/backend/torch/__init__.py -> build/lib.linux-x86_64-3.9/tutel/parted/backend/torch
copying tutel/parted/backend/torch/config.py -> build/lib.linux-x86_64-3.9/tutel/parted/backend/torch
running build_ext
building 'tutel_custom_kernel' extension
creating /mnt/lustre/tutel/build/temp.linux-x86_64-3.9
creating /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel
creating /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel/custom
Emitting ninja build file /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] c++ -MMD -MF /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o.d -pthread -B /mnt/lustre/anaconda3/envs/scale/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /mnt/lustre/anaconda3/envs/scale/include -I/mnt/lustre/anaconda3/envs/scale/include -fPIC -O2 -isystem /mnt/lustre/anaconda3/envs/scale/include -fPIC -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/TH -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/mnt/lustre/anaconda3/envs/scale/include/python3.9 -c -c /mnt/lustre/tutel/tutel/custom/custom_kernel.cpp -o /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -Wno-terminate -Wno-unused-function -Wno-strict-aliasing -DUSE_GPU -DUSE_NCCL -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
FAILED: /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o
c++ -MMD -MF /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o.d -pthread -B /mnt/lustre/anaconda3/envs/scale/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /mnt/lustre/anaconda3/envs/scale/include -I/mnt/lustre/anaconda3/envs/scale/include -fPIC -O2 -isystem /mnt/lustre/anaconda3/envs/scale/include -fPIC -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/TH -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/mnt/lustre/anaconda3/envs/scale/include/python3.9 -c -c /mnt/lustre/tutel/tutel/custom/custom_kernel.cpp -o /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -Wno-terminate -Wno-unused-function -Wno-strict-aliasing -DUSE_GPU -DUSE_NCCL -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
/mnt/lustre/tutel/tutel/custom/custom_kernel.cpp:19:10: fatal error: nccl.h: No such file or directory
   19 | #include <nccl.h>
      |          ^~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Try installing without NCCL extension..
running install
/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing tutel.egg-info/PKG-INFO
writing dependency_links to tutel.egg-info/dependency_links.txt
writing requirements to tutel.egg-info/requires.txt
writing top-level names to tutel.egg-info/top_level.txt
reading manifest file 'tutel.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'tutel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
copying tutel/moe.py -> build/lib.linux-x86_64-3.9/tutel
Try installing without CUDA extension..
running install
/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing tutel.egg-info/PKG-INFO
writing dependency_links to tutel.egg-info/dependency_links.txt
writing requirements to tutel.egg-info/requires.txt
writing top-level names to tutel.egg-info/top_level.txt
reading manifest file 'tutel.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'tutel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
copying tutel/moe.py -> build/lib.linux-x86_64-3.9/tutel
error: could not create 'build/lib.linux-x86_64-3.9/tutel/moe.py': No such file or directory
Luodian commented 2 years ago

I fixed this issue by installing NCCL; the libnccl-dev package provides the missing nccl.h header:

sudo apt install libnccl2 libnccl-dev
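
Before rerunning the build, it can help to confirm that nccl.h is actually visible in a directory the compiler searches. The following is a minimal sketch, not part of Tutel: the helper names `find_header` and `candidate_include_dirs` are hypothetical, and the directory list is an assumption based on the include paths in the log above (`/usr/local/cuda/include`) plus the usual system locations.

```python
import os
from pathlib import Path


def find_header(name, include_dirs):
    """Return the first existing path to `name` under include_dirs, or None."""
    for directory in include_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


def candidate_include_dirs(env=None):
    """Build a list of directories worth checking (an assumed, non-exhaustive list)."""
    env = os.environ if env is None else env
    dirs = []
    # GCC treats CPATH entries as additional include directories.
    dirs += [p for p in env.get("CPATH", "").split(os.pathsep) if p]
    # The failing compile command passed -I/usr/local/cuda/include;
    # CUDA_HOME, if set, is assumed to point at the CUDA install root.
    cuda_home = env.get("CUDA_HOME", "/usr/local/cuda")
    dirs.append(os.path.join(cuda_home, "include"))
    # Common system include locations on Linux.
    dirs += ["/usr/include", "/usr/local/include"]
    return dirs


if __name__ == "__main__":
    hit = find_header("nccl.h", candidate_include_dirs())
    if hit:
        print(f"nccl.h found at {hit}")
    else:
        print("nccl.h not found in the checked directories")
```

If the script reports that the header is missing, the NCCL development package is probably not installed or sits in a directory outside this list; in the latter case, adding that directory to `CPATH` is one way to make it visible to the compiler.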