microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation
MIT License

AttributeError: module 'tutel_custom_kernel' has no attribute 'inject_source' #156

Closed s-kodge closed 2 years ago

s-kodge commented 2 years ago

Library installation steps:

    git clone https://github.com/microsoft/tutel --branch main
    python3 -m pip uninstall tutel -y
    python3 ./setup.py install --user

Running the example code:

    python3 -m tutel.examples.helloworld --batch_size=16

results in the following error:

[Statistics] param count for MoE local_experts = 16785408, param count for MoE gate = 4096.

ExampleModel(
  (_moe_layer): MOELayer(
    Top-K(s) = ['k=2, noise=0.0'], Total-Experts = 2 [managed by 1 device(s)],
    (experts): FusedExpertsNetwork(model_dim=2048, hidden_size=2048, output_dim=2048, local_experts=2)
    (gates): ModuleList(
      (0): LinearTopKGate(
        (wg): Linear(in_features=2048, out_features=2, bias=False)
      )
    )
  )
)
[Benchmark] world_size = 1, dtype = float32, model_dim = 2048, hidden_size = 2048, samples = 16384, num_local_experts = 2, topK = 2, a2a_ffn_overlap_degree = 1, parallel_type = `auto`, device = `cuda:0`
Traceback (most recent call last):
  File "/home/skodge/anaconda3/envs/moe/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/skodge/anaconda3/envs/moe/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/skodge/tutel/tutel/examples/helloworld.py", line 120, in <module>
    output = model(x)
  File "/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/skodge/tutel/tutel/examples/helloworld.py", line 85, in forward
    result = self._moe_layer(input)
  File "/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/skodge/tutel/tutel/impls/moe_layer.py", line 220, in forward
    logits_dtype, (crit, l_aux) = routing()
  File "/home/skodge/tutel/tutel/impls/moe_layer.py", line 208, in routing
    return logits.dtype, extract_critical(scores,
  File "/home/skodge/tutel/tutel/impls/fast_dispatch.py", line 158, in extract_critical
    locations1 = compute_location(masks_se[0])
  File "/home/skodge/tutel/tutel/jit_kernels/gating.py", line 85, in fast_cumsum_sub_one
    return get_cumsum_kernel(int(data.size(0)), int(data.size(1)))(data)
  File "/home/skodge/tutel/tutel/jit_kernels/gating.py", line 29, in get_cumsum_kernel
    base_kernel = JitCompiler.generate_kernel({'batch_num': global_experts, 'num_samples': samples}, '''
  File "/home/skodge/tutel/tutel/impls/jit_compiler.py", line 31, in generate_kernel
    return JitCompiler.create_raw(template)
  File "/home/skodge/tutel/tutel/impls/jit_compiler.py", line 21, in create_raw
    __ctx__ = tutel_custom_kernel.inject_source(source)
AttributeError: module 'tutel_custom_kernel' has no attribute 'inject_source'
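
The traceback shows that the `tutel_custom_kernel` module was imported but does not expose `inject_source`. A minimal sketch for checking this directly, without running the whole example, is shown below. The helper name `extension_has_attr` is hypothetical and not part of tutel; only the module and attribute names come from the error message.

```python
import importlib


def extension_has_attr(module_name="tutel_custom_kernel", attr="inject_source"):
    """Report whether an importable module exposes a given attribute.

    Returns True or False if the module imports, and None if it cannot
    be imported at all (for example, because it is not installed).
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return hasattr(module, attr)


if __name__ == "__main__":
    # Prints True, False, or None depending on the installed build.
    print(extension_has_attr())
```

A result of `False` matches the situation in this report: the extension is installed but was built without the function the JIT compiler calls. A result of `None` may only mean that the extension's shared-library dependencies were not loaded, so it can be worth importing `torch` first and retrying.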
ghostplant commented 2 years ago

Can you show the logs of python3 ./setup.py install --user?

s-kodge commented 2 years ago

Here are the logs. I see an error "/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include/ATen/cuda/CUDAContext.h:5:10: fatal error: 'cuda_runtime_api.h' file not found".

I am using torch 1.11.0 and CUDA 11.6. Which PyTorch version is recommended for this package?

running install
running bdist_egg
running egg_info
writing tutel.egg-info/PKG-INFO
writing dependency_links to tutel.egg-info/dependency_links.txt
writing requirements to tutel.egg-info/requires.txt
writing top-level names to tutel.egg-info/top_level.txt
reading manifest file 'tutel.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'tutel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
building 'tutel_custom_kernel' extension
/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/utils/cpp_extension.py:301: UserWarning: 

                               !! WARNING !!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++) is not compatible with the compiler Pytorch was
built with for this platform, which is g++ on linux. Please
use g++ to to compile your extension. Alternatively, you may
compile PyTorch from source using c++, and then you can also use
c++ to compile your extension.

See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help
with compiling PyTorch from source.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

                              !! WARNING !!

  warnings.warn(WRONG_COMPILER_WARNING.format(
Emitting ninja build file /home/skodge/repos/tutel/build/temp.linux-x86_64-3.9/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] c++ -MMD -MF /home/skodge/repos/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o.d -pthread -B /home/skodge/anaconda3/envs/moe/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /home/skodge/anaconda3/envs/moe/include -I/home/skodge/anaconda3/envs/moe/include -fPIC -O2 -isystem /home/skodge/anaconda3/envs/moe/include -fPIC -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include/TH -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/skodge/anaconda3/envs/moe/include/python3.9 -c -c /home/skodge/repos/tutel/tutel/custom/custom_kernel.cpp -o /home/skodge/repos/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -DUSE_GPU -DUSE_NCCL -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
FAILED: /home/skodge/repos/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o 
c++ -MMD -MF /home/skodge/repos/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o.d -pthread -B /home/skodge/anaconda3/envs/moe/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /home/skodge/anaconda3/envs/moe/include -I/home/skodge/anaconda3/envs/moe/include -fPIC -O2 -isystem /home/skodge/anaconda3/envs/moe/include -fPIC -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include/TH -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/skodge/anaconda3/envs/moe/include/python3.9 -c -c /home/skodge/repos/tutel/tutel/custom/custom_kernel.cpp -o /home/skodge/repos/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -DUSE_GPU -DUSE_NCCL -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
warning: unknown warning option '-Wno-unused-but-set-variable'; did you mean '-Wno-unused-const-variable'? [-Wunknown-warning-option]
In file included from /home/skodge/repos/tutel/tutel/custom/custom_kernel.cpp:7:
/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include/ATen/cuda/CUDAContext.h:5:10: fatal error: 'cuda_runtime_api.h' file not found
#include <cuda_runtime_api.h>
         ^~~~~~~~~~~~~~~~~~~~
1 warning and 1 error generated.
ninja: build stopped: subcommand failed.
Try installing without NCCL extension..
running install
running bdist_egg
running egg_info
writing tutel.egg-info/PKG-INFO
writing dependency_links to tutel.egg-info/dependency_links.txt
writing requirements to tutel.egg-info/requires.txt
writing top-level names to tutel.egg-info/top_level.txt
reading manifest file 'tutel.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'tutel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
building 'tutel_custom_kernel' extension
/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/utils/cpp_extension.py:301: UserWarning: 

                               !! WARNING !!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++) is not compatible with the compiler Pytorch was
built with for this platform, which is g++ on linux. Please
use g++ to to compile your extension. Alternatively, you may
compile PyTorch from source using c++, and then you can also use
c++ to compile your extension.

See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help
with compiling PyTorch from source.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

                              !! WARNING !!

  warnings.warn(WRONG_COMPILER_WARNING.format(
Emitting ninja build file /home/skodge/repos/tutel/build/temp.linux-x86_64-3.9/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] c++ -MMD -MF /home/skodge/repos/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o.d -pthread -B /home/skodge/anaconda3/envs/moe/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /home/skodge/anaconda3/envs/moe/include -I/home/skodge/anaconda3/envs/moe/include -fPIC -O2 -isystem /home/skodge/anaconda3/envs/moe/include -fPIC -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include/TH -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/skodge/anaconda3/envs/moe/include/python3.9 -c -c /home/skodge/repos/tutel/tutel/custom/custom_kernel.cpp -o /home/skodge/repos/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -DUSE_GPU -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
FAILED: /home/skodge/repos/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o 
c++ -MMD -MF /home/skodge/repos/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o.d -pthread -B /home/skodge/anaconda3/envs/moe/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /home/skodge/anaconda3/envs/moe/include -I/home/skodge/anaconda3/envs/moe/include -fPIC -O2 -isystem /home/skodge/anaconda3/envs/moe/include -fPIC -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include/TH -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/skodge/anaconda3/envs/moe/include/python3.9 -c -c /home/skodge/repos/tutel/tutel/custom/custom_kernel.cpp -o /home/skodge/repos/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -DUSE_GPU -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
warning: unknown warning option '-Wno-unused-but-set-variable'; did you mean '-Wno-unused-const-variable'? [-Wunknown-warning-option]
In file included from /home/skodge/repos/tutel/tutel/custom/custom_kernel.cpp:7:
/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include/ATen/cuda/CUDAContext.h:5:10: fatal error: 'cuda_runtime_api.h' file not found
#include <cuda_runtime_api.h>
         ^~~~~~~~~~~~~~~~~~~~
1 warning and 1 error generated.
ninja: build stopped: subcommand failed.
Try installing without CUDA extension..
running install
running bdist_egg
running egg_info
writing tutel.egg-info/PKG-INFO
writing dependency_links to tutel.egg-info/dependency_links.txt
writing requirements to tutel.egg-info/requires.txt
writing top-level names to tutel.egg-info/top_level.txt
reading manifest file 'tutel.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'tutel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
building 'tutel_custom_kernel' extension
/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/utils/cpp_extension.py:301: UserWarning: 

                               !! WARNING !!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++) is not compatible with the compiler Pytorch was
built with for this platform, which is g++ on linux. Please
use g++ to to compile your extension. Alternatively, you may
compile PyTorch from source using c++, and then you can also use
c++ to compile your extension.

See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help
with compiling PyTorch from source.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

                              !! WARNING !!

  warnings.warn(WRONG_COMPILER_WARNING.format(
Emitting ninja build file /home/skodge/repos/tutel/build/temp.linux-x86_64-3.9/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] c++ -MMD -MF /home/skodge/repos/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o.d -pthread -B /home/skodge/anaconda3/envs/moe/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /home/skodge/anaconda3/envs/moe/include -I/home/skodge/anaconda3/envs/moe/include -fPIC -O2 -isystem /home/skodge/anaconda3/envs/moe/include -fPIC -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include/TH -I/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/include/THC -I/home/skodge/anaconda3/envs/moe/include/python3.9 -c -c /home/skodge/repos/tutel/tutel/custom/custom_kernel.cpp -o /home/skodge/repos/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
warning: unknown warning option '-Wno-unused-but-set-variable'; did you mean '-Wno-unused-const-variable'? [-Wunknown-warning-option]
1 warning generated.
g++ -pthread -B /home/skodge/anaconda3/envs/moe/compiler_compat -shared -Wl,-rpath,/home/skodge/anaconda3/envs/moe/lib -Wl,-rpath-link,/home/skodge/anaconda3/envs/moe/lib -L/home/skodge/anaconda3/envs/moe/lib -L/home/skodge/anaconda3/envs/moe/lib -Wl,-rpath,/home/skodge/anaconda3/envs/moe/lib -Wl,-rpath-link,/home/skodge/anaconda3/envs/moe/lib -L/home/skodge/anaconda3/envs/moe/lib /home/skodge/repos/tutel/build/temp.linux-x86_64-3.9/./tutel/custom/custom_kernel.o -L/usr/local/cuda/lib64/stubs -L/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-3.9/tutel_custom_kernel.cpython-39-x86_64-linux-gnu.so
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/tutel
copying build/lib.linux-x86_64-3.9/tutel/net.py -> build/bdist.linux-x86_64/egg/tutel
creating build/bdist.linux-x86_64/egg/tutel/jit_kernels
copying build/lib.linux-x86_64-3.9/tutel/jit_kernels/sparse.py -> build/bdist.linux-x86_64/egg/tutel/jit_kernels
copying build/lib.linux-x86_64-3.9/tutel/jit_kernels/gating.py -> build/bdist.linux-x86_64/egg/tutel/jit_kernels
copying build/lib.linux-x86_64-3.9/tutel/jit_kernels/__init__.py -> build/bdist.linux-x86_64/egg/tutel/jit_kernels
creating build/bdist.linux-x86_64/egg/tutel/launcher
copying build/lib.linux-x86_64-3.9/tutel/launcher/__init__.py -> build/bdist.linux-x86_64/egg/tutel/launcher
copying build/lib.linux-x86_64-3.9/tutel/launcher/execl.py -> build/bdist.linux-x86_64/egg/tutel/launcher
copying build/lib.linux-x86_64-3.9/tutel/launcher/run.py -> build/bdist.linux-x86_64/egg/tutel/launcher
copying build/lib.linux-x86_64-3.9/tutel/jit.py -> build/bdist.linux-x86_64/egg/tutel
creating build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.9/tutel/examples/moe_cifar10.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.9/tutel/examples/helloworld_ddp_tutel.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.9/tutel/examples/moe_mnist.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.9/tutel/examples/helloworld_deepspeed.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.9/tutel/examples/__init__.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.9/tutel/examples/helloworld_from_scratch.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.9/tutel/examples/helloworld_amp.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.9/tutel/examples/helloworld_ddp.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.9/tutel/examples/helloworld.py -> build/bdist.linux-x86_64/egg/tutel/examples
creating build/bdist.linux-x86_64/egg/tutel/gates
copying build/lib.linux-x86_64-3.9/tutel/gates/top.py -> build/bdist.linux-x86_64/egg/tutel/gates
copying build/lib.linux-x86_64-3.9/tutel/gates/__init__.py -> build/bdist.linux-x86_64/egg/tutel/gates
copying build/lib.linux-x86_64-3.9/tutel/__init__.py -> build/bdist.linux-x86_64/egg/tutel
creating build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.9/tutel/parted/__init__.py -> build/bdist.linux-x86_64/egg/tutel/parted
creating build/bdist.linux-x86_64/egg/tutel/parted/backend
creating build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
copying build/lib.linux-x86_64-3.9/tutel/parted/backend/torch/executor.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
copying build/lib.linux-x86_64-3.9/tutel/parted/backend/torch/__init__.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
copying build/lib.linux-x86_64-3.9/tutel/parted/backend/torch/config.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
copying build/lib.linux-x86_64-3.9/tutel/parted/backend/__init__.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend
copying build/lib.linux-x86_64-3.9/tutel/parted/spmdx.py -> build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.9/tutel/parted/solver.py -> build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.9/tutel/parted/patterns.py -> build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.9/tutel/system.py -> build/bdist.linux-x86_64/egg/tutel
creating build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.9/tutel/impls/losses.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.9/tutel/impls/fast_dispatch.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.9/tutel/impls/jit_compiler.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.9/tutel/impls/__init__.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.9/tutel/impls/moe_layer.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.9/tutel/impls/communicate.py -> build/bdist.linux-x86_64/egg/tutel/impls
creating build/bdist.linux-x86_64/egg/tutel/experts
copying build/lib.linux-x86_64-3.9/tutel/experts/ffn.py -> build/bdist.linux-x86_64/egg/tutel/experts
copying build/lib.linux-x86_64-3.9/tutel/experts/__init__.py -> build/bdist.linux-x86_64/egg/tutel/experts
copying build/lib.linux-x86_64-3.9/tutel/moe.py -> build/bdist.linux-x86_64/egg/tutel
creating build/bdist.linux-x86_64/egg/tutel/custom
copying build/lib.linux-x86_64-3.9/tutel/custom/__init__.py -> build/bdist.linux-x86_64/egg/tutel/custom
copying build/lib.linux-x86_64-3.9/tutel_custom_kernel.cpython-39-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
byte-compiling build/bdist.linux-x86_64/egg/tutel/net.py to net.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/jit_kernels/sparse.py to sparse.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/jit_kernels/gating.py to gating.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/jit_kernels/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/launcher/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/launcher/execl.py to execl.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/launcher/run.py to run.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/jit.py to jit.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/moe_cifar10.py to moe_cifar10.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_ddp_tutel.py to helloworld_ddp_tutel.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/moe_mnist.py to moe_mnist.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_deepspeed.py to helloworld_deepspeed.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_from_scratch.py to helloworld_from_scratch.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_amp.py to helloworld_amp.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_ddp.py to helloworld_ddp.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld.py to helloworld.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/gates/top.py to top.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/gates/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/torch/executor.py to executor.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/torch/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/torch/config.py to config.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/spmdx.py to spmdx.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/solver.py to solver.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/patterns.py to patterns.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/system.py to system.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/losses.py to losses.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/fast_dispatch.py to fast_dispatch.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/jit_compiler.py to jit_compiler.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/moe_layer.py to moe_layer.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/communicate.py to communicate.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/experts/ffn.py to ffn.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/experts/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/moe.py to moe.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/custom/__init__.py to __init__.cpython-39.pyc
creating stub loader for tutel_custom_kernel.cpython-39-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/tutel_custom_kernel.py to tutel_custom_kernel.cpython-39.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
__pycache__.tutel_custom_kernel.cpython-39: module references __file__
creating 'dist/tutel-0.1-py3.9-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing tutel-0.1-py3.9-linux-x86_64.egg
removing '/home/skodge/.local/lib/python3.9/site-packages/tutel-0.1-py3.9-linux-x86_64.egg' (and everything under it)
creating /home/skodge/.local/lib/python3.9/site-packages/tutel-0.1-py3.9-linux-x86_64.egg
Extracting tutel-0.1-py3.9-linux-x86_64.egg to /home/skodge/.local/lib/python3.9/site-packages
tutel 0.1 is already the active version in easy-install.pth

Installed /home/skodge/.local/lib/python3.9/site-packages/tutel-0.1-py3.9-linux-x86_64.egg
Processing dependencies for tutel==0.1
Finished processing dependencies for tutel==0.1
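
The compile commands in the log pass `-I/usr/local/cuda/include`, yet the compiler still reports that `cuda_runtime_api.h` is missing, so the header presumably lives in a different CUDA directory. The following is a minimal sketch for locating it. The function name `find_cuda_headers` is hypothetical, and the search assumes the toolkit is installed in a directory under `/usr/local` whose name starts with `cuda`.

```python
import glob
import os


def find_cuda_headers(roots=("/usr/local",), header="cuda_runtime_api.h"):
    """Return the sorted directories under `roots` that contain `header`.

    Only subdirectories whose names start with "cuda" are searched,
    which is an assumption about how the toolkit was installed.
    """
    hits = set()
    for root in roots:
        pattern = os.path.join(root, "cuda*", "**", header)
        for path in glob.glob(pattern, recursive=True):
            # Resolve symlinks so that aliases of one directory collapse.
            hits.add(os.path.dirname(os.path.realpath(path)))
    return sorted(hits)


if __name__ == "__main__":
    for directory in find_cuda_headers():
        print(directory)
```

Any directory it prints is a candidate for the compiler's include path. An empty result suggests the CUDA toolkit headers are not installed under the searched roots.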
s-kodge commented 2 years ago

I fixed the issue. Here is what I did.

  1. Add the CUDA paths to .bashrc (see https://github.com/NVIDIA/nccl/issues/131#issuecomment-557546609). Look under /usr/local/ for the CUDA directory that contains cuda_runtime_api.h, then add the following three lines to your .bashrc, adjusting the paths to match your installation:

    export CPATH=/usr/local/cuda-11.6/targets/x86_64-linux/include/:$CPATH
    export LD_LIBRARY_PATH=/usr/local/cuda-11.6/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
    export PATH=/usr/local/cuda-11.6/bin:$PATH

    (At this point the installation error is fixed. Running the example code still fails, however, reporting that "cuda_runtime.h" could not be found, because the CUDA path is hard-coded in the tutel repository. I had to change the lines listed in step 2 to correct the paths.)

  2. Change the hard-coded paths in the following places. These locations may change in later commits, so use grep to find every occurrence of "/usr/local/cuda/":

    grep -r "/usr/local/cuda/" ./

    ./tutel/custom/custom_kernel.cpp, line 94:

    const char *entry = "/usr/local/cuda-11.6/bin/nvcc";

    ./tutel/custom/custom_kernel.cpp, line 121:

    std::vector<const char*> param_cstrings = {"--restrict", "--include-path=/usr/local/cuda-11.6/targets/x86_64-linux/include", arch_option.c_str(), "--use_fast_math", "--extra-device-vectorization"};

    ./setup.py, line 114:

    library_dirs=['/usr/local/cuda-11.6/lib64/stubs'],

  3. Source the .bashrc script and reinstall the tutel package:

    source ~/.bashrc
    python3 -m pip uninstall tutel -y
    python3 ./setup.py install --user
    python3 -m tutel.examples.helloworld --batch_size=16
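
Because the line numbers in step 2 may drift between commits, it can help to list every hard-coded occurrence before editing. The sketch below is a rough Python equivalent of the `grep -r` command from step 2. The function name `find_hardcoded_paths`, the file-extension filter, and the skipped directory names are assumptions made for illustration and do not come from the tutel repository.

```python
import os


def find_hardcoded_paths(root=".", needle="/usr/local/cuda/",
                         exts=(".py", ".cpp", ".h", ".cu")):
    """Yield (path, line_number, line) for each source line containing `needle`."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip version-control metadata and build output (assumed names).
        dirnames[:] = [d for d in dirnames if d not in {".git", "build", "dist"}]
        for name in filenames:
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for line_number, line in enumerate(handle, start=1):
                    if needle in line:
                        yield path, line_number, line.rstrip("\n")


if __name__ == "__main__":
    for path, line_number, line in find_hardcoded_paths():
        print(f"{path}:{line_number}: {line.strip()}")
```

Run from the root of the cloned repository, it prints each file and line that still points at `/usr/local/cuda/`, which are the places to adjust for a versioned directory such as `/usr/local/cuda-11.6/`.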