microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation
MIT License

Cannot Import JIT optimized kernels? #171

Closed Luodian closed 2 years ago

Luodian commented 2 years ago

I have used Tutel for a while and it has worked very well. But today I updated my environment and reinstalled tutel, and now it crashes while importing the module. Do you have any idea why this happens? Thanks!

>>> from tutel import moe as tutel_moe
Traceback (most recent call last):
  File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/tutel/impls/jit_compiler.py", line 8, in <module>
    import tutel_custom_kernel
ImportError: /mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/tutel_custom_kernel.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/tutel/moe.py", line 6, in <module>
    from .jit_kernels.gating import fast_cumsum_sub_one
  File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/tutel/jit_kernels/gating.py", line 7, in <module>
    from ..impls.jit_compiler import tutel_custom_kernel
  File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/tutel/impls/jit_compiler.py", line 10, in <module>
    raise Exception("Cannot import JIT optimized kernels. Did you forget to install Custom Kernel Extension?")
Exception: Cannot import JIT optimized kernels. Did you forget to install Custom Kernel Extension?
ghostplant commented 2 years ago

Thanks for the report. Can you purge and reinstall tutel (preferably by running setup.py from source) and share the installation logs with us? We'll check them for error messages. We are confident that the errors above are due to a CPU fallback installation, which disables GPU features.

Luodian commented 2 years ago

This is the error I get when compiling manually.

running build_ext
building 'tutel_custom_kernel' extension
creating /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9
creating /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9/tutel
creating /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9/tutel/custom
Emitting ninja build file /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] /mnt/lustre/bli/anaconda3/envs/scale/bin/x86_64-conda-linux-gnu-c++ -MMD -MF /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -I/mnt/lustre/bli/anaconda3/envs/scale/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -fPIC -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/TH -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/THC -I/mnt/lustre/share/cuda-11.3/include -I/mnt/lustre/bli/anaconda3/envs/scale/include/python3.9 -c -c /mnt/lustre/bli/projects/tutel/tutel/custom/custom_kernel.cpp -o /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -Wno-terminate -Wno-unused-function -DUSE_GPU -DUSE_NCCL -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
FAILED: /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o
/mnt/lustre/bli/anaconda3/envs/scale/bin/x86_64-conda-linux-gnu-c++ -MMD -MF /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -I/mnt/lustre/bli/anaconda3/envs/scale/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -fPIC -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/TH -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/THC -I/mnt/lustre/share/cuda-11.3/include -I/mnt/lustre/bli/anaconda3/envs/scale/include/python3.9 -c -c /mnt/lustre/bli/projects/tutel/tutel/custom/custom_kernel.cpp -o /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -Wno-terminate -Wno-unused-function -DUSE_GPU -DUSE_NCCL -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
/mnt/lustre/bli/projects/tutel/tutel/custom/custom_kernel.cpp:19:10: fatal error: nccl.h: No such file or directory
   19 | #include <nccl.h>
      |          ^~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Try installing without NCCL extension..
running install
/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing tutel.egg-info/PKG-INFO
writing dependency_links to tutel.egg-info/dependency_links.txt
writing requirements to tutel.egg-info/requires.txt
writing top-level names to tutel.egg-info/top_level.txt
reading manifest file 'tutel.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'tutel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
building 'tutel_custom_kernel' extension
Emitting ninja build file /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9/build.ninja...
ghostplant commented 2 years ago

OK, but I think the log is incomplete. The build tries the variants in this order: (a) CUDA + NCCL -> (b) CUDA -> (c) CPU, while your logs only include part (a).
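
The fallback order described above can be pictured with a small sketch. This is not Tutel's actual setup.py; `build_with_fallback`, `fake_build`, and `VARIANTS` are hypothetical names used only for illustration. The only details taken from the logs are the `-DUSE_GPU` and `-DUSE_NCCL` flags and the missing `nccl.h` error.

```python
def build_with_fallback(variants, build):
    """Try each (name, flags) variant in order; return the first name that builds.

    `build` is any callable that raises an exception when compilation fails.
    """
    failed = []
    for name, flags in variants:
        try:
            build(flags)
            return name
        except Exception as exc:  # a real build script would catch narrower errors
            failed.append(name)
            print(f"{name} build failed ({exc}); trying the next variant")
    raise RuntimeError(f"all variants failed: {failed}")


def fake_build(flags):
    # Hypothetical stand-in for the compiler call: fails the way the log does
    # when NCCL headers are requested but not installed.
    if "-DUSE_NCCL" in flags:
        raise OSError("nccl.h: No such file or directory")


# Order taken from the comment above: (a) CUDA + NCCL, (b) CUDA, (c) CPU.
VARIANTS = [
    ("cuda+nccl", ["-DUSE_GPU", "-DUSE_NCCL"]),
    ("cuda", ["-DUSE_GPU"]),
    ("cpu", []),
]

if __name__ == "__main__":
    print("built variant:", build_with_fallback(VARIANTS, fake_build))
```

With the stand-in above, the first variant fails and the second succeeds, so a complete log would show one failed attempt followed by a successful one.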

Luodian commented 2 years ago

Hi, I have listed the full installation output here, followed by the nvcc, torch, and gcc versions. Thanks!

running install
/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing tutel.egg-info/PKG-INFO
writing dependency_links to tutel.egg-info/dependency_links.txt
writing requirements to tutel.egg-info/requires.txt
writing top-level names to tutel.egg-info/top_level.txt
reading manifest file 'tutel.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'tutel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
building 'tutel_custom_kernel' extension
Emitting ninja build file /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] /mnt/lustre/bli/anaconda3/envs/scale/bin/x86_64-conda-linux-gnu-c++ -MMD -MF /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -I/mnt/lustre/bli/anaconda3/envs/scale/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -fPIC -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/TH -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/THC -I/mnt/lustre/share/cuda-11.3/include -I/mnt/lustre/bli/anaconda3/envs/scale/include/python3.9 -c -c /mnt/lustre/bli/projects/tutel/tutel/custom/custom_kernel.cpp -o /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -Wno-terminate -Wno-unused-function -DUSE_GPU -DUSE_NCCL -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
FAILED: /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o
/mnt/lustre/bli/anaconda3/envs/scale/bin/x86_64-conda-linux-gnu-c++ -MMD -MF /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -I/mnt/lustre/bli/anaconda3/envs/scale/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -fPIC -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/TH -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/THC -I/mnt/lustre/share/cuda-11.3/include -I/mnt/lustre/bli/anaconda3/envs/scale/include/python3.9 -c -c /mnt/lustre/bli/projects/tutel/tutel/custom/custom_kernel.cpp -o /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -Wno-terminate -Wno-unused-function -DUSE_GPU -DUSE_NCCL -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
/mnt/lustre/bli/projects/tutel/tutel/custom/custom_kernel.cpp:19:10: fatal error: nccl.h: No such file or directory
   19 | #include <nccl.h>
      |          ^~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Try installing without NCCL extension..
running install
/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing tutel.egg-info/PKG-INFO
writing dependency_links to tutel.egg-info/dependency_links.txt
writing requirements to tutel.egg-info/requires.txt
writing top-level names to tutel.egg-info/top_level.txt
reading manifest file 'tutel.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'tutel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
building 'tutel_custom_kernel' extension
Emitting ninja build file /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] /mnt/lustre/bli/anaconda3/envs/scale/bin/x86_64-conda-linux-gnu-c++ -MMD -MF /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -I/mnt/lustre/bli/anaconda3/envs/scale/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -fPIC -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/TH -I/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/THC -I/mnt/lustre/share/cuda-11.3/include -I/mnt/lustre/bli/anaconda3/envs/scale/include/python3.9 -c -c /mnt/lustre/bli/projects/tutel/tutel/custom/custom_kernel.cpp -o /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -Wno-terminate -Wno-unused-function -DUSE_GPU -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
/mnt/lustre/bli/anaconda3/envs/scale/bin/x86_64-conda-linux-gnu-c++ -shared -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,-rpath,/mnt/lustre/bli/anaconda3/envs/scale/lib -Wl,-rpath-link,/mnt/lustre/bli/anaconda3/envs/scale/lib -L/mnt/lustre/bli/anaconda3/envs/scale/lib -L/mnt/lustre/bli/anaconda3/envs/scale/lib -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,-rpath,/mnt/lustre/bli/anaconda3/envs/scale/lib -Wl,-rpath-link,/mnt/lustre/bli/anaconda3/envs/scale/lib -L/mnt/lustre/bli/anaconda3/envs/scale/lib -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,-rpath,/mnt/lustre/bli/anaconda3/envs/scale/lib -Wl,-rpath-link,/mnt/lustre/bli/anaconda3/envs/scale/lib -L/mnt/lustre/bli/anaconda3/envs/scale/lib -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /mnt/lustre/bli/anaconda3/envs/scale/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /mnt/lustre/bli/anaconda3/envs/scale/include /mnt/lustre/bli/projects/tutel/build/temp.linux-x86_64-3.9/./tutel/custom/custom_kernel.o -L/usr/local/cuda/lib64/stubs -L/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/lib -L/mnt/lustre/share/cuda-11.3/lib64 -lcuda -lnvrtc -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda_cu -ltorch_cuda_cpp -o build/lib.linux-x86_64-3.9/tutel_custom_kernel.cpython-39-x86_64-linux-gnu.so
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/tutel
copying build/lib.linux-x86_64-3.9/tutel/net.py -> build/bdist.linux-x86_64/egg/tutel
copying build/lib.linux-x86_64-3.9/tutel/jit.py -> build/bdist.linux-x86_64/egg/tutel
creating build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.9/tutel/impls/fast_dispatch.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.9/tutel/impls/communicate.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.9/tutel/impls/jit_compiler.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.9/tutel/impls/losses.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.9/tutel/impls/overlap.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.9/tutel/impls/__init__.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.9/tutel/impls/moe_layer.py -> build/bdist.linux-x86_64/egg/tutel/impls
creating build/bdist.linux-x86_64/egg/tutel/parted
creating build/bdist.linux-x86_64/egg/tutel/parted/backend
creating build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
copying build/lib.linux-x86_64-3.9/tutel/parted/backend/torch/config.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
copying build/lib.linux-x86_64-3.9/tutel/parted/backend/torch/__init__.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
copying build/lib.linux-x86_64-3.9/tutel/parted/backend/torch/executor.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
copying build/lib.linux-x86_64-3.9/tutel/parted/backend/__init__.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend
copying build/lib.linux-x86_64-3.9/tutel/parted/solver.py -> build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.9/tutel/parted/spmdx.py -> build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.9/tutel/parted/__init__.py -> build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.9/tutel/parted/patterns.py -> build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.9/tutel/moe.py -> build/bdist.linux-x86_64/egg/tutel
creating build/bdist.linux-x86_64/egg/tutel/experts
copying build/lib.linux-x86_64-3.9/tutel/experts/ffn.py -> build/bdist.linux-x86_64/egg/tutel/experts
copying build/lib.linux-x86_64-3.9/tutel/experts/__init__.py -> build/bdist.linux-x86_64/egg/tutel/experts
creating build/bdist.linux-x86_64/egg/tutel/launcher
copying build/lib.linux-x86_64-3.9/tutel/launcher/execl.py -> build/bdist.linux-x86_64/egg/tutel/launcher
copying build/lib.linux-x86_64-3.9/tutel/launcher/__init__.py -> build/bdist.linux-x86_64/egg/tutel/launcher
copying build/lib.linux-x86_64-3.9/tutel/launcher/run.py -> build/bdist.linux-x86_64/egg/tutel/launcher
copying build/lib.linux-x86_64-3.9/tutel/system.py -> build/bdist.linux-x86_64/egg/tutel
creating build/bdist.linux-x86_64/egg/tutel/gates
copying build/lib.linux-x86_64-3.9/tutel/gates/cosine_top.py -> build/bdist.linux-x86_64/egg/tutel/gates
copying build/lib.linux-x86_64-3.9/tutel/gates/__init__.py -> build/bdist.linux-x86_64/egg/tutel/gates
copying build/lib.linux-x86_64-3.9/tutel/gates/top.py -> build/bdist.linux-x86_64/egg/tutel/gates
creating build/bdist.linux-x86_64/egg/tutel/custom
copying build/lib.linux-x86_64-3.9/tutel/custom/__init__.py -> build/bdist.linux-x86_64/egg/tutel/custom
copying build/lib.linux-x86_64-3.9/tutel/__init__.py -> build/bdist.linux-x86_64/egg/tutel
creating build/bdist.linux-x86_64/egg/tutel/jit_kernels
copying build/lib.linux-x86_64-3.9/tutel/jit_kernels/sparse.py -> build/bdist.linux-x86_64/egg/tutel/jit_kernels
copying build/lib.linux-x86_64-3.9/tutel/jit_kernels/gating.py -> build/bdist.linux-x86_64/egg/tutel/jit_kernels
copying build/lib.linux-x86_64-3.9/tutel/jit_kernels/__init__.py -> build/bdist.linux-x86_64/egg/tutel/jit_kernels
creating build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.9/tutel/examples/moe_cifar10.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.9/tutel/examples/helloworld.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.9/tutel/examples/helloworld_from_scratch.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.9/tutel/examples/helloworld_amp.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.9/tutel/examples/helloworld_ddp.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.9/tutel/examples/__init__.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.9/tutel/examples/moe_mnist.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.9/tutel/examples/helloworld_ddp_tutel.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.9/tutel/examples/helloworld_deepspeed.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.9/tutel_custom_kernel.cpython-39-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
byte-compiling build/bdist.linux-x86_64/egg/tutel/net.py to net.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/jit.py to jit.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/fast_dispatch.py to fast_dispatch.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/communicate.py to communicate.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/jit_compiler.py to jit_compiler.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/losses.py to losses.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/overlap.py to overlap.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/moe_layer.py to moe_layer.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/torch/config.py to config.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/torch/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/torch/executor.py to executor.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/solver.py to solver.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/spmdx.py to spmdx.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/patterns.py to patterns.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/moe.py to moe.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/experts/ffn.py to ffn.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/experts/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/launcher/execl.py to execl.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/launcher/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/launcher/run.py to run.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/system.py to system.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/gates/cosine_top.py to cosine_top.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/gates/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/gates/top.py to top.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/custom/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/jit_kernels/sparse.py to sparse.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/jit_kernels/gating.py to gating.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/jit_kernels/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/moe_cifar10.py to moe_cifar10.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld.py to helloworld.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_from_scratch.py to helloworld_from_scratch.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_amp.py to helloworld_amp.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_ddp.py to helloworld_ddp.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/__init__.py to __init__.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/moe_mnist.py to moe_mnist.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_ddp_tutel.py to helloworld_ddp_tutel.cpython-39.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_deepspeed.py to helloworld_deepspeed.cpython-39.pyc
creating stub loader for tutel_custom_kernel.cpython-39-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/tutel_custom_kernel.py to tutel_custom_kernel.cpython-39.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
__pycache__.tutel_custom_kernel.cpython-39: module references __file__
tutel.jit_kernels.__pycache__.gating.cpython-39: module references __file__
creating 'dist/tutel-0.1-py3.9-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing tutel-0.1-py3.9-linux-x86_64.egg
creating /mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel-0.1-py3.9-linux-x86_64.egg
Extracting tutel-0.1-py3.9-linux-x86_64.egg to /mnt/lustre/bli/.local/lib/python3.9/site-packages
Adding tutel 0.1 to easy-install.pth file

Installed /mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel-0.1-py3.9-linux-x86_64.egg
Processing dependencies for tutel==0.1
Finished processing dependencies for tutel==0.1
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
1.12.0+cu113
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/mnt/lustre/bli/anaconda3/envs/scale/bin/../libexec/gcc/x86_64-unknown-linux-gnu/5.4.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ./configure --prefix=/home/conda/miniconda2/conda-bld/gcc_1497274051834/_b_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho --with-gxx-include-dir=/home/conda/miniconda2/conda-bld/gcc_1497274051834/_b_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/gcc/include/c++ --enable-checking=release --enable-languages=c,c++,fortran --disable-multilib
Thread model: posix
gcc version 5.4.0 (GCC)
Luodian commented 2 years ago

I also tried installing previous builds and releases, but the problem is still there. Is that because my system environment has changed? Thanks for your kind help!

ghostplant commented 2 years ago

In a fresh Python shell, can you try the following code?

>>> import torch
>>> import tutel_custom_kernel
Luodian commented 2 years ago

It fails 😢


Python 3.9.12 (main, Jun  1 2022, 11:38:51)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import tutel_custom_kernel
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: /mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel-0.1-py3.9-linux-x86_64.egg/tutel_custom_kernel.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
ghostplant commented 2 years ago

@Luodian This error occurs because the previous uninstallation was not clean. Please repeat this command until pip reports that no previous tutel version is installed:

python3 -m pip uninstall tutel -y
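
To see whether stale copies are still around after uninstalling, one option is to scan the interpreter's search paths. The two tracebacks above point at different locations (the conda environment's site-packages and the `.local` user site), which is the kind of leftover this looks for. This is a minimal sketch, not part of Tutel or pip; `find_installed_copies` is a hypothetical helper and the filename patterns are assumptions about how the package is laid out on disk.

```python
import glob
import os
import site
import sys


def find_installed_copies(module_name, search_paths=None):
    """Return files or directories on the search paths that look like `module_name`."""
    if search_paths is None:
        # sys.path plus the per-user site directory, which holds --user installs.
        search_paths = list(sys.path) + [site.getusersitepackages()]
    hits = []
    seen = set()
    for base in search_paths:
        if not base or not os.path.isdir(base):
            continue
        base = os.path.realpath(base)
        if base in seen:
            continue  # skip directories listed more than once
        seen.add(base)
        patterns = [
            os.path.join(base, module_name),             # package directory
            os.path.join(base, module_name + ".py"),     # plain module or egg stub loader
            os.path.join(base, module_name + ".*.so"),   # compiled extension
            os.path.join(base, module_name + "-*.egg"),  # egg installed by setup.py install
        ]
        for pattern in patterns:
            hits.extend(sorted(glob.glob(pattern)))
    return hits


if __name__ == "__main__":
    for name in ("tutel", "tutel_custom_kernel"):
        for path in find_installed_copies(name):
            print(name, "->", path)
```

If this prints anything after the uninstall command reports nothing left to remove, those paths are candidates for manual cleanup.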
Luodian commented 2 years ago

Hi, thanks! I tried this, but the error still exists. I now realize it may be because I tried to install tutel in another conda environment without the --user flag, and that installation failed.

Now I am purging all the packages and reinstalling anaconda. Hope this would work.

Luodian commented 2 years ago

I reinstalled everything and it works fine now. Thanks for your patience; it helped me a lot.

ghostplant commented 2 years ago

Thanks for the information. I'll close this issue since it is solved.