Closed LisaWang0306 closed 2 years ago
I'm not sure whether the python and python3 commands point to the same Python runtime in your environment.
Can you try using
python3 ./tutel/setup.py install --user
instead of
python ./tutel/setup.py install --user
to install tutel?
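One way to check whether the two commands resolve to the same runtime is sketched below. It is only a minimal illustration: the helper name resolve_interpreter is made up for this example, and it assumes both commands, if present, can run a one-line script.

```python
import shutil
import subprocess
import sys


def resolve_interpreter(command):
    """Return (executable, version) for an interpreter command, or None if not on PATH."""
    path = shutil.which(command)
    if path is None:
        return None
    # Ask the interpreter itself which binary and version it really is.
    out = subprocess.check_output(
        [path, "-c", "import sys; print(sys.executable); print(sys.version.split()[0])"],
        universal_newlines=True,
    )
    executable, version = out.strip().splitlines()
    return executable, version


if __name__ == "__main__":
    # If the two results differ, packages installed with one command
    # will not be visible to the other.
    for name in ("python", "python3"):
        print(name, "->", resolve_interpreter(name))
```

If the two lines printed differ, the install and the run are likely using different site-packages directories.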
Sorry, I made a mistake: for the tutorial I still used
python ./tutel/examples/helloworld.py --batch_size=16
to run it.
In this case, do you have any idea why this error came up?
Thanks very much for your reply!!
Firstly, can you run "python -m pip uninstall tutel" several times to make sure it is fully removed?
Then, can you run the installation command
python ./tutel/setup.py install --user
and share its output logs? Thanks!
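As a complement to repeated uninstalls, a small check like the following can show whether Python can still find the package and from where. This is a hedged sketch: module_locations is a hypothetical helper name, and it only inspects the import path of the interpreter that runs it.

```python
import importlib.util


def module_locations(name):
    """Return where a top-level module would be imported from, or None if absent."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    if spec.origin not in (None, "namespace"):
        return [spec.origin]
    # Namespace packages (e.g. a bare source directory) have no single origin.
    return list(spec.submodule_search_locations or [])


if __name__ == "__main__":
    # None means this interpreter can no longer import the module.
    for name in ("tutel", "tutel_custom_kernel"):
        print(name, "->", module_locations(name))
```

Note that running this from a directory that contains the cloned ./tutel folder may report that folder rather than an installed copy.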
Thanks for your suggestions!
I ran
python -m pip uninstall tutel
three times, and it now shows WARNING: Skipping tutel as it is not installed.
The complete output log of the installation command
python ./tutel/setup.py install --user
is as follows:
(en1) fanj@worker124:~$ python ./tutel/setup.py install --user
running install
running bdist_egg
running egg_info
writing tutel.egg-info/PKG-INFO
writing dependency_links to tutel.egg-info/dependency_links.txt
writing requirements to tutel.egg-info/requires.txt
writing top-level names to tutel.egg-info/top_level.txt
/home/fanj/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py:381: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
warnings.warn(msg.format('we could not find ninja.'))
reading manifest file 'tutel.egg-info/SOURCES.txt'
writing manifest file 'tutel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
creating build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-3.6/tutel_custom_kernel.cpython-36m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/tutel
copying build/lib.linux-x86_64-3.6/tutel/__init__.py -> build/bdist.linux-x86_64/egg/tutel
creating build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.6/tutel/parted/__init__.py -> build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.6/tutel/parted/solver.py -> build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.6/tutel/parted/patterns.py -> build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.6/tutel/parted/spmdx.py -> build/bdist.linux-x86_64/egg/tutel/parted
creating build/bdist.linux-x86_64/egg/tutel/parted/backend
copying build/lib.linux-x86_64-3.6/tutel/parted/backend/__init__.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend
creating build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
copying build/lib.linux-x86_64-3.6/tutel/parted/backend/torch/__init__.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
copying build/lib.linux-x86_64-3.6/tutel/parted/backend/torch/executor.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
copying build/lib.linux-x86_64-3.6/tutel/parted/backend/torch/config.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
creating build/bdist.linux-x86_64/egg/tutel/jit_kernels
copying build/lib.linux-x86_64-3.6/tutel/jit_kernels/__init__.py -> build/bdist.linux-x86_64/egg/tutel/jit_kernels
copying build/lib.linux-x86_64-3.6/tutel/jit_kernels/sparse.py -> build/bdist.linux-x86_64/egg/tutel/jit_kernels
copying build/lib.linux-x86_64-3.6/tutel/jit_kernels/gating.py -> build/bdist.linux-x86_64/egg/tutel/jit_kernels
copying build/lib.linux-x86_64-3.6/tutel/moe.py -> build/bdist.linux-x86_64/egg/tutel
copying build/lib.linux-x86_64-3.6/tutel/system_init.py -> build/bdist.linux-x86_64/egg/tutel
creating build/bdist.linux-x86_64/egg/tutel/custom
copying build/lib.linux-x86_64-3.6/tutel/custom/__init__.py -> build/bdist.linux-x86_64/egg/tutel/custom
creating build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.6/tutel/impls/__init__.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.6/tutel/impls/fast_dispatch.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.6/tutel/impls/jit_compiler.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.6/tutel/impls/communicate.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.6/tutel/impls/moe_layer.py -> build/bdist.linux-x86_64/egg/tutel/impls
creating build/bdist.linux-x86_64/egg/tutel/launcher
copying build/lib.linux-x86_64-3.6/tutel/launcher/__init__.py -> build/bdist.linux-x86_64/egg/tutel/launcher
copying build/lib.linux-x86_64-3.6/tutel/launcher/execl.py -> build/bdist.linux-x86_64/egg/tutel/launcher
copying build/lib.linux-x86_64-3.6/tutel/launcher/run.py -> build/bdist.linux-x86_64/egg/tutel/launcher
creating build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/__init__.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/helloworld_deepspeed.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/helloworld.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/helloworld_megatron.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/helloworld_ddp.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/helloworld_amp.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/helloworld_sharded_experts.py -> build/bdist.linux-x86_64/egg/tutel/examples
byte-compiling build/bdist.linux-x86_64/egg/tutel/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/solver.py to solver.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/patterns.py to patterns.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/spmdx.py to spmdx.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/torch/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/torch/executor.py to executor.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/torch/config.py to config.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/jit_kernels/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/jit_kernels/sparse.py to sparse.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/jit_kernels/gating.py to gating.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/moe.py to moe.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/system_init.py to system_init.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/custom/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/fast_dispatch.py to fast_dispatch.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/jit_compiler.py to jit_compiler.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/communicate.py to communicate.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/moe_layer.py to moe_layer.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/launcher/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/launcher/execl.py to execl.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/launcher/run.py to run.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_deepspeed.py to helloworld_deepspeed.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld.py to helloworld.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_megatron.py to helloworld_megatron.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_ddp.py to helloworld_ddp.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_amp.py to helloworld_amp.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_sharded_experts.py to helloworld_sharded_experts.cpython-36.pyc
creating stub loader for tutel_custom_kernel.cpython-36m-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/tutel_custom_kernel.py to tutel_custom_kernel.cpython-36.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
__pycache__.tutel_custom_kernel.cpython-36: module references __file__
creating 'dist/tutel-0.1-py3.6-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing tutel-0.1-py3.6-linux-x86_64.egg
creating /home/fanj/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg
Extracting tutel-0.1-py3.6-linux-x86_64.egg to /home/fanj/.local/lib/python3.6/site-packages
Adding tutel 0.1 to easy-install.pth file
Installed /home/fanj/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg
Processing dependencies for tutel==0.1
Finished processing dependencies for tutel==0.1
Thanks!
I see the old file "helloworld_sharded_experts.py" in the logs, which indicates that some of this code is not the latest, and I don't see the cpp code used by tutel_custom_kernel being built. Most likely it is an environmental issue from pip or setuptools.
Can you try the following 2 options to check whether either of them works?
Option 1 - Do a clean install of Tutel directly from the repo:
# Get Rid of Environmental Issues
$ python -m pip install --upgrade pip setuptools
$ python -m pip uninstall tutel -y
$ python -m pip uninstall tutel_custom_kernel -y
# Clean Install from Repo
$ python -m pip install --user git+https://github.com/microsoft/tutel@v0.1.x
# Test
$ python -m tutel.examples.helloworld
Option 2 - Clean up the earlier build cache to avoid environmental problems:
# Get Rid of Environmental Issues
$ python -m pip install --upgrade pip setuptools
$ python -m pip uninstall tutel -y
$ python -m pip uninstall tutel_custom_kernel -y
# Clean Install from Local
$ rm -r ./tutel/dist ./tutel/build
$ python ./tutel/setup.py install --user
# Test
$ python -m tutel.examples.helloworld
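For reference, the cache cleanup step in Option 2 can also be done from Python. The sketch below is only an illustration of that step; remove_build_dirs is a hypothetical helper, and the directory names build and dist are taken from the rm command above.

```python
import shutil
from pathlib import Path


def remove_build_dirs(repo_root, names=("build", "dist")):
    """Delete stale build output under repo_root; return the names actually removed."""
    removed = []
    for name in names:
        target = Path(repo_root) / name
        if target.is_dir():
            shutil.rmtree(str(target))
            removed.append(name)
    return removed


if __name__ == "__main__":
    # Assumes the repo was cloned into ./tutel, as in the commands above.
    print("removed:", remove_build_dirs("./tutel"))
```

Unlike rm -r, this does not fail when one of the directories is already missing.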
Thanks very much for your help!!! I will try it. I have another question; could you please have a look? I switched to another system and tried again. It seems I don't have the header file nccl.h. During installation, it shows:
./tutel/custom/custom_kernel.cpp:20:10: fatal error: nccl.h: No such file or directory
20 | #include <nccl.h>
| ^~~~~~~~
compilation terminated.
Try installing without NCCL extension..
Will this error affect the subsequent run? When I try to run
python -m tutel.examples.helloworld
the following error occurs:
Traceback (most recent call last):
File "/home/wangp/anaconda3/envs/en/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/wangp/anaconda3/envs/en/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/examples/helloworld.py", line 121, in <module>
output = model(x)
File "/home/wangp/anaconda3/envs/en/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/examples/helloworld.py", line 86, in forward
result = self._moe_layer(input)
File "/home/wangp/anaconda3/envs/en/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/impls/moe_layer.py", line 481, in forward
result_output, l_aux = self.gates[gate_index].apply_on_expert_fn(reshaped_input, self)
File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/impls/moe_layer.py", line 111, in apply_on_expert_fn
locations1 = self.compute_location(masks_se[0])
File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/jit_kernels/gating.py", line 83, in fast_cumsum_sub_one
return get_cumsum_kernel(int(data.size(0)), int(data.size(1)))(data)
File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/jit_kernels/gating.py", line 72, in optimized_cumsum
base_kernel(mask1.to(torch.int32).contiguous(), locations1)
File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/impls/jit_compiler.py", line 24, in func
tutel_custom_kernel.invoke(inputs, extra, __ctx__)
RuntimeError: (0) == (cuModuleLoadDataEx(&hMod, image.c_str(), sizeof(options) / sizeof(*options), options, values))INTERNAL ASSERT FAILED at "./tutel/custom/custom_kernel.cpp":185, please report a bug to PyTorch. CHECK_EQ fails.
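To confirm whether nccl.h is present anywhere the compiler might look, a quick search such as the one below may help. It is a sketch under assumptions: find_header is a hypothetical helper, and the directories listed are common locations that may not match this system.

```python
import os


def find_header(header, search_dirs):
    """Return the full paths under search_dirs where the header file exists."""
    return [
        os.path.join(d, header)
        for d in search_dirs
        if os.path.isfile(os.path.join(d, header))
    ]


if __name__ == "__main__":
    # Assumed locations; adjust for the actual CUDA/NCCL install.
    candidates = ["/usr/include", "/usr/local/include", "/usr/local/cuda/include"]
    # CPATH is also searched by gcc for headers, if it is set.
    candidates += [p for p in os.environ.get("CPATH", "").split(os.pathsep) if p]
    print(find_header("nccl.h", candidates))
```

An empty list only means the header is not in the directories that were checked.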
@LisaWang0306 That missing-dependency error just causes the NCCL-related optimization to be skipped, so it shouldn't be related to the next error. Does either of these commands work?
FAST_CUMSUM=0 python -m tutel.examples.helloworld
USE_NVRTC=0 python -m tutel.examples.helloworld
This would help determine which option triggers your issue, since I cannot reproduce it in any of our environments. Thanks!
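The same comparison can be scripted so that each environment variable is tried in turn. This is a minimal sketch; run_with_env is a hypothetical helper, and it only reports exit codes without interpreting how tutel uses the variables.

```python
import os
import subprocess
import sys


def run_with_env(command, overrides):
    """Run command with extra environment variables; return its exit code."""
    env = dict(os.environ)
    env.update(overrides)
    return subprocess.call(command, env=env)


if __name__ == "__main__":
    example = [sys.executable, "-m", "tutel.examples.helloworld"]
    # Try the defaults first, then each override on its own.
    for overrides in ({}, {"FAST_CUMSUM": "0"}, {"USE_NVRTC": "0"}):
        code = run_with_env(example, overrides)
        print(overrides or "defaults", "-> exit code", code)
```

An exit code of 0 for exactly one override points at the option that triggers the failure.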
Finally! USE_NVRTC=0 python -m tutel.examples.helloworld
works! Thanks!!!!!!!
Thanks for your information.
This should be a bug in NVRTC. Maybe we'll consider setting USE_NVRTC=0
by default, since applications cannot guarantee whether CUDA's NVRTC is stable or not.
Are there any other issues?
There are no more issues left. Thanks!
My CUDA version is 11.4 and my Python version is 3.6.5. Following the requirements, my torch and torchvision versions are torch==1.10.0+cu113 and torchvision==0.11.1+cu113. Then I ran
git clone https://github.com/microsoft/tutel --branch v0.1.x
python ./tutel/setup.py install --user
and then ran the tutorial:
python ./tutel/examples/helloworld.py --batch_size=16
but got the following error:
Do you know how to solve this problem? Thank you very much!
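When reporting an environment like this, the versions can be collected programmatically rather than typed by hand. The sketch below assumes nothing beyond the standard library plus an optional torch install; report_versions is a hypothetical helper name.

```python
import platform


def report_versions():
    """Collect the Python version and, if torch is installed, its version and CUDA build."""
    info = {"python": platform.python_version(), "torch": None, "torch_cuda": None}
    try:
        import torch
    except ImportError:
        # torch is not installed for this interpreter; leave its fields as None.
        return info
    info["torch"] = torch.__version__
    info["torch_cuda"] = torch.version.cuda  # None for CPU-only builds
    return info


if __name__ == "__main__":
    for key, value in report_versions().items():
        print(key, "=", value)
```

The CUDA version reported here is the one torch was built against, which can differ from the system CUDA toolkit version.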