microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation
MIT License

module 'tutel_custom_kernel' has no attribute 'inject_source' #132

Closed LisaWang0306 closed 2 years ago

LisaWang0306 commented 2 years ago

My CUDA version is 11.4 and my Python version is 3.6.5. Following the requirements, my torch and torchvision versions are torch==1.10.0+cu113 and torchvision==0.11.1+cu113. Then I ran:

$ git clone https://github.com/microsoft/tutel --branch v0.1.x
$ python ./tutel/setup.py install --user

and then the tutorial:

$ python ./tutel/examples/helloworld.py --batch_size=16

but I got the following error:

Traceback (most recent call last):
  File "./tutel/examples/helloworld.py", line 118, in <module>
    output = model(x)
  File "/home/fanj/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "./tutel/examples/helloworld.py", line 85, in forward
    result = self._moe_layer(input)
  File "/home/fanj/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/fanj/tutel/tutel/impls/moe_layer.py", line 424, in forward
    result_output, l_aux = self.gates[gate_index].apply_on_expert_fn(reshaped_input, self)
  File "/home/fanj/tutel/tutel/impls/moe_layer.py", line 73, in apply_on_expert_fn
    critical_data, l_loss = extract_critical(gates, self.top_k, self.capacity_factor, self.fp32_gate, self.batch_prioritized_routing)
  File "/home/fanj/tutel/tutel/impls/fast_dispatch.py", line 163, in extract_critical
    locations1 = compute_location(masks_se[0])
  File "/home/fanj/tutel/tutel/jit_kernels/gating.py", line 83, in fast_cumsum_sub_one
    return get_cumsum_kernel(int(data.size(0)), int(data.size(1)))(data)
  File "/home/fanj/tutel/tutel/jit_kernels/gating.py", line 68, in get_cumsum_kernel
    ''')
  File "/home/fanj/tutel/tutel/impls/jit_compiler.py", line 31, in generate_kernel
    return JitCompiler.create_raw(template)
  File "/home/fanj/tutel/tutel/impls/jit_compiler.py", line 21, in create_raw
    __ctx__ = tutel_custom_kernel.inject_source(source)
AttributeError: module 'tutel_custom_kernel' has no attribute 'inject_source'

Do you know how to solve this problem? Thank you very much!

ghostplant commented 2 years ago

I'm not sure whether the python and python3 commands point to the same Python runtime in your environment. Can you try installing Tutel with python3 ./tutel/setup.py install --user instead of python ./tutel/setup.py install --user?
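
One way to check the runtime question above is to ask the interpreter itself. The following is a minimal sketch, not part of Tutel: `describe_module` is a hypothetical helper that reports which interpreter is running, where a module would be loaded from, and whether it exposes a given attribute. Run it once with `python` and once with `python3` and compare the results. Note that a PyTorch extension may need `import torch` first, so an import failure is recorded rather than raised.

```python
import importlib.util
import sys


def describe_module(name, attr):
    """Report the running interpreter and where `name` would be loaded from.

    Hypothetical diagnostic helper; it is not part of Tutel.
    """
    spec = importlib.util.find_spec(name)
    info = {
        "interpreter": sys.executable,
        "found": spec is not None,
        "origin": getattr(spec, "origin", None),
        "has_attr": False,
        "import_error": None,
    }
    if spec is not None:
        try:
            module = importlib.import_module(name)
            info["has_attr"] = hasattr(module, attr)
        except Exception as exc:  # e.g. the extension needs `import torch` first
            info["import_error"] = repr(exc)
    return info


if __name__ == "__main__":
    print(describe_module("tutel_custom_kernel", "inject_source"))
```

If the two commands report different `interpreter` or `origin` values, the install and the run are using different environments.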

LisaWang0306 commented 2 years ago

Sorry, I made a mistake: for the tutorial I still ran python ./tutel/examples/helloworld.py --batch_size=16. In that case, do you have any idea why this error comes up? Thanks very much for your reply!!

ghostplant commented 2 years ago

First, can you run "python -m pip uninstall tutel" repeatedly to make sure it is fully removed?

Then, can you run the installation command python ./tutel/setup.py install --user again and share its output log? Thanks!

LisaWang0306 commented 2 years ago

Thanks for your suggestions! I ran python -m pip uninstall tutel three times, and it now shows WARNING: Skipping tutel as it is not installed. The complete output log of the installation command python ./tutel/setup.py install --user is as follows:

(en1) fanj@worker124:~$ python ./tutel/setup.py install --user
running install
running bdist_egg
running egg_info
writing tutel.egg-info/PKG-INFO
writing dependency_links to tutel.egg-info/dependency_links.txt
writing requirements to tutel.egg-info/requires.txt
writing top-level names to tutel.egg-info/top_level.txt
/home/fanj/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py:381: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
  warnings.warn(msg.format('we could not find ninja.'))
reading manifest file 'tutel.egg-info/SOURCES.txt'
writing manifest file 'tutel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
creating build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-3.6/tutel_custom_kernel.cpython-36m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/tutel
copying build/lib.linux-x86_64-3.6/tutel/__init__.py -> build/bdist.linux-x86_64/egg/tutel
creating build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.6/tutel/parted/__init__.py -> build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.6/tutel/parted/solver.py -> build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.6/tutel/parted/patterns.py -> build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.6/tutel/parted/spmdx.py -> build/bdist.linux-x86_64/egg/tutel/parted
creating build/bdist.linux-x86_64/egg/tutel/parted/backend
copying build/lib.linux-x86_64-3.6/tutel/parted/backend/__init__.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend
creating build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
copying build/lib.linux-x86_64-3.6/tutel/parted/backend/torch/__init__.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
copying build/lib.linux-x86_64-3.6/tutel/parted/backend/torch/executor.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
copying build/lib.linux-x86_64-3.6/tutel/parted/backend/torch/config.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
creating build/bdist.linux-x86_64/egg/tutel/jit_kernels
copying build/lib.linux-x86_64-3.6/tutel/jit_kernels/__init__.py -> build/bdist.linux-x86_64/egg/tutel/jit_kernels
copying build/lib.linux-x86_64-3.6/tutel/jit_kernels/sparse.py -> build/bdist.linux-x86_64/egg/tutel/jit_kernels
copying build/lib.linux-x86_64-3.6/tutel/jit_kernels/gating.py -> build/bdist.linux-x86_64/egg/tutel/jit_kernels
copying build/lib.linux-x86_64-3.6/tutel/moe.py -> build/bdist.linux-x86_64/egg/tutel
copying build/lib.linux-x86_64-3.6/tutel/system_init.py -> build/bdist.linux-x86_64/egg/tutel
creating build/bdist.linux-x86_64/egg/tutel/custom
copying build/lib.linux-x86_64-3.6/tutel/custom/__init__.py -> build/bdist.linux-x86_64/egg/tutel/custom
creating build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.6/tutel/impls/__init__.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.6/tutel/impls/fast_dispatch.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.6/tutel/impls/jit_compiler.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.6/tutel/impls/communicate.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.6/tutel/impls/moe_layer.py -> build/bdist.linux-x86_64/egg/tutel/impls
creating build/bdist.linux-x86_64/egg/tutel/launcher
copying build/lib.linux-x86_64-3.6/tutel/launcher/__init__.py -> build/bdist.linux-x86_64/egg/tutel/launcher
copying build/lib.linux-x86_64-3.6/tutel/launcher/execl.py -> build/bdist.linux-x86_64/egg/tutel/launcher
copying build/lib.linux-x86_64-3.6/tutel/launcher/run.py -> build/bdist.linux-x86_64/egg/tutel/launcher
creating build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/__init__.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/helloworld_deepspeed.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/helloworld.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/helloworld_megatron.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/helloworld_ddp.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/helloworld_amp.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/helloworld_sharded_experts.py -> build/bdist.linux-x86_64/egg/tutel/examples
byte-compiling build/bdist.linux-x86_64/egg/tutel/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/solver.py to solver.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/patterns.py to patterns.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/spmdx.py to spmdx.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/torch/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/torch/executor.py to executor.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/torch/config.py to config.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/jit_kernels/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/jit_kernels/sparse.py to sparse.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/jit_kernels/gating.py to gating.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/moe.py to moe.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/system_init.py to system_init.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/custom/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/fast_dispatch.py to fast_dispatch.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/jit_compiler.py to jit_compiler.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/communicate.py to communicate.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/moe_layer.py to moe_layer.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/launcher/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/launcher/execl.py to execl.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/launcher/run.py to run.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_deepspeed.py to helloworld_deepspeed.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld.py to helloworld.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_megatron.py to helloworld_megatron.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_ddp.py to helloworld_ddp.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_amp.py to helloworld_amp.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_sharded_experts.py to helloworld_sharded_experts.cpython-36.pyc
creating stub loader for tutel_custom_kernel.cpython-36m-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/tutel_custom_kernel.py to tutel_custom_kernel.cpython-36.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
__pycache__.tutel_custom_kernel.cpython-36: module references __file__
creating 'dist/tutel-0.1-py3.6-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing tutel-0.1-py3.6-linux-x86_64.egg
creating /home/fanj/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg
Extracting tutel-0.1-py3.6-linux-x86_64.egg to /home/fanj/.local/lib/python3.6/site-packages
Adding tutel 0.1 to easy-install.pth file

Installed /home/fanj/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg
Processing dependencies for tutel==0.1
Finished processing dependencies for tutel==0.1

Thanks!

ghostplant commented 2 years ago

I see the old file "helloworld_sharded_experts.py" in the logs, which indicates that some of the installed code is not the latest, and I don't see the C++ code used by tutel_custom_kernel being built. Most likely this is an environment issue with pip or setuptools.

Can you try the following 2 options and check whether either of them works?

Option 1 - Do a clean install of Tutel directly from the repository:

# Get Rid of Environmental Issues
$ python -m pip install --upgrade pip setuptools
$ python -m pip uninstall tutel -y
$ python -m pip uninstall tutel_custom_kernel -y

# Clean Install from Repo
$ python -m pip install --user git+https://github.com/microsoft/tutel@v0.1.x

# Test
$ python -m tutel.examples.helloworld

Option 2 - Clean up the earlier build cache to avoid environment problems:

# Get Rid of Environmental Issues
$ python -m pip install --upgrade pip setuptools
$ python -m pip uninstall tutel -y
$ python -m pip uninstall tutel_custom_kernel -y

# Clean Install from Local
$ rm -r ./tutel/dist ./tutel/build
$ python ./tutel/setup.py install --user

# Test
$ python -m tutel.examples.helloworld
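
Before reinstalling, it can help to see which leftover copies are still visible to the interpreter. The sketch below is an illustration under assumptions, not a Tutel tool: `find_installed_copies` is a hypothetical helper that lists every entry whose name starts with a given prefix in the directories on `sys.path` plus the user site-packages directory.

```python
import glob
import os
import site
import sys


def find_installed_copies(prefix, search_dirs=None):
    """Return sorted paths in the search dirs whose basename starts with `prefix`.

    Hypothetical helper; by default it scans sys.path and the user site dir.
    """
    if search_dirs is None:
        search_dirs = set(p for p in sys.path if os.path.isdir(p))
        search_dirs.add(site.getusersitepackages())
    hits = set()
    for directory in search_dirs:
        # glob.escape keeps special characters in the directory name literal.
        pattern = os.path.join(glob.escape(directory), prefix + "*")
        hits.update(glob.glob(pattern))
    return sorted(hits)


if __name__ == "__main__":
    for path in find_installed_copies("tutel"):
        print(path)
```

Any egg directory or tutel_custom_kernel file that is still listed after the uninstall commands is a candidate for the stale code described above.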

LisaWang0306 commented 2 years ago

Thanks very much for your help!!! I will try it. I have another question, could you please have a look? I switched to another system and tried again. It seems the header file nccl.h is missing there. During installation, it shows:

./tutel/custom/custom_kernel.cpp:20:10: fatal error: nccl.h: No such file or directory
   20 | #include <nccl.h>
      |          ^~~~~~~~
compilation terminated.
Try installing without NCCL extension..

Will this error affect running Tutel afterwards? When I try to run python -m tutel.examples.helloworld, the following error occurs:

Traceback (most recent call last):
  File "/home/wangp/anaconda3/envs/en/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/wangp/anaconda3/envs/en/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/examples/helloworld.py", line 121, in <module>
    output = model(x)
  File "/home/wangp/anaconda3/envs/en/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/examples/helloworld.py", line 86, in forward
    result = self._moe_layer(input)
  File "/home/wangp/anaconda3/envs/en/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/impls/moe_layer.py", line 481, in forward
    result_output, l_aux = self.gates[gate_index].apply_on_expert_fn(reshaped_input, self)
  File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/impls/moe_layer.py", line 111, in apply_on_expert_fn
    locations1 = self.compute_location(masks_se[0])
  File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/jit_kernels/gating.py", line 83, in fast_cumsum_sub_one
    return get_cumsum_kernel(int(data.size(0)), int(data.size(1)))(data)
  File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/jit_kernels/gating.py", line 72, in optimized_cumsum
    base_kernel(mask1.to(torch.int32).contiguous(), locations1)
  File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/impls/jit_compiler.py", line 24, in func
    tutel_custom_kernel.invoke(inputs, extra, __ctx__)
RuntimeError: (0) == (cuModuleLoadDataEx(&hMod, image.c_str(), sizeof(options) / sizeof(*options), options, values))INTERNAL ASSERT FAILED at "./tutel/custom/custom_kernel.cpp":185, please report a bug to PyTorch. CHECK_EQ fails.

ghostplant commented 2 years ago

@LisaWang0306 That missing-dependency error only causes the NCCL-related optimization to be skipped, so it shouldn't be related to the second error. Does either of these commands work?

  1. FAST_CUMSUM=0 python -m tutel.examples.helloworld
  2. USE_NVRTC=0 python -m tutel.examples.helloworld

This will help determine which option triggers your issue, since I cannot reproduce it in any of our environments. Thanks!
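
The two commands above can also be tried in one go. This is a minimal sketch, assuming only that the example reads FAST_CUMSUM and USE_NVRTC from the environment as the commands imply; `try_toggles` is a hypothetical helper that runs a command once per setting and records the exit code.

```python
import os
import subprocess
import sys


def try_toggles(command, toggles):
    """Run `command` once per (name, value) pair; return {"NAME=VALUE": exit code}."""
    results = {}
    for name, value in toggles:
        env = dict(os.environ)  # copy, so the parent environment is unchanged
        env[name] = value
        completed = subprocess.run(command, env=env)
        results[f"{name}={value}"] = completed.returncode
    return results


if __name__ == "__main__":
    cmd = [sys.executable, "-m", "tutel.examples.helloworld"]
    toggles = [("FAST_CUMSUM", "0"), ("USE_NVRTC", "0")]
    for setting, code in try_toggles(cmd, toggles).items():
        print(setting, "ok" if code == 0 else f"failed (exit {code})")
```

Each setting is applied on its own, so a single passing line points at the option that triggers the failure.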

LisaWang0306 commented 2 years ago

Finally! USE_NVRTC=0 python -m tutel.examples.helloworld works! Thanks!!!!!!!

ghostplant commented 2 years ago

Thanks for the information. This looks like a bug in NVRTC. We may consider setting USE_NVRTC=0 by default, since applications cannot guarantee that CUDA's NVRTC is stable. Are there any other issues?
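
Until such a default exists, a script can opt in to the workaround itself. The sketch below is an assumption-laden illustration: `apply_env_defaults` is a hypothetical helper, and it only has an effect if the variable is set before the library reads it, which this thread does not confirm in detail.

```python
import os


def apply_env_defaults(defaults, environ=os.environ):
    """Set each variable only if it is not already set, so user choices win."""
    for name, value in defaults.items():
        environ.setdefault(name, value)
    return environ


# Assumption: this must run before the library that reads USE_NVRTC is
# imported or used; an explicit USE_NVRTC=1 from the shell is left untouched.
apply_env_defaults({"USE_NVRTC": "0"})
```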

LisaWang0306 commented 2 years ago

There are no more issues left. Thanks!