microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation

Cannot compile tutel kernels and got runtime error #163

Closed hyhuang00 closed 2 years ago

hyhuang00 commented 2 years ago

I have installed tutel on my machine and have set up the related environment variables, such as $CUDA_HOME and $CFLAGS. However, when I try to run examples/hello_world.py, I get the following error:

[E custom_kernel.cpp:124] default_program(1): catastrophic error: cannot open source file "cuda_runtime.h"

1 catastrophic error detected in the compilation of "default_program". Compilation terminated. Failed to use NVRTC for JIT compilation in this Pytorch version, try another approach using CUDA compiler.. (To always disable NVRTC, please: export USE_NVRTC=0)

File "/private/home/hyhuang/.local/lib/python3.9/site-packages/tutel/impls/jit_compiler.py", line 26, in func tutel_custom_kernel.invoke(inputs, ctx) RuntimeError: (true) == (fp != nullptr)INTERNAL ASSERT FAILED at "/tmp/pip-req-build-pcbbciia/tutel/custom/custom_kernel.cpp":40, please report a bug to PyTorch. CHECK_EQ fails.

I am using PyTorch 1.10.1 + CUDA 11.3. Is there any other parameter I should set to use tutel?
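(Editor's note: the catastrophic error above means NVRTC could not find cuda_runtime.h on its include path. As a minimal sanity check, a sketch using only the standard library and not part of tutel, one can verify that $CUDA_HOME points at a toolkit that actually contains that header:)

```python
# Minimal environment sanity check (not part of tutel): verifies that
# CUDA_HOME points at a toolkit containing the header NVRTC failed to find.
import os

cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
header = os.path.join(cuda_home, "include", "cuda_runtime.h")

print("CUDA_HOME =", cuda_home)
print("cuda_runtime.h present:", os.path.isfile(header))
```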

ghostplant commented 2 years ago

Your CUDA environment does not seem to be installed in the default location (e.g. /usr/local/cuda/include). Can you print the value of CUDA_HOME? BTW, you can also try whether export USE_NVRTC=0 helps.
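(Editor's note: for the second suggestion, the equivalent from a Python entry point is a one-line sketch, under the assumption that tutel reads USE_NVRTC when it first JIT-compiles its kernels:)

```python
# Sketch: disable the NVRTC path before tutel compiles its kernels, so the
# CUDA-compiler fallback mentioned in the error message is used instead.
# Assumes tutel reads USE_NVRTC at kernel-compile time, as the thread suggests.
import os
os.environ["USE_NVRTC"] = "0"
```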

hyhuang00 commented 2 years ago

> Your CUDA environment does not seem to be installed in the default location (e.g. /usr/local/cuda/include). Can you print the value of CUDA_HOME? BTW, you can also try whether export USE_NVRTC=0 helps.

Thank you for your prompt reply! Yes, my CUDA environment is not installed in the default location because I'm using a shared compute cluster. Is there a parameter I can set to ensure the compiler finds the correct CUDA? I will also try export USE_NVRTC=0.

$ echo $CUDA_HOME
/public/apps/cuda/11.3

ghostplant commented 2 years ago

We just merged a PR that parses CUDA_HOME from the environment variable. Can you try whether it works for you?
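(Editor's note: for context, resolving the toolkit root from the environment typically looks roughly like the sketch below; this is illustrative only, not the merged PR's code, and the fallback path and error message are assumptions:)

```python
# Illustrative sketch (not the merged PR): resolve the CUDA toolkit root from
# the CUDA_HOME environment variable, falling back to the conventional default.
import os

def resolve_cuda_home(default="/usr/local/cuda"):
    cuda_home = os.environ.get("CUDA_HOME", default)
    include_dir = os.path.join(cuda_home, "include")
    if not os.path.isfile(os.path.join(include_dir, "cuda_runtime.h")):
        raise RuntimeError(
            "cuda_runtime.h not found under %s; set CUDA_HOME to your CUDA "
            "installation root" % include_dir
        )
    return cuda_home, include_dir
```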

hyhuang00 commented 2 years ago

Thank you! The fix works, and CUDA_HOME can now be found correctly. I can now successfully run hello_world.py and hello_world_ddp.py under the examples folder without any errors. However, when I tried to use it under fairseq (the use case is here), I got the following two errors:

File "/private/home/hyhuang/.local/lib/python3.9/site-packages/tutel/jit_kernels/gating.py", line 22, in fast_cumsum_sub_one return torch.ops.tutel_ops.cumsum(data) RuntimeError: (0) == (cuModuleLoadDataEx(&hMod, image.c_str(), sizeof(options) / sizeof(*options), options, values))INTERNAL ASSERT FAILED at "/tmp/pip-req-build-djl73tcc/tutel/custom/custom_kernel.cpp":214, please report a bug to PyTorch. CHECK_EQ fails. return torch.ops.tutel_ops.cumsum(data)


Would you be able to provide any suggestions on these two errors? I am confused, since this is the same environment I used to run the `hello_world.py` scripts.
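(Editor's note: the failing call is tutel's fast_cumsum_sub_one, which wraps a custom CUDA kernel. For debugging, a plain-PyTorch reference of what the name suggests it computes, an assumption based on the function name and useful only as a sanity check rather than a performance substitute, is:)

```python
# Plain-PyTorch reference for fast_cumsum_sub_one, assuming the name describes
# the semantics: a cumulative sum along dim 0 minus one. For sanity checks
# only; not a substitute for the optimized kernel.
import torch

def cumsum_sub_one_reference(data: torch.Tensor) -> torch.Tensor:
    return torch.cumsum(data, dim=0) - 1
```
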
ghostplant commented 2 years ago

You need to run unset USE_NVRTC, since you may have explicitly set that variable earlier.
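(Editor's note: in a shell session this is just unset USE_NVRTC; when launching through a Python wrapper, as fairseq does, the equivalent sketch, assuming the variable is read when the kernels are first compiled, is:)

```python
# Sketch: remove a stale USE_NVRTC=0 setting before tutel JIT-compiles its
# kernels (assumes the variable is read at kernel-compile time).
import os
os.environ.pop("USE_NVRTC", None)
```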

hyhuang00 commented 2 years ago

Thank you! That completely resolves this problem. Closing the issue.

ghostplant commented 2 years ago

@hyhuang00 Can you help us test whether the latest version (#170) still works in your environment? We removed the manual CUDA_HOME environment-variable detection, but the new approach should be compatible with different environments more robustly.

hyhuang00 commented 2 years ago

Sure, I'm happy to help. Let me try out the new version and I'll let you know if it works for me.

hyhuang00 commented 2 years ago

The new version works on my machine without any error. I installed the package via $ python3 -m pip install --user --upgrade git+https://github.com/microsoft/tutel@main
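(Editor's note: as a quick post-install check, a sketch that only confirms the package imports, under the assumption that the MoE layer is exposed as tutel.moe.moe_layer as in the bundled examples:)

```python
# Quick post-install smoke test: confirm the package imports and that the
# MoE layer entry point used by the examples is present.
from tutel import moe

print("moe_layer available:", hasattr(moe, "moe_layer"))
```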

ghostplant commented 2 years ago

Thanks!