microsoft / Tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation
MIT License

RuntimeError: (0) == (cuModuleLoadDataEx(&hMod, image.c_str(), sizeof(options) / sizeof(*options), options, values)) INTERNAL ASSERT FAILED #215

Closed jd730 closed 1 year ago

jd730 commented 1 year ago

Hi,

I installed Tutel via python3 -m pip install --user --upgrade git+https://github.com/microsoft/tutel@main. I am running this test script:

import torch
from tutel.jit_kernels.gating import fast_cumsum_sub_one

matrix = torch.randint(0, 100, (10000, 100), device='cuda')
cumsum_tutel = fast_cumsum_sub_one(matrix, dim=0) + 1

and I get the following error:

[W custom_kernel.cpp:149] nvrtc: error: invalid value for --gpu-architecture (-arch)
 Failed to use NVRTC for JIT compilation in this Pytorch version, try another approach using CUDA compiler.. (To always disable NVRTC, please: export USE_NVRTC=0)
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    cumsum_tutel = fast_cumsum_sub_one(matrix, dim=0) + 1
  File "/home/jdhwang/.local/lib/python3.8/site-packages/tutel/jit_kernels/gating.py", line 22, in fast_cumsum_sub_one
    return torch.ops.tutel_ops.cumsum(data)
  File "/home/jdhwang/conda/envs/cl/lib/python3.8/site-packages/torch/_ops.py", line 502, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: (0) == (cuModuleLoadDataEx(&hMod, image.c_str(), sizeof(options) / sizeof(*options), options, values)) INTERNAL ASSERT FAILED at "/tmp/pip-req-build-c9h2prbs/tutel/custom/custom_kernel.cpp":205, please report a bug to PyTorch. CHECK_EQ fails.

Following https://github.com/microsoft/tutel/issues/203, I set export USE_NVRTC=1. I am using an RTX 4090 with torch 2.0.0+cu117 and CUDA 11.7 (nvcc is 11.7 as well).

ghostplant commented 1 year ago

Does either export USE_NVRTC=1 or export USE_NVRTC=0 work? This looks like an environment problem (e.g. multiple CUDA versions installed / ..), and it is unlikely to happen if CUDA and PyTorch are in a clean Docker container.

jd730 commented 1 year ago

Hi @ghostplant, thank you for your fast response. If I set export USE_NVRTC=0, it says:

nvcc fatal   : Unsupported gpu architecture 'compute_89'
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    cumsum_tutel = fast_cumsum_sub_one(matrix, dim=0) + 1
  File "/home/jdhwang/.local/lib/python3.8/site-packages/tutel/jit_kernels/gating.py", line 22, in fast_cumsum_sub_one
    return torch.ops.tutel_ops.cumsum(data)
  File "/home/jdhwang/conda/envs/cl/lib/python3.8/site-packages/torch/_ops.py", line 502, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: (true) == (fp != nullptr) INTERNAL ASSERT FAILED at "/tmp/pip-req-build-c9h2prbs/tutel/custom/custom_kernel.cpp":49, please report a bug to PyTorch. CHECK_EQ fails.

I will test in a clean environment and try CUDA 11.8 as well.

jd730 commented 1 year ago

It works after upgrading torch (2.0.1+cu118), nvcc, and NCCL. Thank you!
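
For anyone hitting the same error: the nvcc message "Unsupported gpu architecture 'compute_89'" is consistent with the fix reported here, since the RTX 4090 has compute capability 8.9 and CUDA toolkits older than 11.8 cannot target it. Below is a minimal sketch for checking this before running Tutel's JIT kernels. The helper names (`parse_nvcc_release`, `toolkit_supports`, `check_environment`) and the small lookup table are illustrative assumptions, not part of Tutel or PyTorch; the table covers only a few architectures.

```python
import re
import shutil
import subprocess

# Minimum CUDA toolkit release able to target each compute capability.
# Illustrative and deliberately incomplete; extend as needed.
MIN_CUDA_FOR_ARCH = {
    (8, 0): (11, 0),  # A100
    (8, 6): (11, 1),  # RTX 30xx
    (8, 9): (11, 8),  # RTX 40xx (e.g. RTX 4090)
    (9, 0): (11, 8),  # H100
}


def parse_nvcc_release(text):
    """Extract (major, minor) from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def toolkit_supports(arch, cuda_version):
    """True/False if the arch is in the table, None if it is unknown."""
    minimum = MIN_CUDA_FOR_ARCH.get(arch)
    if minimum is None:
        return None
    return cuda_version >= minimum


def nvcc_release():
    """Return the (major, minor) release of the nvcc found on PATH, or None."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def check_environment():
    """Print the GPU arch and whether torch's CUDA and nvcc can target it."""
    import torch  # imported here so the helpers above work without torch

    arch = torch.cuda.get_device_capability()  # e.g. (8, 9)
    torch_cuda = tuple(int(p) for p in torch.version.cuda.split(".")[:2])
    print("GPU compute capability:", arch)
    print("torch CUDA", torch_cuda, "supports GPU:", toolkit_supports(arch, torch_cuda))
    nvcc = nvcc_release()
    if nvcc is None:
        print("nvcc not found on PATH")
    else:
        print("nvcc", nvcc, "supports GPU:", toolkit_supports(arch, nvcc))
```

Calling `check_environment()` on a machine with a GPU prints the detected versions. With the setup from the original report (compute capability 8.9, CUDA 11.7 for both torch and nvcc), `toolkit_supports` returns False for both, which matches the failure with either value of USE_NVRTC.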