microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation
MIT License

INTERNAL ASSERT FAILED #203

Open Qicheng-WANG opened 1 year ago

Qicheng-WANG commented 1 year ago

Hi there. When I ran a quick test, "python3 -m tutel.examples.helloworld --batch_size=16", it failed with the following error: RuntimeError: (true) == (fp != nullptr) INTERNAL ASSERT FAILED at "/ssdisk2/tutel/tutel/custom/custom_kernel.cpp":46, please report a bug to PyTorch. CHECK_EQ fails. Could you help me fix it? Thanks.

Qicheng-WANG commented 1 year ago

It also showed the following: [image]. I am using an NVIDIA 3090 with CUDA 11.3.

ghostplant commented 1 year ago
  1. Does print(torch.cuda.get_arch_list()) include sm_86?
  2. Can you try export USE_NVRTC=1 before running the example?
  3. Are you sure there is no other old CUDA installed so that an old nvcc command was wrongly called for this compilation?
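The first check above can be scripted. This is a minimal sketch and not part of tutel: the helper names `capability_to_arch`, `arch_is_listed`, and `report_device_support` are hypothetical, and the exact-match comparison is a simplification that ignores binary and PTX forward compatibility. It relies only on `torch.cuda.is_available()`, `torch.cuda.get_device_capability()`, and `torch.cuda.get_arch_list()`.

```python
def capability_to_arch(major, minor):
    """Format a compute capability such as (8, 6) as 'sm_86'."""
    return f"sm_{major}{minor}"


def arch_is_listed(arch_list, major, minor):
    """Return True if the device architecture appears in the build's arch list."""
    return capability_to_arch(major, minor) in arch_list


def report_device_support():
    """Print whether GPU 0's architecture is in this PyTorch build's arch list."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
        return
    if not torch.cuda.is_available():
        print("CUDA is not available to PyTorch")
        return
    major, minor = torch.cuda.get_device_capability(0)
    archs = torch.cuda.get_arch_list()
    listed = arch_is_listed(archs, major, minor)
    print(f"device: {capability_to_arch(major, minor)}, "
          f"build archs: {archs}, listed: {listed}")


if __name__ == "__main__":
    report_device_support()
```

An RTX 3090 has compute capability 8.6, so the device string would be sm_86. If `listed` is False, the installed PyTorch build was not compiled for that GPU.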
monster119120 commented 1 year ago

> 1. Does print(torch.cuda.get_arch_list()) include sm_86?
> 2. Can you try export USE_NVRTC=1 before running the example?
> 3. Are you sure there is no other old CUDA installed so that an old nvcc command was wrongly called for this compilation?

Hi! I am running tutel on a Jetson Nano B01 (4GB version) and I hit the same problem: "RuntimeError: (true) == (fp != nullptr)INTERNAL ASSERT FAILED at "/ssdisk2/tutel/tutel/custom/custom_kernel.cpp".

On the Nano:

  1. print(torch.cuda.get_arch_list()) gives ['sm_53', 'sm_62', 'sm_72'].
  2. I tried export USE_NVRTC=1, but another error occurred.
  3. My nvcc version is 10.2.3.
ghostplant commented 1 year ago

This is a PyTorch + CUDA problem, not a tutel problem. You need a PyTorch build based on at least cu117/cu118 so that torch.cuda.get_arch_list() includes sm_86. You also need to update your CUDA SDK (e.g. to 12.0), since NVIDIA's newer GPUs are not compatible with older NVCC releases.

ghostplant commented 1 year ago

CUDA 10.2.3 is too old: it cannot support any GPU newer than the V100 generation (sm_7x). CUDA 11 should support A100-class GPUs and CUDA 12 should support H100-class GPUs. After upgrading the CUDA SDK, please also reinstall a PyTorch build based on at least cu118.
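To confirm which toolkit is actually picked up after an upgrade, the nvcc found on PATH can be compared with the CUDA version PyTorch was built against. The sketch below is illustrative only: `parse_nvcc_release` and `nvcc_on_path` are hypothetical helper names, and the parser assumes the `nvcc --version` output contains a "release X.Y" token.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(version_text):
    """Extract (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def nvcc_on_path():
    """Return (path, release) for the first nvcc on PATH, or (None, None)."""
    path = shutil.which("nvcc")
    if path is None:
        return None, None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    return path, parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    path, release = nvcc_on_path()
    print(f"nvcc path: {path}, release: {release}")
    try:
        import torch
        # torch.version.cuda is the CUDA version this PyTorch build targets.
        print(f"PyTorch built against CUDA: {torch.version.cuda}")
    except ImportError:
        print("PyTorch is not installed")
```

If the reported nvcc path points at an older installation than expected, an old toolkit earlier on PATH is a likely cause of the failed kernel compilation.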