laekov / fastmoe

A fast MoE impl for PyTorch
https://fastmoe.ai
Apache License 2.0
1.56k stars 188 forks

ModuleNotFoundError: No module named 'fmoe_cuda' #177

Open Taskii-Lei opened 1 year ago

Taskii-Lei commented 1 year ago

Describe the bug I adapted fmoe into Megatron following the tutorial and want to run a script to train GPT. But when I run pretrain_gpt.sh, it raises the error "ModuleNotFoundError: No module named 'fmoe_cuda'". In detail, I git cloned the Megatron-LM repository and modified the functions mentioned in fastmoe/examples/megatron/fmoefy-v2.2.patch. Then I git cloned fastmoe and put it inside the Megatron folder as "./Megatron-LM/fastmoe" to avoid the ModuleNotFoundError that might otherwise be raised. But when I run pretrain_gpt.sh, it still raises the error. I don't know much about module compilation, so I'm here to ask for your help. Thanks a lot!!

To Reproduce Steps to reproduce the behavior:

  1. Compile with "..."
  2. Run "Megatron-LM/pretrain_gpt.sh" with Linux processes on 1 node with 8 GPUs per node.

Expected behavior I expect the fmoefied Megatron to train smoothly.

Logs

File "/workspace/S/huanglei/Megatron-LM-moefy/fmoe/functions.py", line 9, in <module>
    import fmoe_cuda
ModuleNotFoundError: No module named 'fmoe_cuda'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3457189) of binary: /lustre/S/huanglei/CondaEnv/MoE/bin/python
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-06_15:01:34
  host      : r8a100-b01
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3457189)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
laekov commented 1 year ago

You are supposed to compile and install the CUDA module of fastmoe using setup.py.
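For reference, a typical build-and-install from the fastmoe source tree looks like the sketch below. The USE_NCCL flag is described in the fastmoe README for enabling the distributed-expert feature; treat the exact flags and paths as assumptions to adapt to your environment.

```shell
# Build and install fastmoe's CUDA extension into the active Python environment.
# Run this from inside the cloned fastmoe directory (sketch; adjust to your setup).
cd fastmoe
USE_NCCL=1 python setup.py install

# Verify the compiled module is importable before launching Megatron:
python -c "import fmoe_cuda; print('fmoe_cuda is installed')"
```

Simply placing the fastmoe source tree next to Megatron-LM is not enough, because fmoe_cuda is a compiled extension that only exists after this build step.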

a-adomavicius commented 3 weeks ago

I'm getting ModuleNotFoundError: No module named 'fmoe_cuda'

when attempting to use fmoefy. I did install the CUDA module using setup.py as suggested, but the fmoe_cuda module still does not seem to work. Here are the relevant CUDA-related warnings from running the installation setup.

/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:414: UserWarning: The detected CUDA version (12.2) has a minor version mismatch with the version that was used to compile PyTorch (12.1). Most likely this shouldn't be a problem.
  warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:424: UserWarning: There are no x86_64-linux-gnu-g++ version bounds defined for CUDA version 12.2
  warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')

Are there specific CUDA-related requirements that I may be missing/needing to downgrade?
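For context, the mismatch the warning refers to is between the system CUDA toolkit (12.2) and the CUDA version PyTorch was compiled against. The latter can be read programmatically; a minimal sketch that only assumes PyTorch may or may not be installed:

```python
def torch_cuda_version():
    """Return the CUDA version PyTorch was compiled against,
    or None if torch is unavailable or was built without CUDA."""
    try:
        import torch  # assumed present in the training environment
    except ImportError:
        return None
    return torch.version.cuda  # e.g. "12.1"

if __name__ == "__main__":
    print("PyTorch CUDA build version:", torch_cuda_version())
```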

laekov commented 3 weeks ago

I have not tried to compile the fmoe_cuda module with a different nvcc, so I am not sure whether you need to downgrade. I think you should first check whether the fmoe_cuda module was compiled and is accessible: there should be a fmoe_cuda.cpython-***.so in the site-packages/fastmoe* directory of your Python library directory.
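To make that check concrete, here is a small sketch using the standard importlib machinery to report where a module would be loaded from; the function name and the modules used in the example are illustrative, not part of fastmoe.

```python
import importlib.util

def locate_module(name: str):
    """Return the path a module would be loaded from, or None if not importable."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # For a compiled extension, this should end in .so,
    # e.g. fmoe_cuda.cpython-310-x86_64-linux-gnu.so
    return spec.origin

if __name__ == "__main__":
    path = locate_module("fmoe_cuda")
    if path is None:
        print("fmoe_cuda is NOT importable; rerun setup.py in the fastmoe source tree")
    else:
        print("fmoe_cuda found at", path)
```

If this prints a path outside site-packages (e.g. inside a source checkout without a built .so), the extension was never actually compiled and installed.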