Open Taskii-Lei opened 1 year ago
You are supposed to compile and install the CUDA module of FastMoE using setup.py.
I'm getting ModuleNotFoundError: No module named 'fmoe_cuda' when attempting to use fmoefy. I did install the CUDA module using setup.py as suggested, but the fmoe_cuda module does not seem to work regardless. Here are the relevant CUDA-related outputs when running the installation setup.
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:414: UserWarning: The detected CUDA version (12.2) has a minor version mismatch with the version that was used to compile PyTorch (12.1). Most likely this shouldn't be a problem.
  warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:424: UserWarning: There are no x86_64-linux-gnu-g++ version bounds defined for CUDA version 12.2
  warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
Are there specific CUDA-related requirements that I may be missing, or is there something I need to downgrade?
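To make the warning above concrete, here is a minimal sketch that classifies the difference between two CUDA version strings, such as the toolkit version reported by nvcc and the value of torch.version.cuda. The helper name classify_cuda_mismatch is hypothetical and not part of PyTorch or FastMoE. It only compares the major and minor numbers and says nothing about whether a particular combination actually builds.

```python
def classify_cuda_mismatch(detected, built_with):
    """Compare two CUDA version strings such as "12.2" and "12.1".

    Returns "major", "minor", or "none" depending on where they first differ.
    Hypothetical helper for illustration only.
    """
    d_major, d_minor = (int(x) for x in detected.split(".")[:2])
    b_major, b_minor = (int(x) for x in built_with.split(".")[:2])
    if d_major != b_major:
        return "major"
    if d_minor != b_minor:
        return "minor"
    return "none"


# Versions taken from the warning in the log above.
print(classify_cuda_mismatch("12.2", "12.1"))  # minor
```

With the versions from the log, the result is a minor mismatch, which is the case the warning itself describes as most likely harmless.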
I have not tried to compile the fmoe_cuda module with a different nvcc, so I am not sure whether you should downgrade. I think you should first check whether the fmoe_cuda module is compiled and accessible. There should be a fmoe_cuda.cpython-***.so in the site-packages/fastmoe* directory of your Python library directory.
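The check described above can be sketched in Python. This is only an illustration: the helper names find_fmoe_cuda_binaries and fmoe_cuda_importable are hypothetical, and the glob patterns assume the compiled file sits either directly in a directory on sys.path or inside a fastmoe* subdirectory of one. Your install layout may differ.

```python
import glob
import importlib.util
import os
import sys


def find_fmoe_cuda_binaries(search_dirs=None):
    """Return paths of compiled fmoe_cuda extension files under the given directories.

    Defaults to every existing directory on sys.path (this includes site-packages).
    """
    if search_dirs is None:
        search_dirs = [p for p in sys.path if os.path.isdir(p)]
    found = []
    for root in search_dirs:
        # Look directly in the directory and inside any fastmoe* subdirectory.
        for pattern in ("fmoe_cuda*.so", os.path.join("fastmoe*", "fmoe_cuda*.so")):
            found.extend(glob.glob(os.path.join(root, pattern)))
    return sorted(set(found))


def fmoe_cuda_importable():
    """Return True if this interpreter can locate a module named fmoe_cuda."""
    return importlib.util.find_spec("fmoe_cuda") is not None


if __name__ == "__main__":
    print("importable:", fmoe_cuda_importable())
    for path in find_fmoe_cuda_binaries():
        print("found:", path)
```

Run it with the same Python interpreter that launches the training script. If no file is found, the extension was probably not built or was installed into a different environment. If a file is found but the module is not importable, its directory is likely not on sys.path.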
Describe the bug
I adapted fmoe into Megatron following the tutorial and want to run a script to train GPT. But when I run pretrain_gpt.sh, it raises "ModuleNotFoundError: No module named 'fmoe_cuda'". In detail, I cloned the Megatron-LM repository and modified the functions mentioned in fastmoe/examples/megatron/fmoefy-v2.2.patch. Then I cloned fastmoe and put it in the Megatron folder as "./Megatron-LM/fastmoe" to avoid a possible ModuleNotFoundError. But when I run pretrain_gpt.sh, it still raises the error. I don't know much about module compilation, so I'm here to ask for your help. Thanks a lot!
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I expect it to train a moefy-Megatron model smoothly.
Logs