Aaricis opened this issue 11 months ago (status: Open)
How about installing colossalai with `CUDA_EXT=1 pip install colossalai`?
Hi, have you found a solution to this problem? I am encountering the same problem with colossalai 0.3.2, torch 2.2.0.dev+cu121, CUDA 12.2.
Hi, ColossalAI does not support Torch 2.0 and above. Torch 1.13.1 is recommended.
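For anyone hitting this with a newer torch: below is a minimal, hypothetical pre-flight check, not part of ColossalAI. The `< 2.0` bound is taken from the comment above; the function name is mine. It parses a torch version string (including local builds like `2.2.0.dev+cu121`) and flags unsupported ones before training starts:

```python
# Sketch: gate on the installed torch version before launching training.
# Assumption (from the maintainer's comment): only torch < 2.0 is supported.

def torch_is_supported(version: str) -> bool:
    """Return True if `version` is below 2.0 (e.g. '1.13.1' -> True)."""
    # Drop the local-build suffix, e.g. '2.2.0.dev+cu121' -> '2.2.0.dev'
    release = version.split("+")[0]
    parts = []
    for piece in release.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        if not digits:
            break  # stop at non-numeric segments like 'dev'
        parts.append(int(digits))
    return tuple(parts) < (2, 0)

if __name__ == "__main__":
    for v in ("1.13.1", "2.2.0.dev+cu121"):
        print(v, "supported:", torch_is_supported(v))
```

In a real script you would feed it `torch.__version__` and raise early with a clear message instead of failing later inside the C++ extension build.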
Thanks! I heard Colossal-AI was tested on H800. What environment (CUDA, torch) was used?
🐛 Describe the bug
I got some errors when running the resnet example.
```
(colossal-AI) [root@node64 resnet]# colossalai run --nproc_per_node 1 train.py -c ./ckpt-fp32
[07/25/23 20:27:25] INFO colossalai - colossalai - INFO: /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
                    INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[07/25/23 20:27:28] INFO colossalai - colossalai - INFO: /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/context/parallel_context.py:558 set_seed
                    INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
                    INFO colossalai - colossalai - INFO: /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/initialize.py:115 launch
                    INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/booster/booster.py:69: UserWarning: The plugin will control the accelerator, so the device argument will be ignored.
  warnings.warn('The plugin will control the accelerator, so the device argument will be ignored.')
Files already downloaded and verified
Files already downloaded and verified
/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/utils/cpp_extension.py:329: UserWarning:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++ 4.8.5) may be ABI-incompatible with PyTorch!
Please use a compiler that is ABI-compatible with GCC 5.0 and above.
See https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html.
See https://gist.github.com/goldsborough/d466f43e8ffc948ff92de7486c5216d6 for instructions on how to install GCC 5 or higher.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  warnings.warn(ABI_INCOMPATIBILITY_WARNING.format(compiler))
Traceback (most recent call last):
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/kernel/op_builder/builder.py", line 161, in load
    op_module = self.import_op()
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/kernel/op_builder/builder.py", line 110, in import_op
    return importlib.import_module(self.prebuilt_import_path)
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'colossalai._C.cpu_adam'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
    subprocess.run(
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/z00621429/ColossalAI/examples/images/resnet/train.py", line 204, in <module>
    main()
  File "/home/z00621429/ColossalAI/examples/images/resnet/train.py", line 163, in main
    optimizer = HybridAdam(model.parameters(), lr=LEARNING_RATE)
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 82, in __init__
    cpu_optim = CPUAdamBuilder().load()
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/kernel/op_builder/builder.py", line 187, in load
    op_module = load(name=self.name,
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1144, in load
    return _jit_compile(
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1357, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1469, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1756, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam': [1/2] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/colossal-AI/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -std=c++14 -lcudart -lcublas -g -Wno-reorder -fopenmp -march=native -c /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/cpu_adam.cpp -o cpu_adam.o
FAILED: cpu_adam.o
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/colossal-AI/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -std=c++14 -lcudart -lcublas -g -Wno-reorder -fopenmp -march=native -c /root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/colossalai/kernel/cuda_native/csrc/cpu_adam.cpp -o cpu_adam.o
c++: error: unrecognized command line option ‘-std=c++14’
c++: error: unrecognized command line option ‘-std=c++14’
ninja: build stopped: subcommand failed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 18372) of binary: /root/anaconda3/envs/colossal-AI/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/colossal-AI/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/colossal-AI/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
```
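The root cause in the log is the compiler: the default `c++` on this box is GCC 4.8.5, which predates `-std=c++14` support, so the JIT build of the `cpu_adam` extension dies. Here is a small sketch that reproduces the failing check by compiling an empty translation unit with a given `-std` flag. The function name is mine, and it assumes only that some C++ compiler may (or may not) be on `PATH`:

```python
# Sketch: probe whether a compiler accepts a given -std flag, mirroring the
# 'unrecognized command line option -std=c++14' failure in the log above.
import os
import subprocess
import tempfile

def accepts_std_flag(compiler: str, std: str) -> bool:
    """Return True if `compiler` can compile a trivial file with -std=<std>."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "probe.cpp")
        obj = os.path.join(tmp, "probe.o")
        with open(src, "w") as f:
            f.write("int main() { return 0; }\n")
        try:
            result = subprocess.run(
                [compiler, f"-std={std}", "-c", src, "-o", obj],
                capture_output=True,
            )
        except FileNotFoundError:
            return False  # compiler not installed at all
        return result.returncode == 0

if __name__ == "__main__":
    print("c++ accepts -std=c++14:", accepts_std_flag("c++", "c++14"))
```

If this prints `False` on a machine like the one in the log, the usual fixes are installing a newer GCC (e.g. via CentOS devtoolset or `conda install gxx_linux-64`) and making sure `CXX`/`c++` resolve to it before retrying the build.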