eccct opened 1 day ago
Title: [DOC]: Environment installation failed
I ran the benchmark again; after bash gemini.sh the system stalled at the compilation stage for a long time. I used the command colossalai check -i to check the version compatibility of the current environment and the status of the CUDA extensions. Please help analyze the cause, thanks!
You should use BUILD_EXT=1 pip install .
and see if that compiles.
I tried BUILD_EXT=1 pip install . and it failed to build; please check the log files I uploaded. Thanks!
Attachments: pip install -r requirementsrequirements.txt, BUILD_EXT=1 pip install.txt
You should troubleshoot your issue from the following aspects (the provided log information is limited). First, check that there are no issues with your machine, for example, by running nvidia-smi to confirm the availability of the GPUs. Check environment variables such as CUDA_VISIBLE_DEVICES, and ensure that LD_LIBRARY_PATH and CUDA_HOME are pointing to the correct CUDA version.
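Those checks can be gathered in one go. The helper below is illustrative, not from the thread; it tolerates missing tools so it can run on any machine:

```shell
# Illustrative helper: collect the environment facts mentioned above.
check_cuda_env() {
    echo "== GPU visibility =="
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L
    else
        echo "nvidia-smi not found"
    fi
    echo "== Environment variables =="
    echo "CUDA_HOME=${CUDA_HOME:-unset}"
    echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-unset}"
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset (all GPUs visible)}"
    echo "== nvcc =="
    if command -v nvcc >/dev/null 2>&1; then
        nvcc --version | tail -n 1
    else
        echo "nvcc not on PATH"
    fi
}
check_cuda_env
```

If CUDA_HOME and LD_LIBRARY_PATH point at a different CUDA version than the one PyTorch was built against, extension builds commonly fail.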
Oh, I see: it is still compiling the JIT kernel ops. That genuinely takes some time, and the compilation simply hadn't finished yet.
Yesterday I ran " " instead of "CUDA_EXT=1 pip install ." and it built successfully. Then I ran the benchmark with "bash gemini.sh"; it took a long time without responding. I updated the ticket and Edenzzzz replied telling me to use BUILD_EXT=1 pip install . and see if that compiles. Then I ran "pip install -r requirementsrequirements" and it returned successfully. When I used "BUILD_EXT=1 pip install ." instead of "CUDA_EXT=1 pip install .", it failed to build. Please check the two uploaded files. Thanks!
(colo01) root@DESKTOP-5H0EB03:~/ColossalAI# echo $CUDA_HOME
/usr/local/cuda-12.1
(colo01) root@DESKTOP-5H0EB03:~/ColossalAI# echo $PATH /root/miniconda3/envs/colo01/bin:/usr/local/cuda-12.1/bin:/root/miniconda3/envs/colo01/bin:/root/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/wsl/lib:/mnt/c/Windows/system32:/mnt/c/Windows:/mnt/c/Windows/System32/Wbem:/mnt/c/Windows/System32/WindowsPowerShell/v1.0:/mnt/c/Windows/System32/OpenSSH:/mnt/c/Program Files (x86)/NVIDIA Corporation/PhysX/Common:/mnt/c/Program Files/NVIDIA Corporation/NVIDIA NvDLISR:/mnt/c/WINDOWS/system32:/mnt/c/WINDOWS:/mnt/c/WINDOWS/System32/Wbem:/mnt/c/WINDOWS/System32/WindowsPowerShell/v1.0:/mnt/c/WINDOWS/System32/OpenSSH:/mnt/c/Program Files/PuTTY:/mnt/c/Program Files/Git/cmd:/mnt/c/Users/eccct/anaconda3/Scripts:/mnt/c/Program Files/OpenSSH-Win64:/mnt/c/Program Files/nodejs:/mnt/c/Program Files/Microsoft SQL Server/130/Tools/Binn:/mnt/c/Program Files/Microsoft SQL Server/Client SDK/ODBC/170/Tools/Binn:/mnt/c/Program Files/Docker/Docker/resources/bin:/mnt/c/Users/eccct/AppData/Local/Microsoft/WindowsApps:/mnt/d/A:/mnt/c/Users/eccct/Anaconda3:/mnt/c/Users/eccct/Anaconda3/Library/mingw-w64/bin:/mnt/c/Users/eccct/Anaconda3/Library/usr/bin:/mnt/c/Users/eccct/Anaconda3/Library/bin:/mnt/c/Users/eccct/Anaconda3/Scripts:/mnt/c/Users/eccct/AppData/Roaming/Aria2/maria2c.exe:/mnt/d/Microsoft VS Code/bin:/mnt/c/Users/eccct/AppData/Roaming/npm:/mnt/c/Users/eccct/.dotnet/tools:/mnt/c/Users/eccct/AppData/Local/Programs/Azure Dev CLI:/mnt/c/Users/eccct/.cache/lm-studio/bin:/snap/bin
(colo01) root@DESKTOP-5H0EB03:~/ColossalAI# echo $LD_LIBRARY_PATH
/usr/local/cuda-12.1/lib64:
(colo01) root@DESKTOP-5H0EB03:~/ColossalAI/examples/language/llama/scripts/benchmark_7B# bash gemini.sh
/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/pipeline/schedule/_utils.py:19: FutureWarning: torch.utils._pytree._register_pytree_node
is deprecated. Please use torch.utils._pytree.register_pytree_node
instead.
_register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/root/miniconda3/envs/colo01/lib/python3.10/site-packages/torch/utils/_pytree.py:332: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
warnings.warn(
/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/legacy/registry/init.py:1: DeprecationWarning: TorchScript
support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the torch.compile
optimizer instead.
import torch.distributed.optim as dist_optim
/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/accelerator/cuda_accelerator.py:282: FutureWarning: torch.cuda.amp.autocast(args...)
is deprecated. Please use torch.amp.autocast('cuda', args...)
instead.
return torch.cuda.amp.autocast(enabled=enabled, dtype=dtype, cache_enabled=cache_enabled)
[09/23/24 11:49:40] INFO colossalai - colossalai - INFO: /root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/initialize.py:75 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 1
WARNING colossalai - colossalai - WARNING: /root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/booster/plugin/gemini_plugin.py:382 __init__
WARNING colossalai - colossalai - WARNING: enable_async_reduce sets pin_memory=True to achieve best performance, which is not implicitly set.
Model params: 1.19 B
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
(colo01) root@DESKTOP-5H0EB03:~/ColossalAI/examples/language/llama/scripts/benchmark_7B# bash gemini.sh
(same deprecation warnings as in the previous run omitted)
[09/23/24 11:49:40] INFO colossalai - colossalai - INFO: /root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/initialize.py:75 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 1
WARNING colossalai - colossalai - WARNING: /root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/booster/plugin/gemini_plugin.py:382 __init__
WARNING colossalai - colossalai - WARNING: enable_async_reduce sets pin_memory=True to achieve best performance, which is not implicitly set.
Model params: 1.19 B
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
[extension] Time taken to compile cpu_adam_x86 op: 26.49249792098999 seconds
[extension] Compiling the JIT fused_optim_cuda kernel during runtime now
[extension] Time taken to compile fused_optim_cuda op: 57.339394330978394 seconds
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/zero/gemini/chunk/manager.py", line 82, in register_tensor
[rank0]:   File "/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/zero/gemini/chunk/chunk.py", line 277, in append_tensor
[rank0]:     raise ChunkFullError
[rank0]: During handling of the above exception, another exception occurred:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/ColossalAI/examples/language/llama/benchmark.py", line 364, in <module>
[rank0]:   File "/root/ColossalAI/examples/language/llama/benchmark.py", line 308, in main
[rank0]:     model, optimizer, _, dataloader, _ = booster.boost(model, optimizer, dataloader=dataloader)
[rank0]:   File "/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/booster/booster.py", line 154, in boost
[rank0]:     model, optimizer, criterion, dataloader, lr_scheduler = self.plugin.configure(
[rank0]:   File "/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/booster/plugin/gemini_plugin.py", line 571, in configure
[rank0]:     model = GeminiDDP(
[rank0]:   File "/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/zero/gemini/gemini_ddp.py", line 183, in __init__
[rank0]:   File "/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/zero/gemini/gemini_ddp.py", line 882, in _init_chunks
[rank0]:   File "/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/zero/gemini/chunk/manager.py", line 90, in register_tensor
[rank0]:   File "/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/zero/gemini/chunk/manager.py", line 250, in __close_one_chunk
[rank0]:   File "/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/zero/gemini/chunk/chunk.py", line 314, in close_chunk
[rank0]:     self.cpu_shard = torch.empty(self.shard_size, dtype=self.dtype, pin_memory=self.pin_memory)
[rank0]: RuntimeError: CUDA error: out of memory
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Failures:
Hi, can you try GCC 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.2)?
@flybird11111
(colo01) root@DESKTOP-5H0EB03:~/ColossalAI/examples/language/llama/scripts/benchmark_7B# gcc --version
gcc (Ubuntu 12.3.0-17ubuntu1) 12.3.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Do you mean to downgrade gcc 12.3 to 9.4?
yes.
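For reference, the downgrade can be done with update-alternatives. A sketch (package availability varies by Ubuntu release; gcc-9 may need the ubuntu-toolchain-r/test PPA on newer releases):

```shell
# Sketch: install gcc-9/g++-9 and make them the default compiler.
sudo apt update
sudo apt install gcc-9 g++-9

# Register gcc-9 with a higher priority than the existing gcc-12 entry,
# keeping g++ in lockstep; or pick interactively with --config.
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 90 \
    --slave /usr/bin/g++ g++ /usr/bin/g++-9
sudo update-alternatives --config gcc

gcc --version   # should now report 9.x
```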
@flybird11111
(colo01) root@DESKTOP-5H0EB03:~/ColossalAI/examples/language/llama/scripts/benchmark_7B# sudo update-alternatives --config gcc
There are 3 choices for the alternative gcc (providing /usr/bin/gcc).
  0  /usr/bin/gcc-12  60  auto mode
  1  /usr/bin/gcc-10  60  manual mode
Press
(colo01) root@DESKTOP-5H0EB03:~/ColossalAI/examples/language/llama/scripts/benchmark_7B# gcc --version
gcc (Ubuntu 9.5.0-6ubuntu2) 9.5.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
(colo01) root@DESKTOP-5H0EB03:~/ColossalAI/examples/language/llama/scripts/benchmark_7B# bash gemini.sh
(same deprecation warnings as in the previous runs omitted)
[09/23/24 17:52:56] INFO colossalai - colossalai - INFO: /root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/initialize.py:75 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 1
WARNING colossalai - colossalai - WARNING: /root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/booster/plugin/gemini_plugin.py:382 __init__
WARNING colossalai - colossalai - WARNING: enable_async_reduce sets pin_memory=True to achieve best performance, which is not implicitly set.
Model params: 615.01 M
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
[extension] Time taken to compile cpu_adam_x86 op: 0.08598995208740234 seconds
[extension] Compiling the JIT fused_optim_cuda kernel during runtime now
[extension] Time taken to compile fused_optim_cuda op: 0.09327149391174316 seconds
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/zero/gemini/chunk/manager.py", line 82, in register_tensor
[rank0]:   File "/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/zero/gemini/chunk/chunk.py", line 277, in append_tensor
[rank0]:     raise ChunkFullError
[rank0]: During handling of the above exception, another exception occurred:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/ColossalAI/examples/language/llama/benchmark.py", line 364, in <module>
[rank0]:   File "/root/ColossalAI/examples/language/llama/benchmark.py", line 308, in main
[rank0]:     model, optimizer, _, dataloader, _ = booster.boost(model, optimizer, dataloader=dataloader)
[rank0]:   File "/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/booster/booster.py", line 154, in boost
[rank0]:     model, optimizer, criterion, dataloader, lr_scheduler = self.plugin.configure(
[rank0]:   File "/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/booster/plugin/gemini_plugin.py", line 571, in configure
[rank0]:     model = GeminiDDP(
[rank0]:   File "/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/zero/gemini/gemini_ddp.py", line 183, in __init__
[rank0]:   File "/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/zero/gemini/gemini_ddp.py", line 882, in _init_chunks
[rank0]:   File "/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/zero/gemini/chunk/manager.py", line 90, in register_tensor
[rank0]:   File "/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/zero/gemini/chunk/manager.py", line 250, in __close_one_chunk
[rank0]:   File "/root/miniconda3/envs/colo01/lib/python3.10/site-packages/colossalai-0.4.4-py3.10.egg/colossalai/zero/gemini/chunk/chunk.py", line 314, in close_chunk
[rank0]:     self.cpu_shard = torch.empty(self.shard_size, dtype=self.dtype, pin_memory=self.pin_memory)
[rank0]: RuntimeError: CUDA error: out of memory
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Failures:
[rank0]: RuntimeError: CUDA error: out of memory
It seems the issue is due to insufficient memory. Note that the failing call allocates pinned host memory (torch.empty with pin_memory=True), so the shortage may be host-side pinned memory rather than GPU memory.
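Since the failing line is torch.empty(..., pin_memory=True), the allocation goes through the CUDA driver even though it lives on the host, and pinned memory is particularly limited under WSL2. A minimal probe for this (illustrative, not from the thread; the size is arbitrary):

```python
# Hypothetical probe: pinned (page-locked) host buffers are allocated via the
# CUDA driver, so they can raise "CUDA error: out of memory" even when the
# GPU itself has free memory -- a known constraint under WSL2.
try:
    import torch
except ImportError:  # allow running the sketch without PyTorch installed
    torch = None

def probe_pinned(nbytes: int) -> str:
    """Try to allocate `nbytes` of pinned host memory and report the result."""
    if torch is None:
        return "torch not installed"
    try:
        torch.empty(nbytes, dtype=torch.uint8, pin_memory=True)
        return "pinned allocation OK"
    except RuntimeError as exc:
        return f"pinned allocation failed: {exc}"

if __name__ == "__main__":
    # ~256 MiB; raise this to find where pinned allocations start failing.
    print(probe_pinned(256 << 20))
```

If the probe fails at modest sizes, reducing the model or batch size, avoiding enable_async_reduce (which forces pin_memory=True per the warning above), or raising the WSL2 memory limit in .wslconfig may help.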
📚 The doc issue
Windows 11 with an Ubuntu 24.04 subsystem (WSL2), installed following the guide at https://colossalai.org/zh-Hans/docs/get_started/installation. The concrete steps were as follows:
export CUDA_INSTALL_DIR=/usr/local/cuda-12.1
export CUDA_HOME=/usr/local/cuda-12.1
export LD_LIBRARY_PATH=$CUDA_HOME"/lib64:$LD_LIBRARY_PATH"
export PATH=$CUDA_HOME"/bin:$PATH"
conda create -n colo01 python=3.10
conda activate colo01
export PATH=~/miniconda3/envs/colo01/bin:$PATH
sudo apt update
sudo apt install gcc-10 g++-10
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 60
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-12 60
sudo update-alternatives --config gcc
gcc --version
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda_12.1.0_530.30.02_linux.run
sudo sh cuda_12.1.0_530.30.02_linux.run
Verify the CUDA installation: nvidia-smi
conda install nvidia/label/cuda-12.1.0::cuda-toolkit
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
pip install -r requirements/requirements.txt
CUDA_EXT=1 pip install .
Install the related development libraries:
pip install transformers
pip install xformers
pip install datasets tensorboard
Run the benchmark. Step 1: change into the directory:
cd examples/language/llama/scripts/benchmark_7B
Edit gemini.sh, then run:
bash gemini.sh
After running it, the following error was reported:
[rank0]: ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.
Then FlashAttention-2 was installed successfully:
pip install packaging
pip install ninja
ninja --version
echo $?
conda install -c conda-channel attention2
pip install flash-attn --no-build-isolation
Ran bash gemini.sh again, and there were still errors. Please advise based on the uploaded log file, and ideally improve the installation documentation. Thanks!
log.txt
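Pulling together the fixes from this thread, a corrected build sequence might look like the sketch below (BUILD_EXT=1 in place of the older CUDA_EXT=1 flag; paths match the environment from the original report and are assumptions for any other setup):

```shell
# Sketch of the corrected build steps discussed in this thread.
export CUDA_HOME=/usr/local/cuda-12.1
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

cd ~/ColossalAI
pip install -r requirements/requirements.txt

# BUILD_EXT=1 prebuilds the CUDA kernels at install time; without it they
# are JIT-compiled on first use, which is the long "pause" seen in gemini.sh.
BUILD_EXT=1 pip install .

# Verify version compatibility and CUDA extension status afterwards.
colossalai check -i
```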