Closed psopha closed 1 year ago

### 🐛 Describe the bug
```
GPU Memory Usage:
0  0 MiB
1  0 MiB
Now CUDA_VISIBLE_DEVICES is set to: CUDA_VISIBLE_DEVICES=0,1
[02/28/23 15:42:36] INFO colossalai - colossalai - INFO: /home/anaconda3/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
[02/28/23 15:42:38] INFO colossalai - colossalai - INFO: /home/anaconda3/lib/python3.9/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 42, python random: 42, ParallelMode.DATA: 42, ParallelMode.TENSOR: 42, the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 42, python random: 42, ParallelMode.DATA: 42, ParallelMode.TENSOR: 42, the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/anaconda3/lib/python3.9/site-packages/colossalai/initialize.py:116 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 2, pipeline parallel size: 1, tensor parallel size: 1
OP colossalai._C.cpu_adam already exists, skip building.
Time to load cpu_adam op: 0.0008428096771240234 seconds
OP colossalai._C.fused_optim already exists, skip building.
Time to load fused_optim op: 3.5762786865234375e-05 seconds
OP colossalai._C.cpu_adam already exists, skip building.
Time to load cpu_adam op: 4.887580871582031e-05 seconds
OP colossalai._C.fused_optim already exists, skip building.
Time to load fused_optim op: 4.100799560546875e-05 seconds
```

Both ranks then fail with the same traceback (the two copies were interleaved in the console output; shown once here):

```
Episode [1/50]:   0%|          | 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ColossalAI/applications/ChatGPT/train_dummy.py", line 122, in <module>
    main(args)
  File "/home/ColossalAI/applications/ChatGPT/train_dummy.py", line 95, in main
    trainer.fit(random_prompts,
  File "/home/ColossalAI/applications/ChatGPT/chatgpt/trainer/base.py", line 114, in fit
    experience = self._make_experience(inputs)
  File "/home/ColossalAI/applications/ChatGPT/chatgpt/trainer/base.py", line 65, in _make_experience
    return self.experience_maker.make_experience(**inputs, **self.generate_kwargs)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ColossalAI/applications/ChatGPT/chatgpt/experience_maker/naive.py", line 19, in make_experience
    sequences, attention_mask, action_mask = self.actor.generate(input_ids,
  File "/home/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ColossalAI/applications/ChatGPT/chatgpt/nn/actor.py", line 34, in generate
    sequences = generate(self.model, input_ids, **kwargs)
  File "/home/ColossalAI/applications/ChatGPT/chatgpt/nn/generation.py", line 122, in generate
    return sample(model,
  File "/home/ColossalAI/applications/ChatGPT/chatgpt/nn/generation.py", line 52, in sample
    outputs = model(**model_inputs)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/anaconda3/lib/python3.9/site-packages/transformers/models/opt/modeling_opt.py", line 928, in forward
    outputs = self.model.decoder(
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/anaconda3/lib/python3.9/site-packages/transformers/models/opt/modeling_opt.py", line 643, in forward
    inputs_embeds = self.project_in(inputs_embeds)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 36578) of binary: /home/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/anaconda3/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_dummy.py FAILED
```

### Environment

```
colossalai: 0.2.5
PyTorch version: 1.12.1
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
Clang version: Could not collect
CMake version: version 3.22.5
Libc version: glibc-2.17
Python version: 3.9.13 (main, Aug 25 2022, 23:26:10) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.18.0-147.mt20200626.413.el8_1.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.3.58
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
Nvidia driver version: 470.82.01
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.2.1
/usr/lib64/libcudnn_adv_infer.so.8.2.1
/usr/lib64/libcudnn_adv_train.so.8.2.1
/usr/lib64/libcudnn_cnn_infer.so.8.2.1
/usr/lib64/libcudnn_cnn_train.so.8.2.1
/usr/lib64/libcudnn_ops_infer.so.8.2.1
/usr/lib64/libcudnn_ops_train.so.8.2.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.5
[pip3] numpydoc==1.4.0
[pip3] torch==1.12.1+cu113
[pip3] torchaudio==0.12.1+cu113
[pip3] torchvision==0.13.1+cu113
[conda] blas           1.0           mkl
[conda] cudatoolkit    11.3.1        h2bc3f7f_2  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] ffmpeg         4.3           hf484d3e_0  pytorch
[conda] mkl            2021.4.0      h06a4308_640
[conda] mkl-service    2.4.0         py39h7f8727e_0
[conda] mkl_fft        1.3.1         py39hd3c417c_0
[conda] mkl_random     1.2.2         py39h51133e4_0
[conda] numpy          1.21.5        py39h6c91a56_3
[conda] numpy-base     1.21.5        py39ha15fc14_3
[conda] numpydoc       1.4.0         py39h06a4308_0
[conda] pytorch        1.12.1        py3.9_cuda11.3_cudnn8.3.2_0  pytorch
[conda] pytorch-mutex  1.0           cuda
[conda] torch          1.12.1+cu113  pypi_0  pypi
[conda] torchaudio     0.12.1+cu113  pypi_0  pypi
[conda] torchvision    0.13.1+cu113  pypi_0  pypi
```
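For context, `CUBLAS_STATUS_INVALID_VALUE` means cuBLAS rejected the `cublasSgemm` arguments before launching the kernel: per the cuBLAS documentation, the status is returned when `m`, `n`, or `k` is negative, or a leading dimension is too small for the corresponding operand. The sketch below mirrors those rules in plain Python just to illustrate what the status is checking; the helper name is made up and the check is simplified, not part of any library:

```python
def sgemm_args_valid(transa: bool, transb: bool,
                     m: int, n: int, k: int,
                     lda: int, ldb: int, ldc: int) -> bool:
    """Simplified sketch of the cublasSgemm argument rules whose
    violation yields CUBLAS_STATUS_INVALID_VALUE (illustrative only).

    cuBLAS matrices are column-major: lda must cover the rows of op(A),
    ldb the rows of op(B), and ldc the rows of C.
    """
    if m < 0 or n < 0 or k < 0:
        return False
    rows_op_a = k if transa else m   # op(A) is m x k
    rows_op_b = n if transb else k   # op(B) is k x n
    return (lda >= max(1, rows_op_a)
            and ldb >= max(1, rows_op_b)
            and ldc >= max(1, m))

# Well-formed 4x3 @ 3x2 product, no transposes:
print(sgemm_args_valid(False, False, 4, 2, 3, lda=4, ldb=3, ldc=4))  # True
# lda smaller than the rows of A -> INVALID_VALUE territory:
print(sgemm_args_valid(False, False, 4, 2, 3, lda=2, ldb=3, ldc=4))  # False
```

In this report the shapes coming from `F.linear` were presumably fine; the same status can surface when the CUDA installation itself is inconsistent, which is consistent with the reinstall fixing it here.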
I reinstalled CUDA 11.3, and then it worked fine.
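One failure mode consistent with that fix is a mismatch between the CUDA version PyTorch was built against (`torch.version.cuda`, here 11.3) and the toolkit actually resolved at runtime. A stdlib-only sketch of that comparison; the helper name is mine and the rule is deliberately simplified (same major version, runtime minor at least the build minor), so treat it as illustration rather than a real binary-compatibility check:

```python
def cuda_versions_compatible(build: str, runtime: str) -> bool:
    """Rough rule of thumb: same CUDA major version, and the installed
    runtime's minor version no older than the build's. Illustrative
    only -- real CUDA binary compatibility is more nuanced."""
    build_major, build_minor = (int(p) for p in build.split(".")[:2])
    rt_major, rt_minor = (int(p) for p in runtime.split(".")[:2])
    return build_major == rt_major and rt_minor >= build_minor

# torch 1.12.1+cu113 was built against CUDA 11.3; this box ran 11.3.58:
print(cuda_versions_compatible("11.3", "11.3.58"))  # True
# A stale older toolkit picked up from PATH would fail the check:
print(cuda_versions_compatible("11.3", "10.2"))     # False
```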