tginart opened this issue 1 year ago
This problem also occurred when I reproduced the 20B model
Hi, did you find a solution? I am facing the same problem. I am trying to test alpa for distributed parallel training.

NVIDIA-SMI 470.182.03, Driver Version: 470.182.03, CUDA Version: 11.4

```
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
```
When I run `python3 -m alpa.test_install`:

```
File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.init
File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
```
Any help would be really appreciated. I tried different versions of CUDA; same error every time.
also same error...
**Describe the bug**
Running the Pythia-7B fine-tune script on 4 x A10 (24 GB each).
Seems like an issue with the sequence length:
```
Token indices sequence length is longer than the specified maximum sequence length for this model (4158 > 2048). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
  File "/home/ec2-user/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
    main()
  File "/home/ec2-user/OpenChatKit/training/dist_clm_train.py", line 332, in main
    train_loop(args, pipe, device, train_data_loader, test_data_loader)
  File "/home/ec2-user/OpenChatKit/training/dist_clm_train.py", line 151, in train_loop
    get_data_parallel_comm().recv(
  File "/home/ec2-user/OpenChatKit/training/comm/nccl_backend.py", line 79, in recv
    self.comm.recv(
  File "cupy_backends/cuda/libs/nccl.pyx", line 477, in cupy_backends.cuda.libs.nccl.NcclCommunicator.recv
  File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
Traceback (most recent call last):
  File "/home/ec2-user/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
    main()
  File "/home/ec2-user/OpenChatKit/training/dist_clm_train.py", line 332, in main
    train_loop(args, pipe, device, train_data_loader, test_data_loader)
  File "/home/ec2-user/OpenChatKit/training/dist_clm_train.py", line 117, in train_loop
    get_data_parallel_comm().send(
  File "/home/ec2-user/OpenChatKit/training/comm/nccl_backend.py", line 65, in send
    self.comm.send(
  File "cupy_backends/cuda/libs/nccl.pyx", line 468, in cupy_backends.cuda.libs.nccl.NcclCommunicator.send
  File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
```
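The warning at the top of the log suggests the dataset contains examples longer than Pythia's 2048-token context window (4158 > 2048). A minimal sketch of one possible workaround, splitting over-long token-id sequences into context-sized chunks before batching (the `chunk_token_ids` helper is hypothetical, not part of OpenChatKit):

```python
def chunk_token_ids(token_ids, max_len=2048):
    """Split a flat list of token ids into pieces of at most max_len tokens."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

# Mirrors the 4158-token example from the log above.
ids = list(range(4158))
chunks = chunk_token_ids(ids)
print([len(c) for c in chunks])  # → [2048, 2048, 62]
```

Whether truncation or chunking is appropriate depends on the dataset; either way, no sequence fed to the model should exceed the model's maximum length.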