togethercomputer / OpenChatKit


Token indices sequence length is longer than the specified maximum sequence length for this model (4158 > 2048) #102

Open tginart opened 1 year ago

tginart commented 1 year ago

**Describe the bug**
Running the Pythia-7B fine-tuning script on 4 x A10 GPUs (24 GB each).

This appears to be an issue with the sequence length:

```
Token indices sequence length is longer than the specified maximum sequence length for this model (4158 > 2048). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
  File "/home/ec2-user/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
    main()
  File "/home/ec2-user/OpenChatKit/training/dist_clm_train.py", line 332, in main
    train_loop(args, pipe, device, train_data_loader, test_data_loader)
  File "/home/ec2-user/OpenChatKit/training/dist_clm_train.py", line 151, in train_loop
    get_data_parallel_comm().recv(
  File "/home/ec2-user/OpenChatKit/training/comm/nccl_backend.py", line 79, in recv
    self.comm.recv(
  File "cupy_backends/cuda/libs/nccl.pyx", line 477, in cupy_backends.cuda.libs.nccl.NcclCommunicator.recv
  File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
Traceback (most recent call last):
  File "/home/ec2-user/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
    main()
  File "/home/ec2-user/OpenChatKit/training/dist_clm_train.py", line 332, in main
    train_loop(args, pipe, device, train_data_loader, test_data_loader)
  File "/home/ec2-user/OpenChatKit/training/dist_clm_train.py", line 117, in train_loop
    get_data_parallel_comm().send(
  File "/home/ec2-user/OpenChatKit/training/comm/nccl_backend.py", line 65, in send
    self.comm.send(
  File "cupy_backends/cuda/libs/nccl.pyx", line 468, in cupy_backends.cuda.libs.nccl.NcclCommunicator.send
  File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
```

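The tokenizer warning on the first line indicates that at least one training example exceeds the model's 2048-token context window; if an over-length batch reaches the pipeline, shape mismatches between stages could plausibly surface as the NCCL errors above. A minimal sketch of clipping examples at tokenization time, assuming a Hugging Face tokenizer (the checkpoint name below is illustrative, not necessarily the one the script loads):

```python
from transformers import AutoTokenizer

# Illustrative checkpoint name; substitute whatever tokenizer the
# fine-tuning script actually uses.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-6.9b")

text = "some very long training document " * 500

# truncation=True clips the token sequence to max_length, so nothing
# longer than the model's context window (2048 here) reaches training.
ids = tokenizer(text, truncation=True, max_length=2048)["input_ids"]
assert len(ids) <= 2048
```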
**To Reproduce**
Steps to reproduce the behavior:
Run the Pythia training script with the following modifications:

```
--num-layers 16 --embedding-dim 4096 \
--world-size 4 --pipeline-group-size 2 --data-group-size 2 \
```

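For what it's worth, these flag values look self-consistent under the usual convention that the world size is the product of the pipeline and data group sizes. A quick sanity check, assuming that convention applies to this launcher (the variable names below are illustrative):

```python
# Assumed convention: total processes = pipeline stages x data-parallel replicas.
world_size = 4
pipeline_group_size = 2
data_group_size = 2

assert world_size == pipeline_group_size * data_group_size, (
    "--world-size should equal --pipeline-group-size * --data-group-size"
)
```
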
**Expected behavior**
Training should run without errors. This is on a standard AWS Deep Learning AMI with CUDA.

yangnicejin commented 1 year ago

This problem also occurred when I tried to reproduce training with the 20B model.

haifaksh commented 1 year ago

Hi, did you find a solution? I am facing the same problem while testing Alpa for distributed parallel training.

```
NVIDIA-SMI 470.182.03    Driver Version: 470.182.03    CUDA Version: 11.4
```

```
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
```

When I run `python3 -m alpa.test_install`:

```
  File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
  File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
```

Any help would be really appreciated. I have tried different versions of CUDA, and I get the same error every time.

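One thing worth ruling out is a mismatch between the driver's CUDA version (11.4 above), the toolkit (11.2 per nvcc), and the CUDA release the installed CuPy wheel was built against; such mismatches can surface as NCCL_ERROR_UNHANDLED_CUDA_ERROR. A minimal sketch for checking what CuPy itself reports, assuming a standard CuPy install:

```python
import cupy as cp

# The CUDA runtime CuPy was built against vs. the driver's supported CUDA
# version; a cupy-cudaXXx wheel that does not match the local toolkit is
# a plausible culprit for unhandled CUDA errors inside NCCL.
print("CuPy version:        ", cp.__version__)
print("CUDA runtime version:", cp.cuda.runtime.runtimeGetVersion())
print("CUDA driver version: ", cp.cuda.runtime.driverGetVersion())

# Trivial kernel launch to confirm basic GPU access works outside NCCL.
x = cp.arange(8)
print("sum on GPU:", int(x.sum()))
```
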
huhuiqi7 commented 1 month ago

I am also hitting the same error.