nebuly-ai / optimate

A collection of libraries to optimise AI model performances
https://www.nebuly.com/
Apache License 2.0

[ChatLlama] Error in the start of OPT1.3B actor pre-training #284

Closed swang99 closed 1 year ago

swang99 commented 1 year ago

Hello, I am trying to pre-train the actor model, but around the 815th or 816th example the training stops with the very long error message below. I had already trained the reward model, so I have been running the steps as separate commands instead of pipelining them. Any idea what might be causing this? Thank you.

../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [80,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [188,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [188,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [189,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [189,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[the same assertion repeats for many more threads of blocks 146 and 404]

Traceback (most recent call last):
  File "/content/drive/MyDrive/Colab Notebooks/llama/artifacts/main.py", line 51, in
    return forward_call(*input, **kwargs)
  File "<@beartype(chatllama.rlhf.actor.ActorModel.forward) at 0x7f72e6137040>", line 51, in forward
  File "/usr/local/lib/python3.9/dist-packages/chatllama/rlhf/actor.py", line 154, in forward
    model_output = self.model.forward(
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/opt/modeling_opt.py", line 930, in forward
    outputs = self.model.decoder(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/opt/modeling_opt.py", line 696, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/opt/modeling_opt.py", line 326, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/opt/modeling_opt.py", line 171, in forward
    query_states = self.q_proj(hidden_states) * self.scaling
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())`

PierpaoloSorbellini commented 1 year ago

Hi @swang99, thanks for reaching out! This is a known error: some samples in the dataset have a sequence length that is too long for the model. It should be fixed in PR #233.
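Until that fix is available, one way to confirm the diagnosis is to scan the tokenized dataset for samples that would index past the model's limits. The following is only a rough sketch and not what PR #233 does: the function names and the `max_positions` / `vocab_size` parameters are hypothetical, and the real limits would have to be read from the model configuration. It works on plain lists of token ids, so it can run on CPU before any GPU training starts.

```python
from typing import Dict, List


def find_bad_samples(
    samples: List[List[int]], max_positions: int, vocab_size: int
) -> Dict[int, str]:
    """Map sample index -> reason for samples that would index out of range.

    Hypothetical helper: max_positions and vocab_size are assumptions that
    should be taken from the actual model configuration.
    """
    bad: Dict[int, str] = {}
    for i, ids in enumerate(samples):
        if len(ids) > max_positions:
            # More tokens than the model has position embeddings for.
            bad[i] = f"length {len(ids)} > {max_positions}"
        elif any(t < 0 or t >= vocab_size for t in ids):
            # Token id outside the embedding table.
            bad[i] = "token id outside vocabulary"
    return bad


def truncate_samples(samples: List[List[int]], max_positions: int) -> List[List[int]]:
    """Cut every sample down to at most max_positions tokens."""
    return [ids[:max_positions] for ids in samples]


# Toy data with deliberately tiny limits, for illustration only.
toy = [[1, 2, 3], [1, 2, 3, 4, 5, 6], [1, 99]]
report = find_bad_samples(toy, max_positions=4, vocab_size=50)
shortened = truncate_samples(toy, max_positions=4)
print(report)
print(shortened)
```

Whether to truncate or drop over-long samples is a judgment call: truncation keeps every example but may cut off the completion, while dropping keeps the remaining examples intact. Rerunning the failing step on CPU or with the environment variable `CUDA_LAUNCH_BLOCKING=1` also tends to produce a traceback that points at the actual indexing operation instead of a later matrix multiplication.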