Open dshrikumar opened 1 year ago
Could you provide some details.
What type of GPU are you trying to run this on? How many GPUs? What is the GPU memory?
Are you trying to generate sentence or paragraph answers? How many training examples do you have?
Sure, Thank for your response.
I'm running it on a single GPU - Nvidia Gefore RTX 3080 Laptop GPU. GPU memory is 16GB.
I'm finetuning the model using HuggingFace transformers library for question answering. I'm trying to generate sentence answers. To start off, I tried just using around 500 training examples.
The dataset csv is of the format:
instruction,output
sample instruction,sample ouput
How much RAM does your laptop have? There are ways to train the model on a single GPU with cpu offloading but you need 50GB RAM on the machine as well I believe.
I am hoping to update the code and post some new instructions on various fine-tuning scenarios.
But it sounds like you want to process 500 prompt --> response pairs ...
Or to put it another way ... I have gotten single GPU training working and it starts at 14GB on GPU and ends up at 18.8GB ... and it looks like it is using 50GB of RAM on the machine ... I am not sure what would happen with the resources you have ...
Laptop has 24GB Ram. Yes you're right, I want to Process 500 prompt response pairs.
But, I'm using Low Rank Adaptation approach. I was able to finetune Repajama 3B model with same dataset on this laptop with Low Rank Adaptation(with and without 8bit optimisation).
So, does
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling
`cublasCreate(handle)
Mean that this is to do with the GPU and Ram?
And sure, thanks. Updates on Fine tuning scenarios would really help.
I will see if I can get the LoRA version working, I have never tried that ...
Sure, thank you!
Hi, @J38 . Actually, I encountered the same problem. I used the code in ./finetune/textgen
for text generation, and I processed my data into the expected format. I trained it on 4*A100 40G, and it was running fine. However, at a certain iteration, the training terminated, and the following error occurred:
0%| | 97/30152 [06:44<37:45:09, 4.52s/it]/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [36,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [37,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [38,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [39,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [40,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [41,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [42,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [43,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [44,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [45,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [46,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [47,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [48,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [49,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [50,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [51,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [52,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [53,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [54,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [55,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [56,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [57,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [58,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [59,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [60,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [61,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [5,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [6,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [7,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [8,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [9,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [10,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [11,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [12,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [13,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [14,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [15,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [16,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [17,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [18,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [19,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [20,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [21,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [22,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [23,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [24,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [25,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [26,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [27,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [100,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [101,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [102,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [103,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [104,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [105,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [106,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [107,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [108,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [109,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [110,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [111,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [112,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [113,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [114,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [115,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [116,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [117,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [118,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [119,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [120,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [121,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [122,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [123,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [67,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [68,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [69,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [70,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [71,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [72,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [73,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [74,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [75,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [76,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [77,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [78,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [79,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [80,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [81,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [82,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [83,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [84,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [85,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [86,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [87,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [609,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
File "gpt2/finetune_for_summarization.py", line 162, in <module>
finetune()
File "gpt2/finetune_for_summarization.py", line 156, in finetune
trainer.train()
File "/home/.conda/envs/pubmedgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1543, in train
return inner_training_loop(
File "/home/.conda/envs/pubmedgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1791, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/.conda/envs/pubmedgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2539, in training_step
loss = self.compute_loss(model, inputs)
File "/home/.conda/envs/pubmedgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2571, in compute_loss
outputs = model(**inputs)
File "/home/.conda/envs/pubmedgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/.conda/envs/pubmedgpt/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/.conda/envs/pubmedgpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1724, in forward
loss = self.module(*inputs, **kwargs)
File "/home/.conda/envs/pubmedgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/.conda/envs/pubmedgpt/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1043, in forward
transformer_outputs = self.transformer(
File "/home/.conda/envs/pubmedgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/.conda/envs/pubmedgpt/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 877, in forward
outputs = torch.utils.checkpoint.checkpoint(
File "/home/.conda/envs/pubmedgpt/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 235, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/home/.conda/envs/pubmedgpt/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 96, in forward
outputs = run_function(*args)
File "/home/.conda/envs/pubmedgpt/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 873, in custom_forward
return module(*inputs, use_cache, output_attentions)
File "/home/.conda/envs/pubmedgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/.conda/envs/pubmedgpt/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 388, in forward
attn_outputs = self.attn(
File "/home/.conda/envs/pubmedgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/.conda/envs/pubmedgpt/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 329, in forward
attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
File "/home/.conda/envs/pubmedgpt/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 199, in _attn
mask_value = torch.full([], mask_value, dtype=attn_weights.dtype).to(attn_weights.device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.```
What puzzles me is that I repeated the training three times, and each time I couldn't complete a full epoch. Moreover, the iterations where the failure occurred seemed quite random. I suspect it has something to do with my dataset and tokenizer. In fact, I came across some discussions where they suggested removing special characters from my dataset, but that didn't work either. Is there anything specific I can investigate? Thank you very much. @J38
@s1ghhh I encountered a similar error while working on a different dataset and I guess it is due to the input length of the model. As mentioned by @J38, the model was trained with a fixed context length of 1024, so the source, target, and extra tokens have to fit within that size. Try changing the size of the input train target length and see if it works. It worked for me though.
Hi, I'm trying to finetune the BioMedLM for Medical Question Answering using our custom dataset using Hugging Face's transformer's library. Since we're looking to optimize the memory usage, we're using Low Rank Adaptation as well. I'm unsure of the format of the dataset that I need to use. Below is the one I'm using currently: { 'instruction': 'xyz', 'output': 'test'}, where instruction is the question and output is the answer. Below is my code - ```
import logging import torch from datasets import Dataset import pandas as pd import gc from transformers import DataCollatorForLanguageModeling from transformers import TrainingArguments, Trainer, AutoModelForCausalLM, AutoTokenizer from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType
logging.basicConfig(level=logging.DEBUG)
--------------------------------------------------------------------------------------------------------------
print("creating tokenizer from model") model_name="stanford-crfm/BioMedLM"
tokenizer = AutoTokenizer.from_pretrained(model_name,add_eos_token=True) tokenizer.pad_token_id = 0'})
print('eos_token_id:',tokenizer.eos_token_id)
tokenizer.add_special_tokens({'eos_token':'
device_type = "cuda" if torch.cuda.is_available() else "cpu"
device = torch.device(device_type) model = AutoModelForCausalLM.from_pretrained( model_name, ).to(device) model.tie_weights() # todo: understand we dont know what this doing
--------------------------------------------------------------------------------------------------------------
peft_name = 'output/biomedLM-lora' CUTOFF_LEN = 512
def tokenize(prompt, tokenizer, add_eos_token=True): result = tokenizer( prompt+"", # add the end-of-stream token
truncation=True,
max_length=CUTOFF_LEN,
padding="max_length",
)
return {
"input_ids": result["input_ids"],
"attention_mask": result["attention_mask"],
}
print("loading data from csv") df = pd.read_csv("dataset.csv") dataset = Dataset.from_pandas(df) dataset = dataset.select_columns(['instruction', 'output'])
print("splitting dataset") dataset = dataset.train_test_split(test_size = 0.33) train_data = dataset["train"] val_data = dataset["test"]
def generate_prompt(data_point): return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
Instruction:
{data_point["instruction"]}Response: {data_point["output"]}"""
print("tokenizing train and val ds") train_data = train_data.shuffle().map(lambda x: tokenize(generate_prompt(x), tokenizer)) val_data = val_data.shuffle().map(lambda x: tokenize(generate_prompt(x), tokenizer))
lora_config = LoraConfig( r = 8, lora_alpha=16, target_modules=["c_attn"], lora_dropout=0.05, bias="none", task_type=TaskType.CAUSAL_LM )
eval_steps = 50 save_steps = 50 logging_steps = 20
trainer = Trainer( model=model, train_dataset=train_data, eval_dataset=val_data, args=TrainingArguments( num_train_epochs=1, learning_rate=1e-5, logging_steps=logging_steps, evaluation_strategy="steps", save_strategy="steps", eval_steps=eval_steps, save_steps=save_steps, output_dir="./models", # where model is saved report_to="none", save_total_limit=3, load_best_model_at_end=True, push_to_hub=False, per_device_train_batch_size=1, # defines per batch size per_device_eval_batch_size=1 ), data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False), # wtf is this )
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
print("training") trainer.train()
print("saving model") trainer.model.save_pretrained(peft_name) tokenizer.save_pretrained(peft_name)
--------------------------------------------------------------------------------------------------------------../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [472,0,0], thread: [126,0,0] Assertion
srcIndex < srcSelectDimSize
failed.../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [472,0,0], thread: [127,0,0] Assertion
srcIndex < srcSelectDimSize
failed.print("cleanup") model = None tokenizer=None trainer=None gc.collect() torch.cuda.empty_cache()
--------------------------------------------------------------------------------------------------------------