lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

RuntimeError: CUDA error: device-side assert triggered #2113

Open lw3259111 opened 1 year ago

lw3259111 commented 1 year ago

I am getting the following error when trying to fine-tune the 7B models from a Llama 2 base:

../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [640,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "/home/ubuntu/project/project/fastchat/FastChat/fastchat/train/train_mem.py", line 13, in <module>
    train()
  File "/home/ubuntu/project/project/fastchat/FastChat/fastchat/train/train.py", line 270, in train
    trainer.train(resume_from_checkpoint=True)
  File "/home/ubuntu/miniconda3/envs/lmflow/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/ubuntu/miniconda3/envs/lmflow/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ubuntu/miniconda3/envs/lmflow/lib/python3.10/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ubuntu/miniconda3/envs/lmflow/lib/python3.10/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/home/ubuntu/miniconda3/envs/lmflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/lmflow/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/lmflow/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1769, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/ubuntu/miniconda3/envs/lmflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/lmflow/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
    outputs = self.model(
  File "/home/ubuntu/miniconda3/envs/lmflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/lmflow/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 646, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/home/ubuntu/miniconda3/envs/lmflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/lmflow/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/ubuntu/miniconda3/envs/lmflow/lib/python3.10/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered 
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
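
A note on reading the log above: the assertion `srcIndex < srcSelectDimSize` fires inside the embedding lookup (`self.embed_tokens(input_ids)`), which typically means at least one token id in `input_ids` is not a valid row index of the embedding table. The sketch below is one way to check a batch for such ids before it reaches the GPU. The function name `find_out_of_range_ids`, the sample batch, and the use of 32000 as the table size are illustrative assumptions, not part of FastChat or the original report.

```python
from typing import List, Sequence, Tuple


def find_out_of_range_ids(
    batch: Sequence[Sequence[int]], num_embeddings: int
) -> List[Tuple[int, int, int]]:
    """Return (row, position, token_id) for every id outside [0, num_embeddings)."""
    bad = []
    for row, sequence in enumerate(batch):
        for position, token_id in enumerate(sequence):
            if token_id < 0 or token_id >= num_embeddings:
                bad.append((row, position, token_id))
    return bad


if __name__ == "__main__":
    # Assumption: an embedding table with 32000 rows, so valid ids are 0..31999.
    num_embeddings = 32000
    # Hypothetical batch; the id 32000 stands in for a token added to the
    # tokenizer without resizing the model's embedding table.
    batch = [[1, 15043, 2], [1, 32000, 2]]
    print(find_out_of_range_ids(batch, num_embeddings))  # [(1, 1, 32000)]
```

In an actual run, the table size could be read from `model.get_input_embeddings().num_embeddings` and compared with `len(tokenizer)`; if the tokenizer is larger, `model.resize_token_embeddings(len(tokenizer))` is the usual remedy. Setting `CUDA_LAUNCH_BLOCKING=1` or running a single batch on CPU may also make the failing call easier to locate, since CUDA errors are reported asynchronously.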

Environment

surak commented 1 year ago

This looks like a PyTorch problem rather than a FastChat one. Which exact PyTorch version and wheel file are you using? Can you try 2.0.0+cu117?
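
For anyone answering the version question, a minimal sketch of how the relevant versions could be collected in one place. The helper names `installed_version` and `describe_environment` and the default package list are assumptions chosen for illustration; `torch.version.cuda` reports the CUDA version the installed wheel was built against, which can differ from what `nvcc --version` prints.

```python
from importlib import metadata
from typing import Dict, Optional, Sequence


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def describe_environment(
    packages: Sequence[str] = ("torch", "transformers", "deepspeed"),
) -> Dict[str, Optional[str]]:
    """Collect package versions plus the CUDA version PyTorch was built with."""
    report: Dict[str, Optional[str]] = {name: installed_version(name) for name in packages}
    try:
        import torch  # imported lazily so the script still runs without PyTorch

        report["torch.version.cuda"] = torch.version.cuda
    except ImportError:
        report["torch.version.cuda"] = None
    return report


if __name__ == "__main__":
    for name, version in describe_environment().items():
        print(f"{name}: {version}")
```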

azulika commented 1 year ago

Same problem here.

surak commented 1 year ago

@azulika would you mind sharing your PyTorch and CUDA versions?

azulika commented 1 year ago

torch==2.2.0.dev20231030+cu121

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:09:35_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0