CUDA errors running SqueezeLLM-gradients

System Info

Running on an AWS g5.12xlarge with Ubuntu 22.04:
transformers version: 4.36.0.dev0
Platform: Linux-6.2.0-1018-aws-x86_64-with-glibc2.35
Python version: 3.9.18
Huggingface_hub version: 0.20.3
Safetensors version: 0.4.2
Accelerate version: 0.27.0
Accelerate config: not found
PyTorch version (GPU?): 2.0.1+cu117 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: Yes

Who can help?

@kssteven418

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction

We have a fine-tuned Mistral 7B model that I'm trying to quantize with SqueezeLLM for improved performance, but the gradient generation step fails. With one GPU it runs out of memory:

Command line: CUDA_VISIBLE_DEVICES=0 python run.py --output_dir ./fine-tuned-mistral-grad --model_name ./fine-tuned-mistral

splitting into 1 GPUs
  0%|          | 0/100 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/var/api/SqueezeLLM-gradients/run.py", line 252, in <module>
    train()
  File "/var/api/SqueezeLLM-gradients/run.py", line 238, in train
    loss.backward()
  File "/home/ubuntu/miniconda3/envs/sqllm-grad/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/ubuntu/miniconda3/envs/sqllm-grad/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/var/api/SqueezeLLM-gradients/run.py", line 226, in square_grad_hook
    return grad.pow(2)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 21.99 GiB total capacity; 21.37 GiB already allocated; 111.38 MiB free; 21.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
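
As a rough sanity check on the out-of-memory error, here is a back-of-the-envelope estimate of the memory needed just for the weights plus one gradient per weight. It is a sketch under assumptions that are not confirmed by the script: roughly 7.24 billion parameters for Mistral 7B, 2 bytes per value (fp16/bf16), and gradients kept on the same device as the weights. The helper name `grad_memory_gib` is made up for illustration.

```python
GIB = 1024 ** 3  # bytes per GiB, matching the units in the CUDA error message


def grad_memory_gib(n_params, bytes_per_value=2):
    """Estimate memory for weights plus one gradient per weight, in GiB.

    Assumes weights and gradients use the same precision; activations,
    optimizer state and allocator overhead are ignored.
    """
    weights = n_params * bytes_per_value / GIB
    grads = n_params * bytes_per_value / GIB
    return weights, grads, weights + grads


# Assumption: approximate parameter count for Mistral 7B.
n_params = 7.24e9
weights, grads, total = grad_memory_gib(n_params)
print(f"weights ~{weights:.1f} GiB, grads ~{grads:.1f} GiB, total ~{total:.1f} GiB")
```

If those assumptions hold, the total is already above the 21.99 GiB reported for GPU 0 before any activations are counted, which would be consistent with the failure during `loss.backward()`.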

So I tried it with 4 GPUs and got a different error:

Command line: CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py --output_dir ./fine-tuned-mistral-grad --model_name ./fine-tuned-mistral

splitting into 4 GPUs
cuda:0 for 0
cuda:0 for 1
cuda:0 for 2
cuda:0 for 3
cuda:0 for 4
cuda:0 for 5
cuda:0 for 6
cuda:0 for 7
cuda:1 for 8
cuda:1 for 9
cuda:1 for 10
cuda:1 for 11
cuda:1 for 12
cuda:1 for 13
cuda:1 for 14
cuda:1 for 15
cuda:2 for 16
cuda:2 for 17
cuda:2 for 18
cuda:2 for 19
cuda:2 for 20
cuda:2 for 21
cuda:2 for 22
cuda:2 for 23
cuda:3 for 24
cuda:3 for 25
cuda:3 for 26
cuda:3 for 27
cuda:3 for 28
cuda:3 for 29
cuda:3 for 30
cuda:3 for 31
  0%|          | 0/100 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/var/api/SqueezeLLM-gradients/run.py", line 252, in <module>
    train()
  File "/var/api/SqueezeLLM-gradients/run.py", line 236, in train
    outputs = model(input_ids=x, labels=x)
  File "/home/ubuntu/miniconda3/envs/sqllm-grad/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/var/api/SqueezeLLM-gradients/src/transformers/models/mistral/modeling_mistral.py", line 1041, in forward
    outputs = self.model(
  File "/home/ubuntu/miniconda3/envs/sqllm-grad/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/var/api/SqueezeLLM-gradients/src/transformers/models/mistral/modeling_mistral.py", line 929, in forward
    layer_outputs = decoder_layer(
  File "/home/ubuntu/miniconda3/envs/sqllm-grad/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/var/api/SqueezeLLM-gradients/src/transformers/models/mistral/modeling_mistral.py", line 624, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/ubuntu/miniconda3/envs/sqllm-grad/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/var/api/SqueezeLLM-gradients/src/transformers/models/mistral/modeling_mistral.py", line 257, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
  File "/var/api/SqueezeLLM-gradients/src/transformers/models/mistral/modeling_mistral.py", line 156, in apply_rotary_pos_emb
    cos = cos[position_ids].unsqueeze(unsqueeze_dim)
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)
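
The message suggests that, once the layers are split across GPUs, `position_ids` is still on a different device than the `cos` cache it is used to index. The sketch below shows the kind of workaround that might apply: moving the indices to the device of the indexed tensor before the lookup. It is not verified against the repository, `lookup_cos` is a hypothetical helper rather than a function in `modeling_mistral.py`, and the tensor shapes are illustrative only. It runs on CPU so the pattern can be checked without GPUs.

```python
import torch


def lookup_cos(cos, position_ids, unsqueeze_dim=1):
    """Index the rotary cos cache after aligning devices.

    Hypothetical helper: mirrors the failing line
    `cos[position_ids].unsqueeze(unsqueeze_dim)` but first moves
    `position_ids` to the device that holds `cos`.
    """
    position_ids = position_ids.to(cos.device)
    return cos[position_ids].unsqueeze(unsqueeze_dim)


# Illustrative shapes only: a [seq_len, head_dim] cache and [batch, seq] ids.
cos = torch.randn(16, 8)
position_ids = torch.arange(4).unsqueeze(0)
out = lookup_cos(cos, position_ids)
print(out.shape)
```

The same `.to(cos.device)` call would presumably be needed for the `sin` lookup as well, since both caches live with the layer that owns them.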

I get a similar error with 2 GPUs. Is there something I can do to resolve this error? Would a GPU with more memory work?

Expected behavior

The script should generate the gradient version of the model without running out of memory.

kssteven418 / SqueezeLLM-gradients