Using distributed or parallel set-up in script?: Yes
Who can help?
@kssteven418
Information
[X] The official example scripts
[ ] My own modified scripts
Tasks
[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)
Reproduction
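We have a fine-tuned Mistral 7B model that I'm trying to quantize with SqueezeLLM for improved performance, but the gradient generation step runs into issues. With one GPU it runs out of memory:
Command line: CUDA_VISIBLE_DEVICES=0 python run.py --output_dir ./fine-tuned-mistral-grad --model_name ./fine-tuned-mistral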
splitting into 1 GPUs
0%| | 0/100 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/var/api/SqueezeLLM-gradients/run.py", line 252, in <module>
train()
File "/var/api/SqueezeLLM-gradients/run.py", line 238, in train
loss.backward()
File "/home/ubuntu/miniconda3/envs/sqllm-grad/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/ubuntu/miniconda3/envs/sqllm-grad/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/var/api/SqueezeLLM-gradients/run.py", line 226, in square_grad_hook
return grad.pow(2)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 21.99 GiB total capacity; 21.37 GiB already allocated; 111.38 MiB free; 21.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
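As a rough sanity check on why a single GPU may not be enough, here is a back-of-the-envelope estimate. It is only a sketch: the parameter count (about 7.24 billion) and the 16-bit weights and gradients are assumptions, not values read from run.py, and activations are ignored entirely. The 21.99 GiB capacity is taken from the error message above.

```python
# Rough memory estimate for gradient generation on one GPU.
# ASSUMPTIONS (not taken from run.py): ~7.24e9 parameters, 2 bytes per
# weight and 2 bytes per gradient value, gradients kept for every parameter.
GIB = 1024 ** 3


def estimate_gib(num_params, bytes_per_weight=2, bytes_per_grad=2):
    """Return (weights, gradients, total) memory in GiB, excluding activations."""
    weights = num_params * bytes_per_weight / GIB
    grads = num_params * bytes_per_grad / GIB
    return weights, grads, weights + grads


NUM_PARAMS = 7.24e9        # assumed approximate size of Mistral 7B
GPU_CAPACITY_GIB = 21.99   # total capacity reported in the OOM message

weights_gib, grads_gib, total_gib = estimate_gib(NUM_PARAMS)
fits = total_gib <= GPU_CAPACITY_GIB

print(f"weights:   {weights_gib:.2f} GiB")
print(f"gradients: {grads_gib:.2f} GiB")
print(f"total:     {total_gib:.2f} GiB (fits on one GPU: {fits})")
```

Under these assumptions the weights alone fit, but weights plus gradients do not, which would be consistent with the failure happening during loss.backward().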
So I tried it with 4 GPUs and get a different error:
Command line: CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py --output_dir ./fine-tuned-mistral-grad --model_name ./fine-tuned-mistral
splitting into 4 GPUs
cuda:0 for 0
cuda:0 for 1
cuda:0 for 2
cuda:0 for 3
cuda:0 for 4
cuda:0 for 5
cuda:0 for 6
cuda:0 for 7
cuda:1 for 8
cuda:1 for 9
cuda:1 for 10
cuda:1 for 11
cuda:1 for 12
cuda:1 for 13
cuda:1 for 14
cuda:1 for 15
cuda:2 for 16
cuda:2 for 17
cuda:2 for 18
cuda:2 for 19
cuda:2 for 20
cuda:2 for 21
cuda:2 for 22
cuda:2 for 23
cuda:3 for 24
cuda:3 for 25
cuda:3 for 26
cuda:3 for 27
cuda:3 for 28
cuda:3 for 29
cuda:3 for 30
cuda:3 for 31
0%| | 0/100 [00:02<?, ?it/s]
Traceback (most recent call last):
File "/var/api/SqueezeLLM-gradients/run.py", line 252, in <module>
train()
File "/var/api/SqueezeLLM-gradients/run.py", line 236, in train
outputs = model(input_ids=x, labels=x)
File "/home/ubuntu/miniconda3/envs/sqllm-grad/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/var/api/SqueezeLLM-gradients/src/transformers/models/mistral/modeling_mistral.py", line 1041, in forward
outputs = self.model(
File "/home/ubuntu/miniconda3/envs/sqllm-grad/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/var/api/SqueezeLLM-gradients/src/transformers/models/mistral/modeling_mistral.py", line 929, in forward
layer_outputs = decoder_layer(
File "/home/ubuntu/miniconda3/envs/sqllm-grad/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/var/api/SqueezeLLM-gradients/src/transformers/models/mistral/modeling_mistral.py", line 624, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/ubuntu/miniconda3/envs/sqllm-grad/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/var/api/SqueezeLLM-gradients/src/transformers/models/mistral/modeling_mistral.py", line 257, in forward
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
File "/var/api/SqueezeLLM-gradients/src/transformers/models/mistral/modeling_mistral.py", line 156, in apply_rotary_pos_emb
cos = cos[position_ids].unsqueeze(unsqueeze_dim)
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)
I get a similar error with 2 GPUs. Is there something I can do to resolve this error? Would a GPU with more memory work?
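The "cuda:N for M" lines in the 4-GPU log look like an even split of the 32 decoder layers across the visible GPUs. The sketch below reproduces that mapping; `assign_layers` is a hypothetical helper written for illustration, and the even-split rule is an assumption inferred from the log, not the confirmed logic in run.py.

```python
# Reproduce the layer-to-device mapping seen in the 4-GPU log.
# ASSUMPTION: layers are split evenly across GPUs in order; assign_layers
# is a hypothetical helper, not a function from run.py.


def assign_layers(num_layers, num_gpus):
    """Map each decoder layer index to a device string, splitting evenly."""
    per_gpu = -(-num_layers // num_gpus)  # ceiling division
    return {layer: f"cuda:{layer // per_gpu}" for layer in range(num_layers)}


placement = assign_layers(num_layers=32, num_gpus=4)
for layer, device in placement.items():
    print(f"{device} for {layer}")

# First layer that is not placed on cuda:0.
first_off_gpu0 = min(layer for layer, dev in placement.items() if dev != "cuda:0")
print(f"first layer off cuda:0: {first_off_gpu0}")
```

With this mapping, layer 8 is the first one on cuda:1, which is the device named in the RuntimeError. That would fit position_ids still living on cuda:0 while cos has moved with the layer, so moving position_ids to cos.device before the indexing in apply_rotary_pos_emb would be one thing to try.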
Expected behavior
The script should generate the gradient version of the model without running out of memory or hitting a device mismatch.
System Info
transformers version: 4.36.0.dev0
Running on an AWS g5.12xlarge with Ubuntu 22.04