Hi, I'm trying to add int8 inference for LLaMA to my code without editing the original model structure, so I tried something similar to your quantize: https://github.com/tloen/llama-int8/blob/ce74669c767e42b5082391dd0cfcb621ba40c7f9/llama/model.py#L286

First of all, it works: loading the 7B model only uses about 6-7 GB of GPU memory. But during the forward pass, GPU memory increases rapidly and then I hit CUDA out of memory. Have you ever been in this situation?

GPU: Tesla T4, 15 GB
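For reference, my conversion looks roughly like the sketch below (a simplification, not the exact code from my repo or from llama-int8; replace_linear_with_int8 and the threshold value are just placeholders):

```python
import torch.nn as nn
import bitsandbytes as bnb

def replace_linear_with_int8(module, threshold=6.0):
    """Recursively swap nn.Linear layers for bitsandbytes Linear8bitLt (sketch)."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            int8_linear = bnb.nn.Linear8bitLt(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                has_fp16_weights=False,  # keep the weight in int8 after quantization
                threshold=threshold,
            )
            int8_linear.weight = bnb.nn.Int8Params(
                child.weight.data, requires_grad=False, has_fp16_weights=False
            )
            if child.bias is not None:
                int8_linear.bias = child.bias
            setattr(module, name, int8_linear)
        else:
            replace_linear_with_int8(child, threshold)
    return module
```

As I understand it, bitsandbytes only performs the actual int8 quantization when the module is moved to the GPU (model.cuda()), which seems consistent with the ~6.87 GB I see right after loading.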
error trace:
Load model with 6.87GB.
Traceback (most recent call last):
  File "scripts/generate_lm_int8.py", line 112, in <module>
    output = model(src_tensor, seg_tensor)
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "scripts/generate_lm_int8.py", line 39, in forward
    output = self.encoder(emb, seg)
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/workspace_fyh/TencentPretrainQuan/tencentpretrain/encoders/transformer_encoder.py", line 142, in forward
    hidden, prev_attn = self.transformer[i](hidden, mask, position_bias=position_bias,
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/workspace_fyh/TencentPretrainQuan/tencentpretrain/layers/transformer.py", line 80, in forward
    output = self.dropout_2(self.feed_forward(output)) + hidden
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/workspace_fyh/TencentPretrainQuan/tencentpretrain/layers/position_ffn.py", line 30, in forward
    gate = self.act(self.linear_gate(x))
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 338, in forward
    ) = F.double_quant(B.to(torch.float16))
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/bitsandbytes/nn/modules.py", line 199, in to
    super().to(
RuntimeError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 14.62 GiB total capacity; 12.67 GiB already allocated; 11.38 MiB free; 13.34 GiB reserved in total by PyTorch)
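The failing frame is F.double_quant(B.to(torch.float16)) inside MatMul8bitLt.forward, which as far as I can tell is the branch where the weight gets cast to fp16 and re-quantized inside the matmul rather than being used in its stored int8 form, so every layer allocates extra fp16 and quantization buffers during the forward. To check whether my layers were really converted, I use a small helper like this (just a debugging sketch; check_int8_weights is my own function, not part of bitsandbytes):

```python
import torch
import bitsandbytes as bnb

def check_int8_weights(model):
    """List Linear8bitLt layers whose weight is not stored as int8 (debugging sketch)."""
    not_quantized = []
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear8bitLt) and module.weight.dtype != torch.int8:
            not_quantized.append((name, module.weight.dtype))
    for name, dtype in not_quantized:
        print(f"{name}: weight still {dtype}")
    return len(not_quantized) == 0

# Int8Params are only quantized when the module is moved to the GPU, so this
# check is meaningful after model.cuda(). Layers that show up here get cast
# to fp16 and quantized inside bnb.matmul during the forward pass, which would
# explain memory growing during generation rather than at load time.
```

Is there anything else I need to set so that the forward pass uses the int8 weights directly, or is this expected behaviour on a 15 GB card?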