tloen / llama-int8

Quantized inference code for LLaMA models
GNU General Public License v3.0

CUDA out of memory #19

Closed fengyh3 closed 1 year ago

fengyh3 commented 1 year ago

Hi, I'm trying to add int8 inference for LLaMA to my own code, but I don't want to edit my original model structure, so I tried something similar to your quantization step here: https://github.com/tloen/llama-int8/blob/ce74669c767e42b5082391dd0cfcb621ba40c7f9/llama/model.py#L286
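
For reference, the kind of layer replacement I mean is sketched below. This is only a minimal sketch assuming bitsandbytes' Linear8bitLt / Int8Params, not my exact code; the function name and details are illustrative.

```python
# Minimal sketch: swap nn.Linear layers for bitsandbytes int8 linears
# (illustrative only; assumes bitsandbytes is installed and CUDA is available)
import torch.nn as nn
import bitsandbytes as bnb

def convert_linear_to_int8(module: nn.Module, threshold: float = 6.0):
    """Recursively replace every nn.Linear with a bitsandbytes Linear8bitLt."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            int8_layer = bnb.nn.Linear8bitLt(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                has_fp16_weights=False,  # keep only the int8 weights after quantization
                threshold=threshold,
            )
            # Wrap the existing weight; it is quantized to int8 when moved to the GPU.
            int8_layer.weight = bnb.nn.Int8Params(
                child.weight.data, requires_grad=False, has_fp16_weights=False
            )
            if child.bias is not None:
                int8_layer.bias = child.bias
            setattr(module, name, int8_layer)
        else:
            convert_linear_to_int8(child, threshold)

# Usage: convert on CPU, then .cuda() performs the actual int8 quantization.
# convert_linear_to_int8(model)
# model.cuda()
```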

First of all, it works: loading the 7B model only uses about 6-7 GB of GPU memory. But during the forward pass, GPU memory grows rapidly and then I hit CUDA out of memory. Have you ever run into this situation? GPU: Tesla T4, 15 GB.

Error trace:

Load model with 6.87GB.
Traceback (most recent call last):
  File "scripts/generate_lm_int8.py", line 112, in
    output = model(src_tensor, seg_tensor)
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "scripts/generate_lm_int8.py", line 39, in forward
    output = self.encoder(emb, seg)
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/workspace_fyh/TencentPretrainQuan/tencentpretrain/encoders/transformer_encoder.py", line 142, in forward
    hidden, prev_attn = self.transformer[i](hidden, mask, position_bias=position_bias,
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/workspace_fyh/TencentPretrainQuan/tencentpretrain/layers/transformer.py", line 80, in forward
    output = self.dropout_2(self.feed_forward(output)) + hidden
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/workspace_fyh/TencentPretrainQuan/tencentpretrain/layers/position_ffn.py", line 30, in forward
    gate = self.act(self.linear_gate(x))
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 338, in forward
    ) = F.double_quant(B.to(torch.float16))
  File "/home/ubuntu/miniconda3/envs/fyh-3.8/lib/python3.8/site-packages/bitsandbytes/nn/modules.py", line 199, in to
    super().to(
RuntimeError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 14.62 GiB total capacity; 12.67 GiB already allocated; 11.38 MiB free; 13.34 GiB reserved in total by PyTorch)
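
In case it helps, this is roughly how I'm checking memory around the forward pass (a sketch only; model, src_tensor, and seg_tensor come from my generation script):

```python
import torch

torch.cuda.reset_peak_memory_stats()
print(f"after load: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

with torch.no_grad():
    output = model(src_tensor, seg_tensor)

print(f"after forward: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"peak during forward: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```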