charliedream1 opened 9 months ago
Share the code to reproduce the issue.
Thanks for the fast reply. I found the input text might be the reason: it contains Chinese plus some unreadable, garbled characters, and GPU memory keeps growing until OOM. The data is private and large, so I can't share it here. However, with fixed text (both Chinese and English) and the code below, the problem can't be reproduced. Could special tokens be causing the problem? The error happens in the attention softmax/matmul, with the OOM shown below.
========================================================================
Error message:
outputs = model(**batch_dict)
File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, *kwargs) File "/home/miniconda3/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward output = module._old_forward(args, kwargs) File "/home/miniconda3/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 929, in forward layer_outputs = decoder_layer( File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, *kwargs) File "/home/miniconda3/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward output = module._old_forward(args, kwargs) File "/home/miniconda3/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 654, in forward hidden_states, self_attn_weights, present_key_value = self.self_attn( File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, *kwargs) File "/home/miniconda3/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward output = module._old_forward(args, kwargs) File "/home/miniconda3/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 283, in forward attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.49 GiB. GPU 1 has a total capacty of 79.11 GiB of which 788.62 MiB is free. Including non-PyTorch memory, this process has 78.33 GiB memory in use. Of the allocated memory 76.03 GiB is allocated by PyTorch, and 1.64 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF 0%| | 2/959 [00:02<20:32, 1.29s/it]
========================================================================
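The allocator hint at the end of that message can be tried without touching the model code; a minimal sketch, assuming the setting is applied before the first CUDA allocation (the 512 value is only an illustrative choice, not something from this thread):

import os

# Must be set before the first CUDA tensor is allocated so the caching
# allocator picks it up; smaller split sizes reduce fragmentation of the
# reserved-but-unallocated memory mentioned in the error message.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"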
Check code below:
========================================================================
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

"""
https://huggingface.co/intfloat/e5-mistral-7b-instruct

Notice:
"""


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'summit define')
]
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. "
    "But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. "
    "Check out the chart below to see how much protein you should be eating each day." * 9000,
    "Definition of summit for English Language Learners. : 1 the highest point of a "
    "mountain : the top of a mountain. : 2 the highest level. : 3 a "
    "meeting or series of meetings between the leaders of two or more governments." * 9000
]
input_texts = documents

mdl_path = '/mdl/mdl_zoo/intfloat--e5-mistral-7b-instruct'
tokenizer = AutoTokenizer.from_pretrained(mdl_path, trust_remote_code=True)
model = AutoModel.from_pretrained(mdl_path, trust_remote_code=True, torch_dtype=torch.float16)
model.to('cuda')

max_length = 500

batch_dict = tokenizer(input_texts, max_length=max_length - 1, return_attention_mask=False, padding=False, truncation=True)
batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')
batch_dict.to(model.device)

# repeat the forward pass to watch GPU memory usage
for i in range(500):
    print(i)
    outputs = model(**batch_dict)
    embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
Instead of trying to generate embeddings for the whole documents in a single pass, try generating the embedding for one sentence at a time, then increase the number of items per batch until it fails.
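A minimal sketch of that suggestion, reusing tokenizer, model, and last_token_pool from the code above (the helper name and the batch size / max length defaults are illustrative, not from the repo):

import torch

def encode_in_batches(texts, tokenizer, model, batch_size=1, max_length=512):
    # Encode a few texts at a time instead of the whole corpus in one pass;
    # start with batch_size=1 and raise it until memory runs out.
    all_embeddings = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        batch = tokenizer(chunk, max_length=max_length - 1, padding=False,
                          truncation=True, return_attention_mask=False)
        batch['input_ids'] = [ids + [tokenizer.eos_token_id] for ids in batch['input_ids']]
        batch = tokenizer.pad(batch, padding=True, return_attention_mask=True,
                              return_tensors='pt').to(model.device)
        with torch.no_grad():  # inference only, so no autograd graph is kept
            out = model(**batch)
        all_embeddings.append(last_token_pool(out.last_hidden_state, batch['attention_mask']))
    return torch.cat(all_embeddings, dim=0)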
Problem:
Reproduction code below:
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

"""
https://huggingface.co/intfloat/e5-mistral-7b-instruct

Notice:
"""


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


documents = [
    "As a general guideline, the CDC's average requirement of protein for "
    "women ages 19 to 70 is 46 grams per day. But, as you can see from this "
    "chart, you'll need to increase that if you're expecting or training "
    "for a marathon. Check out the chart below to see how much protein you should be eating each day." * 10000,
]

mdl_path = '/mdl/mdl_zoo/intfloat--e5-mistral-7b-instruct'
tokenizer = AutoTokenizer.from_pretrained(mdl_path, trust_remote_code=True)
model = AutoModel.from_pretrained(mdl_path, trust_remote_code=True, device_map="auto", torch_dtype=torch.bfloat16)

max_length = 8000

batch_dict = tokenizer(documents, max_length=max_length - 1, return_attention_mask=False, padding=False, truncation=True)
batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
Even though the model supports up to 32K tokens, the length recommended by the author is 4K, and the model's fine-tuning was done with a max length of 512 tokens.
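For reference, staying inside the recommended window is a small change to the reproduction script above (4096 follows the 4K figure in this reply and is a suggested value, not one taken from the repo):

max_length = 4096  # recommended window per the reply above; fine-tuning used 512
batch_dict = tokenizer(documents, max_length=max_length - 1, return_attention_mask=False,
                       padding=False, truncation=True)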
Thanks, but what would be the reason for the OOM? A 7B model should support a 32K sequence length without OOM on an 80G card.
I got OOM during inference on an 80G GPU with an input sequence length of only 1000. I then tried looping over inputs 500 tokens long and found that GPU memory keeps increasing until OOM. What's the problem? I tried different versions of transformers, and it didn't help.
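One common cause of memory growing across iterations like this is that each forward pass builds and retains an autograd graph when gradients are enabled; a minimal sketch of the same loop with the forward pass under torch.no_grad() (variables as in the code above):

with torch.no_grad():  # no autograd graph, so per-step activations are freed immediately
    for i in range(500):
        outputs = model(**batch_dict)
        embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])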