kamalkraj / e5-mistral-7b-instruct

Finetune mistral-7b-instruct for sentence embeddings
Apache License 2.0

Inference OOM on an 80G GPU with a 1000-token input sequence #7

Open charliedream1 opened 8 months ago

charliedream1 commented 8 months ago

I got OOM during inference on an 80G GPU with an input sequence length of only 1000. I then tried looping over inputs of length 500, but GPU memory kept increasing until OOM. What is the problem? I tried different versions of transformers, and it did not help.

kamalkraj commented 8 months ago

Share the code to reproduce the issue.

charliedream1 commented 8 months ago

Thanks for the fast reply. I found the cause may be the input text: it contains Chinese and some unreadable garbled characters, and it makes GPU memory keep growing until OOM. The data is private and large, so I can't share it here. However, with the fixed text (both Chinese and English) in the code below, the problem can't be reproduced. Could a special token be causing the problem? The error happens at the attention softmax part, with the OOM shown below.

========================================================================
Error message:

    outputs = model(**batch_dict)
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/miniconda3/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/miniconda3/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 929, in forward
    layer_outputs = decoder_layer(
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/miniconda3/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/miniconda3/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 654, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/miniconda3/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/miniconda3/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 283, in forward
    attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.49 GiB. GPU 1 has a total capacty of 79.11 GiB of which 788.62 MiB is free. Including non-PyTorch memory, this process has 78.33 GiB memory in use. Of the allocated memory 76.03 GiB is allocated by PyTorch, and 1.64 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
0%| | 2/959 [00:02<20:32, 1.29s/it]

========================================================================
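Aside: the allocator hint at the end of that traceback refers to PyTorch's PYTORCH_CUDA_ALLOC_CONF setting. A minimal sketch of trying it; the 128 MB value below is only an example, not a confirmed fix for this issue, and the variable must take effect before the first CUDA allocation:

# Reduce caching-allocator fragmentation, as the error message suggests.
# Ideally set this in the shell before launching Python:
#   export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # example value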

Check code below:

========================================================================

import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

"""
https://huggingface.co/intfloat/e5-mistral-7b-instruct

Notice:
"""


def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'summit define')
]

# No need to add instruction for retrieval documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. "
    "But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. "
    "Check out the chart below to see how much protein you should be eating each day." * 9000,
    "Definition of summit for English Language Learners. : 1 the highest point of a "
    "mountain : the top of a mountain. : 2 the highest level. : 3 a "
    "meeting or series of meetings between the leaders of two or more governments." * 9000
]

input_texts = queries + documents
input_texts = documents

mdl_path = '/mdl/mdl_zoo/intfloat--e5-mistral-7b-instruct'
tokenizer = AutoTokenizer.from_pretrained(mdl_path, trust_remote_code=True)
model = AutoModel.from_pretrained(mdl_path, trust_remote_code=True, torch_dtype=torch.float16)
model.to('cuda')

max_length = 500

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length - 1, return_attention_mask=False,
                       padding=False, truncation=True)

# Append eos_token_id to every input_ids
batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')
batch_dict.to(model.device)

for i in range(500):
    print(i)
    outputs = model(**batch_dict)
    embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())

========================================================================

kamalkraj commented 8 months ago

Instead of trying to generate embeddings for the whole documents in a single pass, try generating an embedding for one sentence at a time, then increase the number of items in the batch until it fails.
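A minimal sketch of that suggestion, assuming the tokenizer, model, and last_token_pool from the script above; embed_batch and the placeholder sentences are hypothetical names introduced here, and the forward pass is wrapped in torch.no_grad() so no autograd graph is kept across iterations:

import torch
import torch.nn.functional as F

def embed_batch(texts, tokenizer, model, max_length=512):
    # Tokenize, append EOS, and pad exactly as in the script above.
    batch = tokenizer(texts, max_length=max_length - 1,
                      return_attention_mask=False, padding=False, truncation=True)
    batch['input_ids'] = [ids + [tokenizer.eos_token_id] for ids in batch['input_ids']]
    batch = tokenizer.pad(batch, padding=True, return_attention_mask=True,
                          return_tensors='pt').to(model.device)
    with torch.no_grad():  # inference only: do not build an autograd graph
        outputs = model(**batch)
    emb = last_token_pool(outputs.last_hidden_state, batch['attention_mask'])
    return F.normalize(emb, p=2, dim=1)

# One sentence per forward pass first, then grow the batch until it fails.
sentences = ["example sentence one", "example sentence two", "example sentence three"]
for batch_size in (1, 2, 4, 8):
    chunks = [embed_batch(sentences[i:i + batch_size], tokenizer, model)
              for i in range(0, len(sentences), batch_size)]
    print(batch_size, torch.cat(chunks).shape)

The no_grad context is standard inference practice: without it, each forward pass keeps activations alive for backpropagation, which sharply raises peak memory.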

charliedream1 commented 8 months ago

The problem can be reproduced with the code below.

problem:

import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

"""
https://huggingface.co/intfloat/e5-mistral-7b-instruct

Notice:
"""


def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


# No need to add instruction for retrieval documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for "
    "women ages 19 to 70 is 46 grams per day. But, as you can see from this "
    "chart, you'll need to increase that if you're expecting or training "
    "for a marathon. Check out the chart below to see how much protein you should be eating each day." * 10000,
]

mdl_path = '/mdl/mdl_zoo/intfloat--e5-mistral-7b-instruct'
tokenizer = AutoTokenizer.from_pretrained(mdl_path, trust_remote_code=True)
model = AutoModel.from_pretrained(mdl_path, trust_remote_code=True, device_map="auto", torch_dtype=torch.bfloat16)

max_length = 8000

# Tokenize the input texts
batch_dict = tokenizer(documents, max_length=max_length - 1, return_attention_mask=False,
                       padding=False, truncation=True)

# Append eos_token_id to every input_ids
batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())

kamalkraj commented 8 months ago

Even though the model supports up to 32K tokens, the length recommended by the author is 4K, and the fine-tuning was done with a max length of 512 tokens.
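A minimal sketch of capping the input at that recommended window, reusing the tokenizer call from the reproduction above; the 4096 value is simply the 4K ceiling mentioned here, not something taken from this repo's code:

# Stay within the recommended window instead of max_length = 8000.
max_length = 4096  # recommended ceiling; use 512 to mirror the fine-tuning length
batch_dict = tokenizer(documents, max_length=max_length - 1,
                       return_attention_mask=False, padding=False, truncation=True)
batch_dict['input_ids'] = [ids + [tokenizer.eos_token_id] for ids in batch_dict['input_ids']]
batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True,
                           return_tensors='pt')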

charliedream1 commented 8 months ago

Thanks, but what would be the reason for the OOM? A 7B model should support 32K length without OOM on an 80G card.
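For rough context, a back-of-the-envelope sketch (an assumption-laden estimate, not a confirmed diagnosis): the eager attention path shown in the traceback materializes a full sequence-by-sequence score matrix per head and per layer, so memory grows quadratically with input length.

# Rough size of one layer's attention score matrix in the eager attention path.
# Assumptions: batch=1, 32 query heads (Mistral-7B), 2 bytes per element (fp16/bf16).
heads, seq_len, bytes_per_el = 32, 8000, 2
scores_gib = heads * seq_len * seq_len * bytes_per_el / 2**30
print(f"{scores_gib:.1f} GiB per layer")  # ~3.8 GiB, before the softmax output copy

FlashAttention-style kernels avoid materializing this matrix, which is why long inputs may fit with attn_implementation="flash_attention_2" but not with the default eager path.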
