huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

GPU memory getting out of bound #3141

Closed mainulquraishi closed 4 years ago

mainulquraishi commented 4 years ago

I am trying to run the pre-trained small GPT-2 model with a language modeling head with batch size 16. The problem is that after each iteration about 440 MB of memory is allocated, and the GPU quickly runs out of memory. I am not running the pre-trained model in training mode. In my understanding, from the second iteration onwards each iteration feeds in a single token per sequence (16 tokens for batch size 16), the new attention is computed, and the past variable is updated and grows by 16 tokens. So a little memory growth is expected, but I don't understand why it is almost half a gigabyte. I ran the following code to measure the memory usage in each iteration:

before = torch.cuda.max_memory_allocated(device=device)
output, past = model(b_train_contexts, past=past)
after = torch.cuda.max_memory_allocated(device=device)
print("memory")
print(after - before)

Output:

memory
0
memory
270742528
memory
442328576
memory
443433472
memory
444525056
memory
445629952
memory
446721536
memory
447826432
memory
448918016
.
.
.
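
One caveat about the measurement above: torch.cuda.max_memory_allocated returns the peak allocation since the start of the program, so the printed differences track how much the high-water mark grows rather than the memory actually in use at each step. A minimal sketch of logging the current allocation instead, assuming the same model, context, past, and device variables as above:

past = None
for i in range(10):
    before = torch.cuda.memory_allocated(device=device)
    output, past = model(context, past=past)
    after = torch.cuda.memory_allocated(device=device)
    # memory_allocated reports tensors currently held, not the running peak
    print("step", i, "allocated delta (MiB):", (after - before) / 2**20)
    context = torch.argmax(output[..., -1, :], dim=1).view(context.shape[0], -1)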
LysandreJik commented 4 years ago

Hi, could you provide a reproducible example so that we may test on our side?

mainulquraishi commented 4 years ago

Thank you for your reply. Here is the code; my 32 GB GPU runs out of memory before 500 iterations.

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained('gpt2')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

text="The Manhattan Bridge is a suspension bridge that crosses the East River in New York City, connecting Lower Manhattan at Canal Street with Downtown Brooklyn at the Flatbush Avenue Extension. The main span is 1,470 ft (448 m) long, with the suspension cables being 3,224 ft (983 m) long. The bridge's total length is 6,855 ft (2,089 m). It is one of four toll-free vehicular bridges connecting Manhattan Island to Long Island; the nearby Brooklyn Bridge is just slightly further downtown, while the Queensboro and Williamsburg Bridges are to the north."

generated1 = tokenizer.encode(text)
generated2 = tokenizer.encode(text)
context = torch.tensor([generated1, generated2])
context = context.to(device)
print(context.shape)
past = None

for i in range(500):
    before = torch.cuda.max_memory_allocated(device=device)
    output, past = model(context, past=past)
    after = torch.cuda.max_memory_allocated(device=device)
    print(after - before)
    token = torch.argmax(output[..., -1, :], dim=1)
    context = token.view(2, -1)

If I use a small initial context, it survives, but the problem appears when I use a long initial context. Please try with a small initial context and you will see the difference in memory allocation per iteration.

LysandreJik commented 4 years ago

I would guess this is because the past requires a lot of memory to store: it speeds up sequential decoding, but at a significant memory cost. Your script crashes for me at iteration 483, whereas a script that doesn't make use of the past can reach the maximum length of 1024 tokens on my 24 GB of VRAM.

Dropping the past when it becomes too large may be a good idea, the same as you would do if it went over the max sequence length; a rough sketch of what that could look like follows below.
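
For concreteness, here is a minimal sketch of one way to "drop the past": cap the cache at some budget and, once it is exceeded, discard it and re-encode only the most recent tokens. It reuses model and context from the script above; MAX_PAST_TOKENS and the bookkeeping are illustrative choices, not a library setting.

MAX_PAST_TOKENS = 512  # illustrative budget, tune to your GPU

past = None
generated = context      # (batch, seq_len) tensor of all token ids so far
step_input = context     # what is actually fed to the model each step

for _ in range(500):
    output, past = model(step_input, past=past)
    next_tokens = torch.argmax(output[..., -1, :], dim=1).view(generated.shape[0], -1)
    generated = torch.cat([generated, next_tokens], dim=1)

    if generated.shape[1] > MAX_PAST_TOKENS:
        # Drop the cache and re-encode only the most recent window next step,
        # just as you would when exceeding the model's max sequence length.
        past = None
        generated = generated[:, -MAX_PAST_TOKENS:]
        step_input = generated
    else:
        step_input = next_tokens

The trade-off is that after the budget is reached each step re-encodes the whole window instead of a single token, so you exchange memory for compute.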

mainulquraishi commented 4 years ago

Hi, thanks for the reply. By "a script that does not make use of the past", do you mean that in each iteration the input is (previous context + generated token ids)?

I tried the following code. For batch size 8 it works, with no out-of-memory error, but for batch size 16 the error comes back.

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained('gpt2')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name()

model = model.to(device)
model.eval() 

text="Construction began on the bridge in 1901 under the instruction of the New York City Department of Bridges commissioner Gustav Lindenthal and the chief engineer R.S. Buck. Just three years later, however, local politicking was responsible for the pair being replaced with George E. Best and Othniel Foster Nichols, respectively. The bridge design was based on deflection theory, a new concept at the time that was developed by Joseph Melan and applied to the bridge by the chief engineer Leon Moisseiff. This design saved in cost, material, and construction time. The bridge was officially opened to traffic on Dec. 31, 1909. Renovations in 1940 revealed significant wear on the structure, with the subway trains partly responsible for the wear. Those trains, upon entering the bridge at the same time from opposite sides, would cause the bridge to shift up to 8 feet (approximately 2.5 metres). Additional renovations were undertaken in 1978. Since then the Manhattan Bridge has been featured in movies, has undergone regular repairs and retrofitting, and remains one of the most graceful bridges in New York City."
generated1 = tokenizer.encode(text)
generated2 = tokenizer.encode(text)
generated3 = tokenizer.encode(text)
generated4 = tokenizer.encode(text)
generated5 = tokenizer.encode(text)
generated6 = tokenizer.encode(text)
generated7 = tokenizer.encode(text)
generated8 = tokenizer.encode(text)
# generated9= tokenizer.encode(text)
# generated10=tokenizer.encode(text)
# generated11= tokenizer.encode(text)
# generated12=tokenizer.encode(text)
# generated13= tokenizer.encode(text)
# generated14=tokenizer.encode(text)
# generated15= tokenizer.encode(text)
# generated16=tokenizer.encode(text)

context=torch.tensor([generated1,generated2,generated3,generated4,generated5,generated6,generated7,generated8])
# context =generated
# generated =generated.to(device)
context =context.to(device)
print(context.shape)

import time
batch_size=8
start_time = time.time()
for i in range(500):
    output, past = model(context)
    new_tokens = torch.argmax(output[..., -1, :], dim=1)
    new_tokens = new_tokens.view(batch_size, -1)
    context = torch.cat([context, new_tokens], dim=1)
elapsed_time = time.time() - start_time
print("time")
print(elapsed_time)
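
One thing worth noting, as an aside: since this is pure inference, wrapping the loop in torch.no_grad() keeps PyTorch from retaining activations for backpropagation, which otherwise adds substantially to the per-iteration memory. A sketch of the same loop under that assumption:

with torch.no_grad():
    for i in range(500):
        output, past = model(context)
        new_tokens = torch.argmax(output[..., -1, :], dim=1)
        context = torch.cat([context, new_tokens.view(batch_size, -1)], dim=1)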
stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ecolss commented 1 year ago

What did you mean by dropping the past? Any example?