stanford-crfm / BioMedLM


Generation is suspiciously slow for long sequences #23

Open avivbrokman opened 11 months ago

avivbrokman commented 11 months ago

I am trying to use BioMedLM for generation, but I find that it is very slow at generating long sequences. Training runs at a normal speed. I wrote a minimal program (below) to reproduce this, comparing BioMedLM against GPT-2 XL (1.5B parameters) and Flan-T5-XL (3B parameters). I varied the maximum generation length and estimated the ratio of the two decoder models' durations (BioMedLM divided by GPT-2):

1024 tokens: 5.9
512 tokens: 3.2
256 tokens: 1.9
128 tokens: 1.3
64 tokens: 1.01

Anecdotally, the generation speed is similar to that of Flan UL2, a 20B parameter model.

I'd like to fix this, but I don't know whether the issue is in the BioMedLM code, my software/environment versions and settings, or my hardware (an A100-80GB).

import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM
from datetime import datetime

# settings
max_length = 1024

# text
text = 'SRY1 phosphorylates'

# flan-t5-xl - 3B - encoder-decoder model
checkpoint = 'google/flan-t5-xl'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

inputs = tokenizer(text, return_tensors = 'pt')

model = model.to('cuda')
inputs = inputs.to('cuda')

t0 = datetime.now()
output = model.generate(**inputs, max_length = min(512, max_length))
t1 = datetime.now()

print('flan-t5 generation length: ', len(output[0]))
print('flan-t5 duration: ', t1 - t0)

# gpt2 - 1.5B - decoder model
checkpoint = 'gpt2-xl'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer(text, return_tensors = 'pt')

model = model.to('cuda')
inputs = inputs.to('cuda')

t2 = datetime.now()
output = model.generate(**inputs, max_length = max_length)
t3 = datetime.now()

print('GPT-2 generation length: ', len(output[0]) - inputs['input_ids'].size(1))
print('GPT-2 duration: ', t3 - t2)

# BioMedLM - 2.7B - decoder model
checkpoint = 'stanford-crfm/BioMedLM'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer(text, return_tensors = 'pt')

model = model.to('cuda')
inputs = inputs.to('cuda')

t4 = datetime.now()
outputs = model.generate(**inputs, max_length = max_length)
t5 = datetime.now()

print('BioMedLM generation length: ', len(outputs[0]) - inputs['input_ids'].size(1))

print('BioMedLM duration: ', t5 - t4)
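
A caveat with the timing above: the first generate call after loading a model also pays one-off CUDA warm-up costs, and datetime alone does not force GPU synchronization. Here is a rough sketch of a fairer harness (the benchmark_generate helper is just something I wrote for this issue, not a library function):

import torch
from datetime import datetime
from transformers import AutoTokenizer, AutoModelForCausalLM

def benchmark_generate(checkpoint, text, max_length, n_runs = 3):
    # Load a causal LM, do one untimed warm-up generation, then report
    # the number of new tokens and the average duration over n_runs.
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint).to('cuda')
    inputs = tokenizer(text, return_tensors = 'pt').to('cuda')

    model.generate(**inputs, max_length = 64)  # warm-up, excluded from timing
    torch.cuda.synchronize()

    t0 = datetime.now()
    for _ in range(n_runs):
        output = model.generate(**inputs, max_length = max_length)
    torch.cuda.synchronize()
    duration = (datetime.now() - t0) / n_runs

    new_tokens = output[0].size(0) - inputs['input_ids'].size(1)
    return new_tokens, duration

for checkpoint in ['gpt2-xl', 'stanford-crfm/BioMedLM']:
    new_tokens, duration = benchmark_generate(checkpoint, 'SRY1 phosphorylates', 1024)
    print(checkpoint, '-', new_tokens, 'new tokens in', duration)
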
J38 commented 11 months ago

Ok, I'll do some experiments too and get back to you. Just to double-check: you are giving the same prompt to GPT-2 and BioMedLM, running generate, and those numbers are the ratio between the two models?

Just this week I have been spending a lot of time working on BioMedLM's generative abilities for downstream tasks ... I actually feel it is most useful for scenarios like reading a PubMed abstract and printing out a list of relations derived from the abstract ...

J38 commented 11 months ago

BioMedLM out of the box should literally be running the same code as GPT-2, since it is just a GPT-2 model with different weights and a different tokenizer ... it has a smaller vocabulary than GPT-2 ... we could also compare to GPT-Neo 2.7B ...
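
If you want to confirm that on your end, a sketch like this would do it (just AutoConfig, nothing BioMedLM-specific assumed):

from transformers import AutoConfig

# Compare the architecture hyperparameters of GPT-2 XL and BioMedLM.
# Both should report the same GPT-2 model type; the expected differences
# are vocab_size plus the layer count / hidden size of a 2.7B model.
for checkpoint in ['gpt2-xl', 'stanford-crfm/BioMedLM']:
    config = AutoConfig.from_pretrained(checkpoint)
    print(checkpoint)
    print('  model_type :', config.model_type)
    print('  n_layer    :', config.n_layer)
    print('  n_embd     :', config.n_embd)
    print('  n_positions:', config.n_positions)
    print('  vocab_size :', config.vocab_size)
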

J38 commented 11 months ago

And what exactly are the inputs --> outputs? Are BioMedLM and GPT-2 XL producing text of similar length, or is there a difference in average output length? I don't think setting max_length necessarily determines the average output length, so if one model tends to produce longer responses to inputs, it could take longer?
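
One way to check is to count only the newly generated tokens and see whether generation stopped at an EOS token or ran all the way to max_length. A minimal sketch, reusing the model, tokenizer, inputs, and max_length variables from your script (run once per decoder model):

# Count tokens generated beyond the prompt and check how generation ended:
# either the model emitted its EOS token, or it hit the max_length cap.
output = model.generate(**inputs, max_length = max_length)
prompt_length = inputs['input_ids'].size(1)
new_tokens = output[0][prompt_length:]
print('generated tokens:', new_tokens.size(0))
print('stopped at EOS  :', new_tokens[-1].item() == tokenizer.eos_token_id)
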

avivbrokman commented 11 months ago

Just to double check, you are giving the same prompt to GPT-2 and BioMedLM and running generate and those numbers are the ratio between the 2 models?

Yes, to both

I actually feel it is most useful for scenarios like reading a PubMed abstract and printing out a list of relations derived from the abstract

I laughed when I read this, because that's exactly what I'm doing. I just wanted to provide a minimal example.

BioMedLM out of the box should just literally be running the same code as GPT-2 since it is just a GPT-2 model with different weights and different tokenizer

This is what I expected—and why I'm confused about the difference in speed.

Are BioMedLM and GPT-2 XL producing text of similar length or is there a difference in average output length?

For my minimal example, they produce lengths within 2 tokens of each other, so I don't think sequence length accounts for it (my code also prints the number of generated tokens). I'm guessing the small length difference comes down to special tokens.
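
To follow up on that guess (this is just a hunch, not a confirmed cause), I plan to compare the tokenizers' special tokens and also check that the KV cache is enabled during generation, since generating without use_cache re-encodes the whole prefix at every step and would hit long sequences hardest:

from transformers import AutoTokenizer, AutoConfig

# Compare special tokens and the use_cache setting of the two models.
# If use_cache were disabled, each generation step would reprocess the
# entire prefix, which slows long sequences down disproportionately.
for checkpoint in ['gpt2-xl', 'stanford-crfm/BioMedLM']:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    config = AutoConfig.from_pretrained(checkpoint)
    print(checkpoint)
    print('  special tokens:', tokenizer.special_tokens_map)
    print('  use_cache     :', getattr(config, 'use_cache', None))

# If the cache turns out to be off, it can be requested explicitly:
# output = model.generate(**inputs, max_length = max_length, use_cache = True)
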