huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

T5 GPU Runtime Degradation #10142

Closed · dsgissin closed this issue 3 years ago

dsgissin commented 3 years ago

Environment info

- transformers version: 4.2.1 (compared against 3.4.0)
- Platform: Google Colab
- GPU: K80

Who can help

@patrickvonplaten, @patil-suraj

Information

Model I am using (Bert, XLNet ...): T5

Hello,

I’ve noticed that the running time of T5 on a GPU has increased between v3.4.0 and the current version (v4.2.1). When running inference on a K80 GPU (Google Colab), the average runtime of a generate() call for a single example (the one in the transformers documentation) with t5-base is 539 ± 13 ms in v3.4.0, versus 627 ± 13 ms in v4.2.1. On t5-large, the runtimes are 1004 ± 22 ms versus 1242 ± 15 ms, respectively.

I made two Colab notebooks that compare the two versions:
https://colab.research.google.com/drive/1Rm9RFdfLUFFHOvjAOg816-6oXw8zm_tE?usp=sharing#scrollTo=eeJ0sS_g7-X2
https://colab.research.google.com/drive/1U2QPA4MR48xPCpn4XiG5KBk3qZGYeoIJ?usp=sharing
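
For reference, the two notebooks are identical apart from the pinned transformers version; the install cells look roughly like this (my paraphrase of the notebooks, not a verbatim copy):

# First Colab cell of each notebook (sentencepiece is needed by the T5 tokenizer)
!pip install transformers==3.4.0 sentencepiece   # older-version notebook
!pip install transformers==4.2.1 sentencepiece   # newer-version notebook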

I’m aware of at least one bug fix that was made to the attention mechanism of T5 in v4.0.0 (#8158), but I don’t think that change should have caused such a degradation. Any idea why it occurred?

Thanks!

To reproduce

See the Colab notebooks linked above, as well as the following code snippet:

import time

import numpy as np
import torch
import transformers
from transformers import T5ForConditionalGeneration, T5TokenizerFast

# Use the GPU when one is available, otherwise fall back to CPU.
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
print(f"Using device: {device}")

t5_tokenizer = T5TokenizerFast.from_pretrained("t5-base")
t5_model = T5ForConditionalGeneration.from_pretrained("t5-base").to(device)

# The single-example input from the transformers documentation (batch size 1).
t5_input_ids = t5_tokenizer(
    "summarize: studies have shown that owning a dog is good for you ",
    return_tensors="pt",
).input_ids.to(device)

# Time N generate() calls and report the mean and standard deviation.
N = 100
times = []
for _ in range(N):
    start = time.time()
    t5_outputs = t5_model.generate(t5_input_ids)
    end = time.time()
    times.append(end - start)

print(f"transformers version: {transformers.__version__}")
print(f"torch version: {torch.__version__}")
print(f"{1000*np.mean(times):.0f} ms \u00B1 {1000*np.std(times):.2f} ms per loop (mean \u00B1 std of {N} runs)")
patrickvonplaten commented 3 years ago

Thanks a lot for this issue @dsgissin! Will take a look this week!

dsgissin commented 3 years ago

Hey! Did you get a chance to look into the runtime degradation?

Thanks

patrickvonplaten commented 3 years ago

Looking now! Sorry for the delay

patrickvonplaten commented 3 years ago

Okay, I can reproduce the degradation! Will try to fix it today.

patrickvonplaten commented 3 years ago

I think this PR should fix it: https://github.com/huggingface/transformers/pull/10496

Let me know if you still encounter a degradation!
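
Once the PR is merged, one way to pick up the fix before the next release (my suggestion, assuming a pip-based setup) is to install from source:

# Install the current master branch, which will include the merged fix
pip install git+https://github.com/huggingface/transformers.git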

Thanks a mille for spotting this degradation - you probably now made T5 faster for the whole community :-)

dsgissin commented 3 years ago

Great, thanks a lot for the quick fix!