huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Pipeline seems slower in 4.11+ #14125

Closed · Dref360 closed this issue 2 years ago

Dref360 commented 2 years ago

Hello! When I upgraded Transformers, I got a massive slowdown. Might be related to the new DataLoader used in Pipeline.

Happy to help!

Cheers,

Model I am using (Bert, XLNet ...): DistilBert, but I suspect this affects all pipelines.


To reproduce

Steps to reproduce the behavior:

  1. I use the following script to predict on some random sentences:
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TextClassificationPipeline,
)

def get_pipeline():
    name = "distilbert-base-uncased-finetuned-sst-2-english"
    model = AutoModelForSequenceClassification.from_pretrained(name)
    tokenizer = AutoTokenizer.from_pretrained(name)
    return TextClassificationPipeline(tokenizer=tokenizer, model=model)

sentence = ["hello", "goodbye"] * 100
model = get_pipeline()
  2. The results I get are widely different between Transformers 4.10 and 4.11+ (a plain-Python version of the measurement is sketched after the table):
Version        | Command                                      | Time
HF 4.12.0.dev0 | %timeit -n 3 model(sentence)                 | Does not complete after 10 minutes
HF 4.12.0.dev0 | %timeit -n 3 model(sentence, num_workers=0)  | 4.67 s ± 153 ms per loop
HF 4.10.3      | %timeit -n 3 model(sentence)                 | 575 ms ± 10.8 ms per loop
HF 4.10.3      | %timeit -n 3 model(sentence, num_workers=0)  | 500 ms ± 3.01 ms per loop
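
For readers not in IPython, a minimal sketch of the same measurement using time.perf_counter instead of %timeit; the pipeline() factory call below is assumed to build the same text-classification pipeline as the script above.

import time
from transformers import pipeline

# Equivalent to the TextClassificationPipeline built in the script above.
pipe = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
sentence = ["hello", "goodbye"] * 100

start = time.perf_counter()
pipe(sentence, num_workers=0)  # same call as in the table
print(f"elapsed: {time.perf_counter() - start:.2f} s")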

Expected behavior

I would expect the same performance if possible, or a way to bypass the PyTorch DataLoader.

Narsil commented 2 years ago

Hi @Dref360 ,

First of all, thanks for the script and benchmarks, very helpful. Your results are entirely correct and are reproducible.

Short answer: PR https://github.com/huggingface/transformers/pull/13724 should solve your specific use case with pipeline(sentence, batch_size=100).
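
For reference, with a transformers build that includes that PR, the call from the reproduction script would look something like the sketch below; batch_size is the new call argument from that PR, and 100 is just the value suggested above.

# Sketch: requires a transformers version that includes PR #13724.
model = get_pipeline()                     # from the reproduction script above
sentence = ["hello", "goodbye"] * 100

results = model(sentence, batch_size=100)  # forward the 200 inputs in batches of 100
print(results[:2])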

Long answer:

Your example is slightly odd in that it is essentially a single token repeated 200 times, so batching yields better results than not batching. If you use longer sentences, you can get better or worse performance:

sentence = ["hello there this is a test" * 20, "goodbye"] * 100 for instance takes

12s on 4.10.3 8s on 4.11+

sentence = ["hello there this is a test" * 20, "goodbye " * 10] * 100 for instance takes

11s on 4.10.3 11s on 4.11+

On random strings this tends to average out, leading to comparable performance in our internal testing, which is why we are NOT batching by default anymore. The place where batching matters most is the GPU (not your case), but the 4.10 GPU performance was pretty bad anyway, because the pipeline API didn't allow for proper streaming.

This is the perfect example of a case where batching yields vastly faster results, but it might not be representative of other workloads (a caveat for readers: always measure performance on your own models/data to find what works best).

On longer sequences the matrix multiplications get larger, and batching does not give the CPU better throughput than running without it (GPUs do get the benefit for larger payloads).
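
To make the "measure on your own models/data" advice concrete, here is a rough sketch that times a few batch sizes on a sample of real inputs; batch_size again assumes a build that includes the PR above, and the sentences below are only placeholders.

import time
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Replace with a representative sample of the data you actually serve.
sentences = ["hello there this is a test" * 20, "goodbye " * 10] * 100

for batch_size in (1, 8, 32, 100):
    start = time.perf_counter()
    pipe(sentences, batch_size=batch_size)
    print(f"batch_size={batch_size}: {time.perf_counter() - start:.2f} s")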

For core maintainers (@LysandreJik, @sgugger): with the proposed PR for batch support in pipelines, we could also take the opportunity to use batch_size = len(input_list) when the input is a list, to restore the previous behavior on pipelines that did batch. Pipelines that didn't batch might start to fail, because some models have no padding token (like text-generation with gpt2).

We could also, as with the overall inputs/outputs, add a way for pipelines to express whether they batch by default, to stay as backward compatible as possible in terms of performance. I am not sure it is worth it (as mentioned in this comment, performance might have improved on other models/data), but it is definitely an option.

This was definitely something I overlooked when I observed similar performance; I didn't look at these kinds of inputs, where it does make a difference.

Dref360 commented 2 years ago

Ah! Thank you for the quick and very detailed response.

My use case is mostly short sentences, so that must be why we saw such a massive slowdown.

Thank you for your help! We will wait for this PR to be merged :)

alwayscurious commented 2 years ago

@Narsil, in your comment (the long answer) you mention that GPU performance was poor for transformers pipelines in previous versions (4.10 or earlier). I'm currently using 4.12.0 and observe that the GPU isn't fully utilized. I'm using a sentiment analysis Hugging Face model, https://huggingface.co/yiyanghkust/finbert-tone, with the following setup:

Environment:

Hardware: Azure Node Type: Standard_NC8as_T4_v3


I'm running the following code:

from transformers import BertTokenizer, BertForSequenceClassification
from transformers import pipeline

finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone',num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')

nlp = pipeline("sentiment-analysis", model=finbert, tokenizer=tokenizer, device=0)

sentences = ["there is a shortage of capital, and we need extra financing",  
             "growth is strong and we have plenty of liquidity", 
             "there are doubts about our finances", 
             "profits are flat"]*1000

results = nlp(sentences)
print(results[:4])

Running gpustat (https://pypi.org/project/gpustat/) on the node while the above code is running (which takes about 35 seconds) shows the following:

1026-105747-9wr7sbr7-10-139-64-13 Fri Oct 29 09:32:34 2021 450.80.02 [0] Tesla T4 | 49'C, 39 % | 1916 / 16127 MB |

Here you see that only 39% of the GPU is used. Is there a reason why it isn't near 100%?

For comparison, the same model implemented with pytorch_pretrained_bert only (see https://github.com/yya518/FinBERT/blob/master/FinBert%20Model%20Example.ipynb) runs significantly faster on the same input sentences (approx. 5 seconds). Here is the GPU usage:

1026-105747-9wr7sbr7-10-139-64-17 Fri Oct 29 16:42:58 2021 450.80.02 [0] Tesla T4 | 45'C, 96 % | 2258 / 16127 MB |

With this approach the GPU usage is close to full capacity. Although I only tested this model, I suspect inference on a GPU with other transformer models will also underutilize the GPU.

Will the ability to set a batch size greater than 1 via the PR help with this? I see that PR #13724 has been merged. When can we expect the next release?

Thanks!

Narsil commented 2 years ago

Hi @alwayscurious ,

  1. The linked notebook does not have the * 1000, which effectively kills the measurement. Is that just an omission, or does it change the results? The following assumes it actually modifies the results.
  2. In my modified version of your test (I used the same model as the pipeline example, with the * 1000 added back), I get 100% GPU usage, but it takes 3 min to run the full thing, while the pipeline example takes 35 s. GPU usage is not everything here :).
  3. You are perfectly correct that the GPU is underused in the pipeline example, and on master we can push it with pipeline(sentences, batch_size=64). Increasing the batch size improves speed quite quickly, and at some point bigger batches stop being worth it (basically once you saturate the GPU). With that, the full thing runs in under 5 s on my home GTX 970.

You are seeing 100% GPU usage but much lower speed in your colab example because all your examples are padded to the max length of 512, so the examples are effectively very large for the GPU (keeping it busy) while it is mostly doing useless work (hence 3 min instead of 35 s).

The ~50% GPU utilization of the first example is because the example + model is a bit small, so not all of the GPU is required, meaning part of the GPU sits idle. However, it still runs faster than the "old" example because it is not wasting cycles on the padded tokens. If I remove the padding, I fall back to roughly the ~35 s mentioned above. On larger models there would probably still be a difference linked to how the data is actually fed to the GPU, but that is out of scope for this discussion.

By adding pipeline(sentences, batch_size=64) I am getting a 5 s runtime for the inference.
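
For completeness, relative to the snippet earlier in the thread the only change is in the call. This is a sketch and assumes a transformers build that includes the batch_size call argument (master at the time of writing); finbert, tokenizer, and sentences are the objects defined in that snippet.

# Same pipeline as in the snippet above; only the call changes.
nlp = pipeline("sentiment-analysis", model=finbert, tokenizer=tokenizer, device=0)

results = nlp(sentences, batch_size=64)  # feed the GPU 64 sentences at a time
print(results[:4])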

On a T4 you might be able to push the batch size even further. However, I always tell users to be careful: mock data and real data are likely to behave differently. By adding more to the batch, you risk OOM errors on live data, where sequences might be max_seq_len long and the whole batch therefore much bigger. Even before OOM, if the data is highly irregular in size, batching can hinder performance instead of helping it, just like in the notebook, where it fills your batch with pad tokens. See this for the discussion: https://github.com/huggingface/transformers/blob/master/docs/source/main_classes/pipelines.rst.
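
One rough way to keep an eye on the padding overhead mentioned above is the sketch below; it is not part of the pipeline API, it just tokenizes a batch and looks at the attention mask. The example batch is a placeholder.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')

# Replace with a real batch of your data.
batch = ["growth is strong and we have plenty of liquidity",
         "profits are flat"] * 32

# padding=True pads every sequence to the longest one in the batch;
# attention_mask is 1 on real tokens and 0 on padding positions.
enc = tokenizer(batch, padding=True, return_tensors="pt")
pad_fraction = 1.0 - enc["attention_mask"].float().mean().item()
print(f"fraction of the batch that is padding: {pad_fraction:.1%}")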

Another knob you can turn is pipeline(..., num_workers=10): the number of worker processes used to feed data to the GPU (it is forwarded to the DataLoader), which might also help depending on your model/data configuration (the rule of thumb is num_workers = number of CPU cores).
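
A minimal illustration of that knob, reusing the nlp pipeline and sentences from the snippet above; os.cpu_count() is just the rule-of-thumb starting point, so measure whether it actually helps.

import os

# num_workers is forwarded to the underlying torch DataLoader.
results = nlp(sentences, batch_size=64, num_workers=os.cpu_count())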

Did I omit anything in this analysis?

Dref360 commented 2 years ago

I don't see the issue on master. Thank you!

alwayscurious commented 2 years ago

@Narsil thanks for your insightful reply! Indeed, after fixing a bug in the original "old" code (I was using a previous version of the notebook and realized it had a bug compared to the latest one in the link), I also observed the approx. 35 seconds with transformers vs. approx. 3 minutes with the "old" code that you mention. I used the latest release of transformers (4.12.2).

Regarding the input setup, you are correct about the * 1000; I forgot to add it to the notebook (I only included it in the code snippet). From the release notes of 4.12.2, I see that batch_size is included: https://github.com/huggingface/transformers/compare/v4.12.2...master. Can I use this version, or do I need to build the transformers package manually from master? I set the batch size to 64 but still see approx. 35 seconds for inference, compared to the approx. 5 seconds you observe on your GTX 970. I'll set up a colab notebook with a GPU runtime to verify.

Thanks again!

alwayscurious commented 2 years ago

@Narsil, triggered by @Dref360's comment, I realized that running the following:

!pip install git+https://github.com/huggingface/transformers.git

installs a package built directly from the latest commits on master. I verified the performance you observed with a batch size of 64 (approx. 7 s on a K80 GPU). I've included a link to the notebook for reference.

Huggingface FinBertTone Model performance on GPU

Thanks again for your help and the reference! :)

Narsil commented 2 years ago

@alwayscurious glad to be of help.

Again, batch_size depends on data + model + hardware, so try to keep track of some measure if possible (GPU utilization is the easy one; the amount of padding is another, but measuring everything will slow you down, so... :)).

Automated batch_size selection is something we would like to enable, but it's quite tricky and maybe not worth it. At least now you are in control.

Dref360 commented 2 years ago

I'll close the issue now that it is merged on master.

Cheers!