Batching / Transformer-based Parallelization for Cross-Encoders

bmw-friedrich-mayr commented 2 years ago

I opened this feature request based on this Slack thread and @jobergum's comment.

As mentioned by many Vespa users in Slack, it would be awesome to have an example of batching for cross-encoders (maybe even with 100 documents for reranking, as batching speeds up cross-encoders tremendously).

As you probably noticed cross-encoders are very slow in Vespa. I know cross-encoders are considerably slower than bi-encoders, but I think Vespa misses a crucial technique that speeds them up tremendously: batching. Transformers are able to process multiple inputs in parallel. As far as I can tell, Vespa feeds only one sample at a time to a cross-encoder, which is detrimental to performance.

Parallel calculation with Hugging Face:

cross_model = '<MODEL>'
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained(cross_model)
tokenizer = AutoTokenizer.from_pretrained(cross_model)
model.eval()

query = "How many citizens has Berlin?"
docs = ["Berlin has a population of 3.5 million in 2016, up slightly from 3.4 million in 2014", ...]
features = tokenizer([query for doc in docs], docs, padding=True, truncation=True, return_tensors="pt", max_length=128)

with torch.no_grad():
    scores = model(**features).logits
    print(scores)

jobergum commented 2 years ago

@bmw-friedrich-mayr

I've started on this work, and a reranking-in-container example will be added to the MS Marco ranking sample application. ETA mid next week.

bmw-friedrich-mayr commented 2 years ago

That's great. Thank you @jobergum 🙂!

jobergum commented 2 years ago

Just an observation on batching input sequences and CPU serving (Vespa is currently CPU-only).

Consider this from Optimizing bert model for intel cpu cores

If you look at the number of sequences/second (Y axis) and compare the batch sizes per input sequence length, there is not much change in sequences per second across batch sizes 1, 2, 4 and 8. For a cross-encoder, which takes both the query and the passage as input, sequence length 128 is the relevant part of the graph.

As you probably noticed cross-encoders are very slow in Vespa.

The MiniLM model we used for the MS Marco Passage ranking has 6 layers and uses int8 weights.

With 12 threads per search, batch size 1 and 24 input sequences (the rerank count) of sequence length 128, the Vespa backend is able to do 480 inferences per second.
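A back-of-the-envelope reading of these figures, as a sketch only. It assumes the 480 inferences per second is the aggregate rate across the 12 threads and that the work divides evenly between them; the comment does not state either, and the variable names are illustrative.

```python
# Figures quoted in the comment above.
inferences_per_second = 480   # model evaluations per second
rerank_count = 24             # sequences scored per query
threads_per_search = 12       # threads used per search

# Every query needs `rerank_count` inferences.
queries_per_second = inferences_per_second / rerank_count

# Assumption: the 480/s is shared evenly by the 12 threads.
per_thread_rate = inferences_per_second / threads_per_search
ms_per_inference = 1000 / per_thread_rate

print(f"{queries_per_second:.0f} reranked queries/s")
print(f"{per_thread_rate:.0f} inferences/s per thread, "
      f"about {ms_per_inference:.0f} ms per inference")
```

Under those assumptions this works out to 20 reranked queries per second, with each thread spending roughly 25 ms on a single 128-token sequence.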

jobergum commented 2 years ago

The re-ranking example is ready, plus many other improvements to the mentioned sample application. I've moved all transformer query encoders to the stateless layer and removed the query document type which was used to compute the query encoders. This avoids a network round trip and would also make it easier to use auto-scaling features, as compute is moved out of the content nodes.

We need some work in Vespa to be able to tune onnx-rt intra threads per model before pushing this to master, as one wants to use a different number of intra threads per model for optimal performance. For example, the query encoders operate on short sequences: the ColBERT multi-representation model always uses sequence length 32 due to the padding with mask tokens up to max length, while the dense single-representation model is dynamic up to max length. For the latter, using more than 1 intra thread is just waste without real benefit. The all-to-all interaction model, with its longer sequence length of 128, can benefit from more intra threads.

So in short, we will need to add support for configuring intra threads per model.

Preliminary benchmarking shows no improvement in passages/s with batch > 1 over batch size 1, maybe even a slight decrease, which is similar to the results in the above comment.

bmw-friedrich-mayr commented 2 years ago

Thanks @jobergum. I'm amazed by your fast support. Do I understand correctly that batching has little effect in your experiment? If this is the case, there has to be something wrong. Reranking in Vespa is considerably slower than done with the sentence-transformers library.

Are you sure you pass the full batch to the model at once?

cross_model = '<MODEL>'
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained(cross_model)
tokenizer = AutoTokenizer.from_pretrained(cross_model)
model.eval()

query = "How many citizens has Berlin?"
docs = ["Berlin has a population of 3.5 million in 2016, up slightly from 3.4 million in 2014", ...]
features = tokenizer([query for doc in docs], docs, padding=True, truncation=True, return_tensors="pt", max_length=128)

with torch.no_grad():
    scores = model(**features).logits
    print(scores)

jobergum commented 2 years ago

@bmw-friedrich-mayr,

Yes. I've compared batch size 24 with different numbers of intra threads (1, 2, 4, 8, 12, 24) against batch size 1 with 12 external threads, for the same sequence length. By external threads I mean how we evaluate ONNX models in the stateful content nodes, where we use multiple threads to run inference but with intra threads set to 1.

See https://github.com/vespa-engine/vespa/issues/18882#issuecomment-916764298; this blog post also reports similar results.

To get a higher number of inferences per second when increasing the batch size from 1 to 4, the latency of batch size 4 needs to be less than 4x the latency of batch size 1. Only then does batching give you more inferences per second.
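The two setups being compared can be sketched roughly as follows. This is not Vespa code: `score_batch` is a hypothetical stand-in for a single model invocation (here just a word-overlap count), and the function names are made up for illustration. The point is only the shape of the two strategies: one invocation over the whole batch, versus many batch-size-1 invocations spread over external threads.

```python
from concurrent.futures import ThreadPoolExecutor


def score_batch(query, docs):
    """Hypothetical stand-in for one model invocation: one score per doc."""
    query_terms = set(query.lower().split())
    return [float(len(query_terms & set(doc.lower().split()))) for doc in docs]


def rerank_single_batch(query, docs):
    # One invocation with batch size len(docs); any parallelism would come
    # from the runtime's intra threads inside that single call.
    return score_batch(query, docs)


def rerank_external_threads(query, docs, workers=12):
    # len(docs) invocations with batch size 1, spread over `workers` threads;
    # each invocation would run with intra threads set to 1.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_doc = pool.map(lambda doc: score_batch(query, [doc]), docs)
        return [scores[0] for scores in per_doc]


query = "who shot kennedy"
docs = ["Oswald shot Kennedy in Dallas", "Berlin is the capital of Germany"] * 12
batched = rerank_single_batch(query, docs)
threaded = rerank_external_threads(query, docs)
print(batched == threaded, len(batched))
```

Both paths produce the same scores for the 24 candidates; what differs is where the threads are spent, which is what the benchmark compares.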

CPU:

Sequence length 128, Batch = 1, ONNX Runtime CPU: 189.56 ms
Sequence length 128, Batch = 4, ONNX Runtime CPU: 674.05 ms

= 3.55x

Sequence length 256, Batch = 1, ONNX Runtime CPU: 360.91 ms
Sequence length 256, Batch = 4, ONNX Runtime CPU: 1485.55 ms

= 4.1x

GPU/FP32:

Sequence length 128, Batch = 1, ONNX Runtime GPU/FP32: 4.30 ms
Sequence length 128, Batch = 4, ONNX Runtime GPU/FP32: 11.28 ms

= 2.6x

Sequence length 256, Batch = 1, ONNX Runtime GPU/FP32: 7.37 ms
Sequence length 256, Batch = 4, ONNX Runtime GPU/FP32: 20.34 ms

= 2.75x

With a GPU, batching gives a much higher number of inferences per second, whereas on CPU you don't get the same effect: with sequence length 256 you get fewer inferences per second, and with 128 slightly more.

That said, we plan to add GPU support, which benefits from batching, and that might happen sooner rather than later now that we have support for doing inference at the container layer. Tracked in https://github.com/vespa-engine/vespa/issues/14406
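The rule of thumb above can be checked directly against the quoted latencies. The following is a small sketch; the latency figures are copied from this comment, while the helper names `inferences_per_second` and `batching_gain` are made up for illustration.

```python
# Latencies in ms, copied from the figures above:
# (hardware, sequence length) -> {batch size: latency in ms}
latency_ms = {
    ("CPU", 128): {1: 189.56, 4: 674.05},
    ("CPU", 256): {1: 360.91, 4: 1485.55},
    ("GPU/FP32", 128): {1: 4.30, 4: 11.28},
    ("GPU/FP32", 256): {1: 7.37, 4: 20.34},
}


def inferences_per_second(batch_size, latency):
    """Sequences scored per second when one batch takes `latency` ms."""
    return batch_size * 1000.0 / latency


def batching_gain(timings, batch_size=4):
    """Throughput at `batch_size` relative to batch size 1 (> 1 means batching helps)."""
    batched = inferences_per_second(batch_size, timings[batch_size])
    single = inferences_per_second(1, timings[1])
    return batched / single


for (hardware, seq_len), timings in latency_ms.items():
    print(f"{hardware:9s} seq={seq_len}: batch 4 gives "
          f"{batching_gain(timings):.2f}x the throughput of batch 1")
```

A gain above 1 means the batch-4 latency stayed below 4x the batch-1 latency. With these figures only the CPU run at sequence length 256 falls below 1.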

jobergum commented 2 years ago

Do I understand correctly that batching has little effect in your experiment?

Batching has little effect in my experiments and in other experiments reported online, on CPU that is. GPU is a different story, as demonstrated in the comment above.

If this is the case, there has to be something wrong.

Why is that?

Reranking in Vespa is considerably slower than done with the sentence-transformers library.

Your examples with transformers/PyTorch will use several threads by default to evaluate the model.

Here is an example using a batch of 24:

from transformers import BertTokenizer, AutoModelForSequenceClassification
import torch
name="cross-encoder/ms-marco-MiniLM-L-6-v2"
model = AutoModelForSequenceClassification.from_pretrained(name)
tokenizer = BertTokenizer.from_pretrained(name)
model.eval()
from time import time

query = "who shot kennedy?"
text =  "Lee Harvey Oswald (October 18, 1939 – November 24, 1963) was an American former Marine and Marxist who assassinated United States President John F. Kennedy on November 22, 1963. According to four federal government investigations and one municipal investigation, Oswald shot and killed Kennedy from a sniper's nest as the President traveled by motorcade through Dealey Plaza in the city of Dallas, Texas." 
docs = [text for i in range(0,24)]

features = tokenizer([query for doc in docs], docs, padding='max_length', truncation=False, return_tensors="pt", max_length=128)

with torch.no_grad():
  start = int(time() * 1000)
  scores = model(**features).logits
  duration = int(time() * 1000) - start
  print("Inference took %d ms " % duration )

Simple runs controlling number of threads used to evaluate the model:

OMP_NUM_THREADS=1 python3 test.py 
Inference took 794 ms 
OMP_NUM_THREADS=2 python3 test.py 
Inference took 420 ms 
OMP_NUM_THREADS=4 python3 test.py 
Inference took 226 ms 
OMP_NUM_THREADS=8 python3 test.py 
Inference took 141 ms 
OMP_NUM_THREADS=16 python3 test.py 
Inference took 91 ms 
OMP_NUM_THREADS=32 python3 test.py 
Inference took 98 ms

One wants to balance how many threads are used to evaluate the model against the latency improvement they bring. In the above example there was a good latency reduction from using 2 threads instead of 1, but much less from going from 8 to 16. Using 16 threads instead of 8 would significantly reduce the inferences/s throughput we can support, as we would spend twice as many threads for only a modest latency improvement. There is more on this topic in another blog post from the Hugging Face team.

We want to balance throughput against latency so that we find a sweet spot where latency is within our target SLA without hurting throughput too much. This is why we want to expose how many threads are used in model evaluation, so that users can find this sweet spot on a per-model basis. This is similar to the OMP_NUM_THREADS usage above. See https://github.com/vespa-engine/vespa/issues/19084 on how we plan to expose this.

Comparing model inference latency without also taking into account the resources (e.g. CPU threads) used to achieve it makes little sense.
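To make the trade-off concrete, the measured runs above can be turned into speedup and per-thread efficiency figures. This is a sketch: the latencies are the ones printed by the OMP_NUM_THREADS runs, while the `speedup`, `efficiency` and `inferences_per_thread_second` helpers are illustrative names, and the per-thread figure assumes the threads are fully occupied for the duration of the call.

```python
# Batch-of-24 latencies in ms from the OMP_NUM_THREADS runs above.
latency_ms = {1: 794, 2: 420, 4: 226, 8: 141, 16: 91, 32: 98}
batch_size = 24


def speedup(threads):
    """Latency improvement relative to the single-thread run."""
    return latency_ms[1] / latency_ms[threads]


def efficiency(threads):
    """Speedup divided by thread count; 1.0 would be perfect scaling."""
    return speedup(threads) / threads


def inferences_per_thread_second(threads):
    """Sequences scored per second of CPU-thread time spent (assumes busy threads)."""
    return batch_size * 1000.0 / latency_ms[threads] / threads


for threads in latency_ms:
    print(f"{threads:2d} threads: {latency_ms[threads]:3d} ms, "
          f"speedup {speedup(threads):.2f}x, "
          f"efficiency {efficiency(threads):.2f}, "
          f"{inferences_per_thread_second(threads):.1f} inferences per thread-second")
```

Efficiency drops steadily as threads are added, so the sweet spot is the smallest thread count whose latency still meets the SLA.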

jobergum commented 2 years ago

https://github.com/vespa-engine/sample-apps/pull/705 introduces a working example using batch re-ranking in the stateless container.

jobergum commented 2 years ago

Mentioned PR merged, go check out https://github.com/vespa-engine/sample-apps/blob/master/msmarco-ranking/passage-ranking.md

See the ReRankingSearcher. Let me know if you have any questions on this before I close this one out. Thank you.

vespa-engine / vespa

Batching / Transformer-based Parallelization for Cross-Encoders #18882
