Closed bmw-friedrich-mayr closed 2 years ago
@bmw-friedrich-mayr
I've started on this work, and a re-ranking in container example will be added to the MS Marco ranking sample application. ETA mid next week.
That's great. Thank you @jobergum 🙂!
Just an observation on batching input sequences and CPU serving (Vespa is currently CPU only).
Consider this from "Optimizing bert model for intel cpu cores":
If you look at the number of sequences per second (Y axis) and compare the batch sizes per input sequence length, there is not much change in sequences per second across batch sizes 1, 2, 4, and 8. For a cross-encoder that takes both the query and the passage as input, sequence length 128 is the relevant part of the graph.
As you probably noticed cross-encoders are very slow in Vespa.
The MiniLM model we used for the MS Marco Passage ranking has 6 layers and uses int8 weights.
With 12 threads per search, batch size 1, and 24 sequence inputs (rerank count) of sequence length 128, the Vespa backend is able to do 480 inferences per second.
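As a rough back-of-the-envelope sketch of what that figure means per query, assuming every query re-ranks exactly 24 passages and the 480 inferences per second is sustained (the variable names are illustrative, not Vespa settings):

```python
# Figures quoted above; names are illustrative, not Vespa settings.
inferences_per_second = 480   # cross-encoder evaluations per second
rerank_count = 24             # passages re-ranked per query

# Each query costs `rerank_count` inferences, so the backend sustains:
queries_per_second = inferences_per_second / rerank_count

# Amortized wall-clock time per inference with all 12 threads busy:
ms_per_inference = 1000 / inferences_per_second

print(f"{queries_per_second:.0f} queries/s, {ms_per_inference:.2f} ms per inference")
```

Under those assumptions the re-ranking phase supports about 20 queries per second.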
The re-ranking example is ready, plus many other improvements to the mentioned sample application. I've moved all transformer query encoders to the stateless layer and removed the query document type which was used to compute the query encoders. This avoids a network round trip and would also make it easier to use auto-scaling features, as compute is moved out of the content nodes.
We need some work in Vespa to be able to tune onnx-rt intra threads per model before pushing this to master, as one wants to use a different number of intra threads per model for optimal performance. For example, the query encoders operate on short sequences: the ColBERT multi representation model always uses sequence length 32 due to the padding of mask tokens up to max length, while the dense single representation model is dynamic up to max length. For the latter, using more than 1 intra thread is just waste without real benefit. The all to all interaction model with the longer sequence length 128 can benefit from more intra threads.
So in short, we will need to add support for configuring intra threads per model.
Preliminary benchmarking shows no improvement in passages/s with batch size > 1 over batch size 1, and possibly even a slight decrease, which is similar to the results in the comment above.
Thanks @jobergum. I'm amazed by your fast support. Do I understand correctly that batching has little effect in your experiment? If this is the case, there has to be something wrong. Reranking in Vespa is considerably slower than with the sentence-transformers library.
Are you sure you pass the full batch to the model at once?
cross_model = '<MODEL>'
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model = AutoModelForSequenceClassification.from_pretrained(cross_model)
tokenizer = AutoTokenizer.from_pretrained(cross_model)
model.eval()
query = "How many citizens has Berlin?"
docs = ["Berlin has a population of 3.5 million in 2016, up slightly from 3.4 million in 2014", ...]
features = tokenizer([query] * len(docs), docs, padding=True, truncation=True, return_tensors="pt", max_length=128)
with torch.no_grad():
    scores = model(**features).logits
print(scores)
@bmw-friedrich-mayr,
Yes. I've compared batch size 24 with different numbers of intra threads (1, 2, 4, 8, 12, 24) against batch size 1 with 12 external threads, for the same sequence length. By external threads I mean how we evaluate ONNX models in the stateful content nodes, where we use multiple threads to run inference but with intra threads set to 1.
See https://github.com/vespa-engine/vespa/issues/18882#issuecomment-916764298; this blog post also reports similar results.
To get a higher number of inferences per second when increasing the batch size from 1 to 4, the latency of batch size 4 needs to be less than 4x the latency of batch size 1. Only if that is the case do you get more inferences per second from batching.
Sequence length 128, Batch = 1, ONNX Runtime CPU: 189.56 ms
Sequence length 128, Batch = 4, ONNX Runtime CPU: 674.05 ms
= 3.55x
Sequence length 256, Batch = 1, ONNX Runtime CPU: 360.91 ms
Sequence length 256, Batch = 4, ONNX Runtime CPU: 1485.55 ms
= 4.1x
Sequence length 128, Batch = 1, ONNX Runtime GPU/FP32: 4.30 ms
Sequence length 128, Batch = 4, ONNX Runtime GPU/FP32: 11.28 ms
= 2.6x
Sequence length 256, Batch = 1, ONNX Runtime GPU/FP32: 7.37 ms
Sequence length 256, Batch = 4, ONNX Runtime GPU/FP32: 20.34 ms
= 2.75x
With a GPU, batching gives you a much higher number of inferences per second, while on CPU you don't get the same effect: with sequence length 256 you get fewer inferences per second, and with 128 slightly more.
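The comparison can be checked with a small sketch that converts the latencies quoted above into sequences per second. The function and variable names are illustrative; it assumes a single inference stream, so throughput is simply batch size divided by latency.

```python
# Latencies in milliseconds, copied from the measurements above.
# Keys are (device, sequence_length); values are (batch-1 latency, batch-4 latency).
measurements = {
    ("CPU", 128): (189.56, 674.05),
    ("CPU", 256): (360.91, 1485.55),
    ("GPU/FP32", 128): (4.30, 11.28),
    ("GPU/FP32", 256): (7.37, 20.34),
}
BATCH = 4


def sequences_per_second(batch_size, latency_ms):
    """Throughput of one inference stream: sequences finished per second."""
    return batch_size / (latency_ms / 1000.0)


def batching_gain(lat_b1, lat_b4):
    """Batch-4 throughput divided by batch-1 throughput; above 1 means batching helps."""
    return sequences_per_second(BATCH, lat_b4) / sequences_per_second(1, lat_b1)


gains = {key: batching_gain(*lat) for key, lat in measurements.items()}
for (device, seq_len), gain in gains.items():
    print(f"{device:9s} seq={seq_len}: batch-4 throughput is {gain:.2f}x batch-1")
```

A gain above 1 corresponds to the batch-4 latency being less than 4x the batch-1 latency. On these numbers the CPU gain is roughly 1.12x at length 128 and below 1 at length 256, while the GPU gains are around 1.5x.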
That said, we plan to add GPU support, which benefits from batching, and that might happen sooner rather than later now that we have support for doing inference at the container layer. Tracked in https://github.com/vespa-engine/vespa/issues/14406
Do I understand correctly that batching has little effect in your experiment?
Batching has little effect in my experiments or in other experiments reported online, on CPU that is. GPU is a different story, as demonstrated in the comment above.
If this is the case, there has to be something wrong.
Why is that?
Reranking in Vespa is considerably slower than done with the sentence-transformers library.
Your examples with transformers/PyTorch will use several threads by default to evaluate the model.
Here is an example using a batch of 24:
from transformers import BertTokenizer, AutoModelForSequenceClassification
import torch
name="cross-encoder/ms-marco-MiniLM-L-6-v2"
model = AutoModelForSequenceClassification.from_pretrained(name)
tokenizer = BertTokenizer.from_pretrained(name)
model.eval()
from time import time
query = "who shot kennedy?"
text = "Lee Harvey Oswald (October 18, 1939 – November 24, 1963) was an American former Marine and Marxist who assassinated United States President John F. Kennedy on November 22, 1963. According to four federal government investigations and one municipal investigation, Oswald shot and killed Kennedy from a sniper's nest as the President traveled by motorcade through Dealey Plaza in the city of Dallas, Texas."
docs = [text for i in range(0,24)]
features = tokenizer([query for doc in docs], docs, padding='max_length', truncation=False, return_tensors="pt", max_length=128)
with torch.no_grad():
    start = int(time() * 1000)
    scores = model(**features).logits
    duration = int(time() * 1000) - start
print("Inference took %d ms " % duration)
Simple runs controlling number of threads used to evaluate the model:
OMP_NUM_THREADS=1 python3 test.py
Inference took 794 ms
OMP_NUM_THREADS=2 python3 test.py
Inference took 420 ms
OMP_NUM_THREADS=4 python3 test.py
Inference took 226 ms
OMP_NUM_THREADS=8 python3 test.py
Inference took 141 ms
OMP_NUM_THREADS=16 python3 test.py
Inference took 91 ms
OMP_NUM_THREADS=32 python3 test.py
Inference took 98 ms
One wants to balance the number of threads used to evaluate the model against the latency improvement it brings. In the example above there was a good latency reduction from using 2 threads instead of 1, but not so much from increasing from 8 to 16. If we used 16 threads instead of 8, we would significantly reduce the supported inferences/s throughput, as we would tie up 16 threads instead of 8 for pretty much the same latency. There is more on this topic in another blog post from the Hugging Face team.
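To make the trade-off concrete, here is a small sketch that derives speedup and per-thread efficiency from the OMP_NUM_THREADS timings above. The function name and the efficiency definition (speedup divided by thread count) are illustrative choices, and the timings are single runs, so treat the output as indicative only.

```python
# Wall-clock latency (ms) for one batch of 24 sequences, from the runs above.
latency_ms = {1: 794, 2: 420, 4: 226, 8: 141, 16: 91, 32: 98}
BATCH = 24


def scaling_report(latency_by_threads, batch_size):
    """Return (threads, speedup, efficiency, sequences/s per thread) rows."""
    base = latency_by_threads[1]
    rows = []
    for threads, ms in sorted(latency_by_threads.items()):
        speedup = base / ms                      # relative to the 1-thread run
        efficiency = speedup / threads           # 1.0 would be perfect scaling
        per_thread = batch_size / (ms / 1000.0) / threads
        rows.append((threads, speedup, efficiency, per_thread))
    return rows


report = scaling_report(latency_ms, BATCH)
for threads, speedup, efficiency, per_thread in report:
    print(f"{threads:2d} threads: {speedup:4.2f}x speedup, "
          f"{efficiency:4.0%} efficiency, {per_thread:5.1f} seq/s per thread")
```

The per-thread throughput drops steadily as threads are added, which is the cost paid for the lower latency.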
We want to balance throughput versus latency so that we find a sweet spot where latency is within our target SLA without hurting throughput too much. This is why we want to expose how many threads are used in the model evaluation, so users can find this sweet spot on a per model basis. This is similar to the OMP_NUM_THREADS usage above. See https://github.com/vespa-engine/vespa/issues/19084 on how we plan to expose this.
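One simple way to look for such a sweet spot, sketched here with the timings from the runs above, is to pick the smallest thread count that still meets the latency target. The SLA values and the helper name are hypothetical, and the rule assumes that fewer threads per evaluation leave more capacity for concurrent requests.

```python
# Measured batch-of-24 latencies (ms) per thread count, from the runs above.
latency_ms = {1: 794, 2: 420, 4: 226, 8: 141, 16: 91, 32: 98}


def cheapest_threads_within_sla(latency_by_threads, sla_ms):
    """Smallest thread count whose latency meets the SLA, or None if none does."""
    candidates = [t for t, ms in latency_by_threads.items() if ms <= sla_ms]
    return min(candidates) if candidates else None


# Hypothetical latency targets; substitute your own SLA.
for sla_ms in (500, 250, 150, 50):
    print(sla_ms, "ms ->", cheapest_threads_within_sla(latency_ms, sla_ms))
```

A result of None means no measured configuration meets the target, in which case a smaller model, a lower rerank count, or different hardware would be needed.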
Comparing model inference latency without also taking into account the resources (e.g. CPU threads) used to achieve it makes little sense.
https://github.com/vespa-engine/sample-apps/pull/705 introduces a working example using batch re-ranking in the stateless container.
Mentioned PR merged, go check out https://github.com/vespa-engine/sample-apps/blob/master/msmarco-ranking/passage-ranking.md
See the ReRankingSearcher. Let me know if you have any questions on this before I close this one out. Thank you.
I opened this feature request on the basis of this Slack thread and @jobergum's comment.
As mentioned by many Vespa users in Slack, it would be awesome if there was an example of batching for cross-encoders. (Maybe even with 100 documents for reranking, as batching speeds up cross-encoders tremendously.)
As you probably noticed, cross-encoders are very slow in Vespa. I know cross-encoders are considerably slower than bi-encoders. But I think Vespa misses a crucial thing that speeds them up tremendously: batching. Transformers are able to process multiple inputs in parallel. From my point of view, Vespa feeds only one sample at a time to a cross-encoder, which is detrimental to performance.
Parallel calculation with Hugging Face: