huggingface / text-embeddings-inference

A blazing fast inference solution for text embeddings models
https://huggingface.co/docs/text-embeddings-inference/quick_tour

Context on slow Tensor to Vec conversion #175

Open aorticweb opened 8 months ago

aorticweb commented 8 months ago

Context

You mentioned here that this operation is expensive. Is this a candle issue? Any idea why, or how to solve it?

Running on

M1 Pro Mac, 32 GB memory, with --features metal on

Speculations

After doing some loose experimentation, I realized that the move from the GPU to the CPU is the rate-limiting step here, but I am not sure why.

I ran the following test:

Although the tensors have the same size ...
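For illustration, a minimal sketch of the kind of timing I mean, using candle on the Metal device. The shapes are made up and this is not the original test; it assumes `Device::synchronize` (present in recent candle versions) so the measurement isn't skewed by kernels still in flight:

```rust
use candle_core::{Device, Tensor};
use std::time::Instant;

fn main() -> candle_core::Result<()> {
    // Made-up shapes, for illustration only.
    let (n_tokens, hidden_size) = (4096, 1024);
    let metal = Device::new_metal(0)?;

    let t = Tensor::randn(0f32, 1f32, (n_tokens, hidden_size), &metal)?;
    // Drain the queue so the timings below measure the transfer itself,
    // not compute that was still queued.
    metal.synchronize()?;

    // Time the device-to-host move on its own.
    let start = Instant::now();
    let on_cpu = t.to_device(&Device::Cpu)?;
    println!("to_device(Cpu): {:?}", start.elapsed());

    // Time the Vec conversion once the data is already host-side.
    let start = Instant::now();
    let _v: Vec<Vec<f32>> = on_cpu.to_vec2()?;
    println!("to_vec2 on CPU: {:?}", start.elapsed());

    Ok(())
}
```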

OlivierDehaene commented 8 months ago

All GPU operations are asynchronous, so this difference may be an artifact of that. See https://pytorch.org/tutorials/recipes/recipes/benchmark.html#pytorch-benchmark for example.
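For the same reason, synchronization points matter when timing a single op in candle; a rough sketch of what I mean, assuming `Device::synchronize` drains the device queue:

```rust
use candle_core::{Device, Result, Tensor};
use std::time::{Duration, Instant};

// Without the two synchronize() calls, elapsed() would mostly measure
// kernel *launch* overhead rather than kernel execution time.
fn timed_matmul(a: &Tensor, b: &Tensor, dev: &Device) -> Result<Duration> {
    dev.synchronize()?; // wait for previously queued work
    let start = Instant::now();
    let _c = a.matmul(b)?;
    dev.synchronize()?; // wait for this kernel to actually finish
    Ok(start.elapsed())
}
```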

It's expensive because in this case the tensor is of size [N_TOKENS, HIDDEN_SIZE] instead of [BATCH_SIZE, HIDDEN_SIZE]. It is the bottleneck because the GPU<->CPU bus has low throughput.
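To make the size difference concrete with made-up numbers: for a batch of 32 sequences of 512 tokens each, hidden size 1024, in f32, the [N_TOKENS, HIDDEN_SIZE] tensor is 32 × 512 × 1024 × 4 B ≈ 64 MiB, while the pooled [BATCH_SIZE, HIDDEN_SIZE] tensor is only 32 × 1024 × 4 B = 128 KiB, i.e. 512× less data crossing the bus.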

This transfer should happen on a different stream from the compute stream to make it asynchronous and avoid stalling the GPU. This is an improvement I have in mind but have yet to implement.
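As far as I know candle does not expose a user-facing stream API, so as a host-side approximation of the same idea, the blocking copy could be pushed onto a dedicated transfer thread. A sketch of the concept only, not the planned implementation:

```rust
use candle_core::{Device, Tensor};
use std::sync::mpsc;
use std::thread;

// Sketch: hand finished tensors to a dedicated transfer thread so the
// compute loop never blocks on the GPU<->CPU copy. A real stream-based
// fix would live inside the backend instead.
fn spawn_transfer_thread() -> (mpsc::Sender<Tensor>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<Tensor>();
    let handle = thread::spawn(move || {
        for t in rx {
            // The blocking device-to-host copy happens here, off the
            // compute thread.
            if let Ok(cpu) = t.to_device(&Device::Cpu) {
                if let Ok(v) = cpu.to_vec2::<f32>() {
                    let _embeddings: Vec<Vec<f32>> = v; // hand off downstream
                }
            }
        }
    });
    (tx, handle)
}
```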

aorticweb commented 8 months ago

Do you have any pointers (guides/docs) on how to move the transfer to a different stream?

Could this be related to this?