aorticweb opened this issue 8 months ago
All GPU operations are asynchronous, so this difference may just be a measurement artifact. See https://pytorch.org/tutorials/recipes/recipes/benchmark.html#pytorch-benchmark for example.
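For reference, one way to separate the compute cost from the copy cost in candle (Rust) is to make sure all queued GPU work has finished before starting the timer for the copy. A minimal sketch, assuming `Device::synchronize` is available in the candle version in use (if it is not, materializing any small tensor on the CPU first has the same blocking effect):

```rust
use candle_core::{Device, Result, Tensor};
use std::time::Instant;

/// Times only the device-to-host copy of `a`.
fn time_d2h_copy(a: &Tensor, dev: &Device) -> Result<std::time::Duration> {
    // Without this, the timing below may also absorb whatever is still queued
    // on the device (e.g. the forward passes that produced `a`).
    // Assumption: Device::synchronize exists in the candle version in use.
    dev.synchronize()?;
    let start = Instant::now();
    let _a_cpu = a.to_device(&Device::Cpu)?;
    Ok(start.elapsed())
}
```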
It's expensive because in this case the tensor is of size `[N_TOKENS, HIDDEN_SIZE]` instead of `[BATCH_SIZE, HIDDEN_SIZE]`.
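If what is ultimately needed on the CPU is one embedding per sentence rather than one vector per token, the transfer volume can be cut by reducing on the device first. A minimal sketch, assuming the per-token hidden states come out as a `[BATCH, SEQ_LEN, HIDDEN]` tensor and that plain mean pooling is acceptable (a real MiniLM pipeline would also mask padding tokens before averaging):

```rust
use candle_core::{Device, Result, Tensor};

fn pooled_to_cpu(hidden_states: &Tensor) -> Result<Tensor> {
    // hidden_states: [BATCH, SEQ_LEN, HIDDEN], living on the GPU.
    // Mean-pool over the token dimension on-device so that only a
    // [BATCH, HIDDEN] tensor has to cross the GPU<->CPU boundary.
    let pooled = hidden_states.mean(1)?; // [BATCH, HIDDEN]
    pooled.to_device(&Device::Cpu)
}
```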
It is the bottleneck because the GPU<->CPU bus has low throughput.
This transfer should happen on a different stream from the compute stream to make it asynchronous and avoid stalling the GPU. This is an improvement I have in mind but have yet to implement.
Context
You mentioned here that this operation is expensive. Is this a candle issue? Any idea why, or how to solve it?
Running on
M1 Pro Mac with 32 GB of memory, with `--features metal` enabled
Speculations
After some loose experimentation, I realized that the move from the GPU to the CPU is the rate-limiting step here, but I am not sure why.
I ran the following test:

1. load the `sentence-transformers/all-MiniLM-L6-v2` model in candle
2. generate embeddings for 1000 sentences, creating a tensor A of size [1000, 384] (I extracted the embeddings by doing …)
3. generate a tensor B of the same size manually and place it on the GPU (… where d is a GPU device)
4. moving A from GPU to CPU --> 18s
5. moving B from GPU to CPU --> 1.79ms (yes, milliseconds)
Although the tensors have the same size, the transfer times differ by roughly four orders of magnitude.
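For scale: a [1000, 384] f32 tensor is only about 1000 × 384 × 4 bytes ≈ 1.5 MB, which even a slow link should move in milliseconds, so the 18 s is hard to explain by the copy alone; it would be consistent with the copy having to wait for queued compute, as suggested above. A rough way to test that without the model is to queue some throwaway GPU work and time the same copy with and without waiting for it to finish first. A minimal sketch (Metal device; it assumes `Device::synchronize` exists in the candle version in use):

```rust
use candle_core::{Device, Result, Tensor};
use std::time::Instant;

fn time_copy(label: &str, t: &Tensor) -> Result<()> {
    let start = Instant::now();
    let _cpu = t.to_device(&Device::Cpu)?;
    println!("{label}: {:?}", start.elapsed());
    Ok(())
}

fn main() -> Result<()> {
    let dev = Device::new_metal(0)?;

    // Tensor B: created on the CPU and placed on the GPU, with nothing queued behind it.
    let b = Tensor::randn(0f32, 1f32, (1000, 384), &Device::Cpu)?.to_device(&dev)?;
    time_copy("B, idle GPU", &b)?;

    // Tensor A: the result of a chain of GPU work (a stand-in for the 1000
    // forward passes). If the copy is timed immediately, the measurement may
    // also pay for whatever work is still queued.
    let x = Tensor::randn(0f32, 1f32, (1000, 384), &Device::Cpu)?.to_device(&dev)?;
    let w = Tensor::randn(0f32, 1f32, (384, 384), &Device::Cpu)?.to_device(&dev)?;
    let mut a = x;
    for _ in 0..100 {
        a = a.matmul(&w)?;
    }
    time_copy("A, right after queuing the matmuls", &a)?;

    // Same copy again, but only after the queued work has completed.
    // Assumption: Device::synchronize is available in this candle version.
    dev.synchronize()?;
    time_copy("A, after synchronizing", &a)?;
    Ok(())
}
```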