aorticweb opened this issue 8 months ago
All GPU operations are asynchronous, so this difference may just be a measurement artifact. See https://pytorch.org/tutorials/recipes/recipes/benchmark.html#pytorch-benchmark for example.
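For reference, one way to separate the compute cost from the copy cost in candle (Rust) is to make sure all queued GPU work has finished before starting the timer for the copy. A minimal sketch, assuming `Device::synchronize` is available in the candle version in use (if it is not, materializing any small tensor on the CPU first has the same blocking effect):

```rust
use candle_core::{Device, Result, Tensor};
use std::time::Instant;

/// Times only the device-to-host copy of `a`.
fn time_d2h_copy(a: &Tensor, dev: &Device) -> Result<std::time::Duration> {
    // Without this, the timing below may also absorb whatever is still queued
    // on the device (e.g. the forward passes that produced `a`).
    // Assumption: Device::synchronize exists in the candle version in use.
    dev.synchronize()?;
    let start = Instant::now();
    let _a_cpu = a.to_device(&Device::Cpu)?;
    Ok(start.elapsed())
}
```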
It's expensive because in this case the tensor is of size `[N_TOKENS, HIDDEN_SIZE]` instead of `[BATCH_SIZE, HIDDEN_SIZE]`.
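If what is ultimately needed on the CPU is one embedding per sentence rather than one vector per token, the transfer volume can be cut by reducing on the device first. A minimal sketch, assuming the per-token hidden states come out as a `[BATCH, SEQ_LEN, HIDDEN]` tensor and that plain mean pooling is acceptable (a real MiniLM pipeline would also mask padding tokens before averaging):

```rust
use candle_core::{Device, Result, Tensor};

fn pooled_to_cpu(hidden_states: &Tensor) -> Result<Tensor> {
    // hidden_states: [BATCH, SEQ_LEN, HIDDEN], living on the GPU.
    // Mean-pool over the token dimension on-device so that only a
    // [BATCH, HIDDEN] tensor has to cross the GPU<->CPU boundary.
    let pooled = hidden_states.mean(1)?; // [BATCH, HIDDEN]
    pooled.to_device(&Device::Cpu)
}
```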
It is the bottleneck because the GPU<->CPU bus has low throughput.
This transfer should happen on a different stream from the compute stream to make it asynchronous and avoid stalling the GPU. This is an improvement I have in mind but have yet to implement.
Context
You mentioned here that this operation is expensive. Is this a candle issue? Any idea why, or how to solve it?
Running on
M1 Pro Mac with 32 GB of memory, with `--features metal` enabled
Speculations
After some loose experimentation, I realized that the move from the GPU to the CPU is the rate-limiting step here, but I am not sure why.
I ran the following test:

1. load the `sentence-transformers/all-MiniLM-L6-v2` model in candle
2. generate embeddings for 1000 sentences, creating a tensor A of size [1000, 384] (I extracted the embeddings by doing …)
3. generate a tensor B of the same size manually and place it on the GPU (… where d is a GPU device)
4. moving A from GPU to CPU --> 18s
5. moving B from GPU to CPU --> 1.79ms (yes, milliseconds)
Although the tensors have the same size, the transfer times differ by roughly four orders of magnitude.
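For scale: a [1000, 384] f32 tensor is only about 1000 × 384 × 4 bytes ≈ 1.5 MB, which even a slow link should move in milliseconds, so the 18 s is hard to explain by the copy alone; it would be consistent with the copy having to wait for queued compute, as suggested above. A rough way to test that without the model is to queue some throwaway GPU work and time the same copy with and without waiting for it to finish first. A minimal sketch (Metal device; it assumes `Device::synchronize` exists in the candle version in use):

```rust
use candle_core::{Device, Result, Tensor};
use std::time::Instant;

fn time_copy(label: &str, t: &Tensor) -> Result<()> {
    let start = Instant::now();
    let _cpu = t.to_device(&Device::Cpu)?;
    println!("{label}: {:?}", start.elapsed());
    Ok(())
}

fn main() -> Result<()> {
    let dev = Device::new_metal(0)?;

    // Tensor B: created on the CPU and placed on the GPU, with nothing queued behind it.
    let b = Tensor::randn(0f32, 1f32, (1000, 384), &Device::Cpu)?.to_device(&dev)?;
    time_copy("B, idle GPU", &b)?;

    // Tensor A: the result of a chain of GPU work (a stand-in for the 1000
    // forward passes). If the copy is timed immediately, the measurement may
    // also pay for whatever work is still queued.
    let x = Tensor::randn(0f32, 1f32, (1000, 384), &Device::Cpu)?.to_device(&dev)?;
    let w = Tensor::randn(0f32, 1f32, (384, 384), &Device::Cpu)?.to_device(&dev)?;
    let mut a = x;
    for _ in 0..100 {
        a = a.matmul(&w)?;
    }
    time_copy("A, right after queuing the matmuls", &a)?;

    // Same copy again, but only after the queued work has completed.
    // Assumption: Device::synchronize is available in this candle version.
    dev.synchronize()?;
    time_copy("A, after synchronizing", &a)?;
    Ok(())
}
```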