metal -> CPU Tensor::to_device is slow for model output

When using sentence-transformers/all-MiniLM-L6-v2 on m1 PRO mac (32GB ram) with --features metal on, attempting to generate embeddings for 1000 sentences and attempting to convert the final tensor to a Vec<Vec> the to_device function of Tensor seems very slow.

I believe to_device is important and must be performant in order to switch from Tensor to Vec and potentially Serialize the output from a model in an application

How to reproduce:

place the following code snipet (below) in candle-examples/examples/bert-metal/main.rs cargo run --example bert-metal --features metal

 use candle_transformers::models::bert::{BertModel, Config, DTYPE};

use anyhow::{Error as E, Result};
use candle::{Device, Tensor};
use candle_nn::VarBuilder;
use hf_hub::{api::sync::Api, Repo, RepoType};
use tokenizers::{PaddingParams, Tokenizer};

fn build_model_and_tokenizer() -> Result<(BertModel, Tokenizer)> {
    let device = candle_examples::device(false)?;
    let repo = Repo::with_revision(
        "sentence-transformers/all-MiniLM-L6-v2".to_string(),
        RepoType::Model,
        "refs/pr/21".to_string(),
    );
    let (config_filename, tokenizer_filename, weights_filename) = {
        let api = Api::new()?.repo(repo);
        let config = api.get("config.json")?;
        let tokenizer = api.get("tokenizer.json")?;
        let weights = api.get("model.safetensors")?;
        (config, tokenizer, weights)
    };
    let config = std::fs::read_to_string(config_filename)?;
    let config: Config = serde_json::from_str(&config)?;
    let tokenizer = Tokenizer::from_file(tokenizer_filename).map_err(E::msg)?;
    let vb = unsafe { VarBuilder::from_mmaped_safetensors(&[weights_filename], DTYPE, &device)? };
    let model = BertModel::load(vb, &config)?;
    Ok((model, tokenizer))
}

pub fn normalize_l2(v: &Tensor) -> Result<Tensor> {
    Ok(v.broadcast_div(&v.sqr()?.sum_keepdim(1)?.sqrt()?)?)
}

fn get_mock_2d_tensor(vec_count: usize, vec_size: usize, d: &Device) -> Tensor {
    let mut data = Vec::with_capacity(vec_count);
    for i in 0..vec_count {
        data.push(vec![i as f32; vec_size]);
    }
    Tensor::new(data, d).expect("failed to generate tensor")
}

fn main() -> Result<()> {
    let (model, mut tokenizer) = build_model_and_tokenizer()?;
    let device = &model.device;
    let sentence_count = 1_000;
    let sentences: Vec<&str> =
        vec!["This is a sample sentence that should be short enough"; sentence_count];
    let pp = PaddingParams {
        strategy: tokenizers::PaddingStrategy::Fixed(128),
        ..Default::default()
    };
    tokenizer.with_padding(Some(pp));
    let tokens = tokenizer.encode_batch(sentences, true).map_err(E::msg)?;
    let mut token_ids = Vec::with_capacity(tokens.len());
    for encoding in tokens {
        token_ids.push(encoding.get_ids().to_vec());
    }
    let token_ids = Tensor::new(token_ids, device)?;
    let token_type_ids = token_ids.zeros_like()?;

    println!("running inference on batch {:?}", token_ids.shape());
    let start = std::time::Instant::now();
    let embeddings = model.forward(&token_ids, &token_type_ids)?;
    println!("model forward timing {:?}", start.elapsed());
    let start = std::time::Instant::now();
    let (_n_sentence, n_tokens, _hidden_size) = embeddings.dims3()?;
    let embeddings = (embeddings.sum(1)? / (n_tokens as f64))?;
    let embeddings = normalize_l2(&embeddings)?;
    println!(
        "pooled embeddings {:?} in {:?}",
        embeddings.shape(),
        start.elapsed()
    );
    let start = std::time::Instant::now();
    embeddings.to_device(&Device::Cpu)?;
    println!(
        "moved embedding tensor back to cpu in {:?}",
        start.elapsed()
    );

    let mock_tensor = get_mock_2d_tensor(1000, 384, device);
    let start = std::time::Instant::now();
    let old_device = mock_tensor.device();
    mock_tensor.to_device(&Device::Cpu)?;
    println!(
        "moved mock tensor of size {:?} from {:?} back to cpu in {:?}",
        mock_tensor.shape(),
        old_device,
        start.elapsed()
    );
    Ok(())
}

output

running inference on batch [1000, 128]
model forward timing 185.251708ms
pooled embeddings [1000, 384] in 233.083µs
moved embedding tensor back to cpu in 18.590362041s
moved mock tensor of size [1000, 384] from Metal(MetalDevice(4294969642)) back to cpu in 517.167µs

I added a sample move of a manually generated tensor at the end to show that the issue might not be related to the size of the tensor.

Investigation

The issue might be related to theMetalStorage.to_cpu, when debugging I added a self.device.wait_until_completed statement at the top of the function and found that most of the time was spent there, could there be commands waiting to be flushed?
Is there a way to generate a new Tensor.storage or Tensor.storage.device that would only contain the [1000, 384] floats for an existing tensor? with a fresh command_queue, command_buffer and buffersfrom an existing Tensor to maybe bypass this issue?

I'm not a specialist of the metal api but I suspect it's asynchronous so candle schedule commands that run in the background on the gpu and it's blocking when candle tries to access the results and move them back to cpu. So your timings will say that most of the time is spent in to_device but actually this is measuring the whole computation. That's how things work in the cuda world and tools like nvsys allow you see what is actually running on the gpu. I imagine such tools also exist for metal but I wouldn't know of them (in the cuda world one can also set CUDA_LAUNCH_BLOCKING to get all the calls to be blocking but this has a significant impact on performance).

I am wondering is what are the pending GPU commands are, are they needed for the purpose of extracting the final tensor from the GPU to the CPU (only the dimension we need)? could we ignore these operations or are they part of the work done by model.encode in which case why don't model.encode wait for the GPU operations to complete?

After running the model.encode function my naive expectation is that the final Tensor has been calculated by the GPU and no further calculation should be pending.

This seems to me like a candle specific issue since when running the same model in pytorch (on the same hardware):

from typing import List, Optional
import time
import torch
from sentence_transformers import SentenceTransformer

def best_device():
    """In the following order mps (m1 macs), cuda, cpu (default)
    return the first available device detected

    Returns:
        torch.device: the detected torch device
    """

    if torch.backends.mps.is_available() and torch.backends.mps.is_built():
        print("Using MPS")
        return torch.device("mps")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

class SentenceTransformerVectorizer:
    __slots__ = "_model", "_device"
    _model: SentenceTransformer
    _device: torch.device

    def __init__(self, model_path: str, device: Optional[torch.device] = None):
        if device is None:
            device = best_device()
        self._device = device
        self._model = SentenceTransformer(model_path, device=device)
        self._model.eval()  # make sure we're in inference mode, not training

    def vectorize(self, batch: List[str]):
        return self._model.encode(
            batch, batch_size=min(1000, len(batch)), device=self._device
        )

texts = ["this is some random text x 1000"] * 1_000
vectorizer = SentenceTransformerVectorizer("all-MiniLM-L6-v2", best_device())
start = time.monotonic()
output = vectorizer.vectorize(texts)
elapsed = time.monotonic() - start
print(f"Elapsed time: {elapsed:.2f} seconds with shape: {output.shape}")

the embeddings does not experience anything close to that 18seconds delay

Output:

Using MPS
Elapsed time: 0.32 seconds with shape: (1000, 384)

After running the model.encode function my naive expectation is that the final Tensor has been calculated by the GPU and no further calculation should be pending.

Which model.encode is this? What I just hinted at (based on how cuda works, again not sure of what metal would do) is that the rust function would mostly return instantaneously until you hit a synchronisation point, at which point you would have to wait for the gpu computations to actually finish. The timings that you would get would not really represent which operations are blocking, for that you may want to use some specialize metal kernel profiler. Again on the cuda side one would use nvsys to get a sense of what the GPU is doing, I don't know what the equivalent for metal is. That wouldn't explain why the model is so much slower than pytorch, but could help pinpoint which kernel is actually problematic here as it may well not be the one copying data to the cpu. Also maybe you want to run with --release if it's not already the case, otherwise cpu optimisations are not turned on.

I apologize, I meant model.forward (not model.encode) where model is an instance of candle_transformers::models::bert::BertModel.

So the logical flow is

Load BertModel to GPU
Tokenize some inputs (CPU task)
add tokenization output to GPU as Tensor (A)
apply BertModel::forward to Tensor A (GPU task, that launches a lot of GPU async ops) -- 18 seconds are spent here
collect output of BertModel Tensor (B) back to GPU (GPU -> CPU transfer)

If i understand what you are saying correctly, step 4 launches a set of tasks asynchronously and when step 5 is called, in order to collect the final tensor all the tasks launched by step 4, have to be completed.

My question is then: assuming we NEED to wait for all the tasks launched by step A in order to get the final tensor, why do these tasks take 18 seconds when pytorch can perform similar tasks on metal in ms?

is there a way to open a different GPU command queue to get the final tensor without waiting for all the async operations from step 4 (only wait for the operations required to calculate the final vector)? right after step 4, if i do Tensor B .shape it returns the correct shape (without waiting 18 seconds), so my naive assumption would be that the calculations required to get the final vector have been completed ... but maybe I am wrong, perhaps the tensor shape is calculated before the tensor values are ...

Getting the tensor shape doesn't communicate with the gpu - it's fully handled on the cpu side, only retrieving its content is actually async. If it's like cuda, we would wait only for the necessary task to complete to get the data out, not all the task - that would not be handled at the candle level but rather within cuda. From here it seems that some kernel is very slow and so we would have to profile what is going on on the gpu side to get a sense of where the time is spent, hence my suggestion to use an equivalent of nvsys. There is a metal performance debugger but I've never tried it. Also did you re-run with --release just in case cpu optimizations actually help here?

yep i ran with --release, same 18seconds delay.

Ok I will (try to) look into trying to profiling operations, by using xcode, following the steps outlined in metal-rs guide

I was able to generate a GPU trace, the trace is too large (+1G, compressed 125 MB) so if interested you can find it here. Not sure where to go from here ...

Findings from GPU trace

the amount of memory used is very larged 42GB especially for such a tiny task

the debugger reports 416 insights, most of them being LateMTLBufferCreation with a few Late MTLComputePipelineState creation and Late MTLLibrary creation
When I tried to get the Performance, my computer freezes and eventually runs out of memory so I could not get a timeline, maybe someone with a better machine can run the performance on the trace

How to reproduce on Mac only

I ran the code below to generate a Metal GPU trace (I followed the step outline here to run the executable in XCode).

note:

the root of the XCode project is the root of the candle repo
Xcode command args build --example bert-metal --release --features metal
I added metal to by adding metal = { workspace = true } in the dependencies in candle-examples/Cargo.toml
don 't forget to set the output path for the gputrace

// candle-examples/examples/bert-metal/main.rs
use candle_transformers::models::bert::{BertModel, Config, DTYPE};

use anyhow::{Error as E, Result};
use candle::{Device, Tensor};
use candle_nn::VarBuilder;
use hf_hub::{api::sync::Api, Repo, RepoType};
use metal::{self, CaptureScope};
use tokenizers::{PaddingParams, Tokenizer};

fn build_model_and_tokenizer() -> Result<(BertModel, Tokenizer)> {
    let device = candle_examples::device(false)?;
    let repo = Repo::with_revision(
        "sentence-transformers/all-MiniLM-L6-v2".to_string(),
        RepoType::Model,
        "refs/pr/21".to_string(),
    );
    let (config_filename, tokenizer_filename, weights_filename) = {
        let api = Api::new()?.repo(repo);
        let config = api.get("config.json")?;
        let tokenizer = api.get("tokenizer.json")?;
        let weights = api.get("model.safetensors")?;
        (config, tokenizer, weights)
    };
    let config = std::fs::read_to_string(config_filename)?;
    let config: Config = serde_json::from_str(&config)?;
    let tokenizer = Tokenizer::from_file(tokenizer_filename).map_err(E::msg)?;
    let vb = unsafe { VarBuilder::from_mmaped_safetensors(&[weights_filename], DTYPE, &device)? };
    let model = BertModel::load(vb, &config)?;
    Ok((model, tokenizer))
}

fn start_metal_debug() -> CaptureScope {
    let capture_scope = metal::CaptureManager::shared()
    .new_capture_scope_with_device(&metal::Device::system_default().unwrap());

    let capture_descriptor = metal::CaptureDescriptor::new();
    capture_descriptor.set_capture_scope(&capture_scope);
    capture_descriptor.set_output_url(std::path::Path::new(
    // you might need to update this value
    // make sure it does not already exist
    // <SET_OUTPUT_PATH_HERE>
    // example: `"~/Users/<YOU>/temp/framecapture.gputrace",
    ));
    capture_descriptor.set_destination(metal::MTLCaptureDestination::GpuTraceDocument);

    metal::CaptureManager::shared().start_capture(&capture_descriptor).unwrap();
    capture_scope.begin_scope();
    capture_scope
}

fn end_metal_debug(capture_scope: CaptureScope) {
    capture_scope.end_scope()
}

fn main() -> Result<()> {
    let scope = start_metal_debug();
    let (model, mut tokenizer) = build_model_and_tokenizer()?;
    let device = &model.device;
    let sentence_count = 1000;
    let pp = PaddingParams {
        strategy: tokenizers::PaddingStrategy::Fixed(128),
        ..Default::default()
    };
    tokenizer.with_padding(Some(pp));
    let sentences: Vec<&str> =
        vec!["This is a sample sentence that should be short enough"; sentence_count];
    let tokens = tokenizer.encode_batch(sentences, true).map_err(E::msg)?;
    let mut token_ids = Vec::with_capacity(tokens.len());
    for encoding in tokens {
        token_ids.push(encoding.get_ids().to_vec());
    }
    let token_ids = Tensor::new(token_ids, device)?;
    let token_type_ids = token_ids.zeros_like()?;

    println!("running inference on batch {:?}", token_ids.shape());
    let start = std::time::Instant::now();
    let embeddings = model.forward(&token_ids, &token_type_ids)?;
    println!("model forward timing {:?}", start.elapsed());

    let start = std::time::Instant::now();
    let old_device = embeddings.device();
    let embeddings = embeddings.to_device(&Device::Cpu)?;
    println!(
        "moved embedding with shape {:?} from {:?} back to cpu in {:?}",
        embeddings.shape(),
        old_device,
        start.elapsed()
    );
    end_metal_debug(scope);
    Ok(())
}

huggingface / candle