Open aorticweb opened 4 months ago
I'm not a specialist of the metal api but I suspect it's asynchronous so candle schedule commands that run in the background on the gpu and it's blocking when candle tries to access the results and move them back to cpu. So your timings will say that most of the time is spent in to_device
but actually this is measuring the whole computation.
That's how things work in the cuda world and tools like nvsys
allow you see what is actually running on the gpu. I imagine such tools also exist for metal but I wouldn't know of them (in the cuda world one can also set CUDA_LAUNCH_BLOCKING
to get all the calls to be blocking but this has a significant impact on performance).
I am wondering is what are the pending GPU commands are, are they needed for the purpose of extracting the final tensor from the GPU to the CPU (only the dimension we need)? could we ignore these operations or are they part of the work done by model.encode in which case why don't model.encode wait for the GPU operations to complete?
After running the model.encode function my naive expectation is that the final Tensor has been calculated by the GPU and no further calculation should be pending.
This seems to me like a candle specific issue since when running the same model in pytorch (on the same hardware):
from typing import List, Optional
import time
import torch
from sentence_transformers import SentenceTransformer
def best_device():
"""In the following order mps (m1 macs), cuda, cpu (default)
return the first available device detected
Returns:
torch.device: the detected torch device
"""
if torch.backends.mps.is_available() and torch.backends.mps.is_built():
print("Using MPS")
return torch.device("mps")
if torch.cuda.is_available():
return torch.device("cuda")
return torch.device("cpu")
class SentenceTransformerVectorizer:
__slots__ = "_model", "_device"
_model: SentenceTransformer
_device: torch.device
def __init__(self, model_path: str, device: Optional[torch.device] = None):
if device is None:
device = best_device()
self._device = device
self._model = SentenceTransformer(model_path, device=device)
self._model.eval() # make sure we're in inference mode, not training
def vectorize(self, batch: List[str]):
return self._model.encode(
batch, batch_size=min(1000, len(batch)), device=self._device
)
texts = ["this is some random text x 1000"] * 1_000
vectorizer = SentenceTransformerVectorizer("all-MiniLM-L6-v2", best_device())
start = time.monotonic()
output = vectorizer.vectorize(texts)
elapsed = time.monotonic() - start
print(f"Elapsed time: {elapsed:.2f} seconds with shape: {output.shape}")
the embeddings does not experience anything close to that 18seconds delay
Output:
Using MPS
Elapsed time: 0.32 seconds with shape: (1000, 384)
After running the model.encode function my naive expectation is that the final Tensor has been calculated by the GPU and no further calculation should be pending.
Which model.encode
is this? What I just hinted at (based on how cuda works, again not sure of what metal would do) is that the rust function would mostly return instantaneously until you hit a synchronisation point, at which point you would have to wait for the gpu computations to actually finish. The timings that you would get would not really represent which operations are blocking, for that you may want to use some specialize metal kernel profiler. Again on the cuda side one would use nvsys to get a sense of what the GPU is doing, I don't know what the equivalent for metal is.
That wouldn't explain why the model is so much slower than pytorch, but could help pinpoint which kernel is actually problematic here as it may well not be the one copying data to the cpu.
Also maybe you want to run with --release
if it's not already the case, otherwise cpu optimisations are not turned on.
I apologize, I meant model.forward
(not model.encode) where model is an instance of candle_transformers::models::bert::BertModel
.
So the logical flow is
If i understand what you are saying correctly, step 4 launches a set of tasks asynchronously and when step 5 is called, in order to collect the final tensor all the tasks launched by step 4, have to be completed.
My question is then: assuming we NEED to wait for all the tasks launched by step A in order to get the final tensor, why do these tasks take 18 seconds when pytorch can perform similar tasks on metal in ms?
OR
is there a way to open a different GPU command queue to get the final tensor without waiting for all the async operations from step 4 (only wait for the operations required to calculate the final vector)? right after step 4, if i do Tensor B .shape it returns the correct shape (without waiting 18 seconds), so my naive assumption would be that the calculations required to get the final vector have been completed ... but maybe I am wrong, perhaps the tensor shape is calculated before the tensor values are ...
Getting the tensor shape doesn't communicate with the gpu - it's fully handled on the cpu side, only retrieving its content is actually async. If it's like cuda, we would wait only for the necessary task to complete to get the data out, not all the task - that would not be handled at the candle level but rather within cuda.
From here it seems that some kernel is very slow and so we would have to profile what is going on on the gpu side to get a sense of where the time is spent, hence my suggestion to use an equivalent of nvsys. There is a metal performance debugger but I've never tried it.
Also did you re-run with --release
just in case cpu optimizations actually help here?
yep i ran with --release, same 18seconds delay.
Ok I will (try to) look into trying to profiling operations, by using xcode, following the steps outlined in metal-rs guide
I was able to generate a GPU trace, the trace is too large (+1G, compressed 125 MB) so if interested you can find it here. Not sure where to go from here ...
the debugger reports 416 insights, most of them being LateMTLBufferCreation
with a few Late MTLComputePipelineState creation
and Late MTLLibrary creation
When I tried to get the Performance, my computer freezes and eventually runs out of memory so I could not get a timeline, maybe someone with a better machine can run the performance on the trace
I ran the code below to generate a Metal GPU trace (I followed the step outline here to run the executable in XCode).
note:
build --example bert-metal --release --features metal
metal = { workspace = true }
in the dependencies in candle-examples/Cargo.toml// candle-examples/examples/bert-metal/main.rs
use candle_transformers::models::bert::{BertModel, Config, DTYPE};
use anyhow::{Error as E, Result};
use candle::{Device, Tensor};
use candle_nn::VarBuilder;
use hf_hub::{api::sync::Api, Repo, RepoType};
use metal::{self, CaptureScope};
use tokenizers::{PaddingParams, Tokenizer};
fn build_model_and_tokenizer() -> Result<(BertModel, Tokenizer)> {
let device = candle_examples::device(false)?;
let repo = Repo::with_revision(
"sentence-transformers/all-MiniLM-L6-v2".to_string(),
RepoType::Model,
"refs/pr/21".to_string(),
);
let (config_filename, tokenizer_filename, weights_filename) = {
let api = Api::new()?.repo(repo);
let config = api.get("config.json")?;
let tokenizer = api.get("tokenizer.json")?;
let weights = api.get("model.safetensors")?;
(config, tokenizer, weights)
};
let config = std::fs::read_to_string(config_filename)?;
let config: Config = serde_json::from_str(&config)?;
let tokenizer = Tokenizer::from_file(tokenizer_filename).map_err(E::msg)?;
let vb = unsafe { VarBuilder::from_mmaped_safetensors(&[weights_filename], DTYPE, &device)? };
let model = BertModel::load(vb, &config)?;
Ok((model, tokenizer))
}
fn start_metal_debug() -> CaptureScope {
let capture_scope = metal::CaptureManager::shared()
.new_capture_scope_with_device(&metal::Device::system_default().unwrap());
let capture_descriptor = metal::CaptureDescriptor::new();
capture_descriptor.set_capture_scope(&capture_scope);
capture_descriptor.set_output_url(std::path::Path::new(
// you might need to update this value
// make sure it does not already exist
// <SET_OUTPUT_PATH_HERE>
// example: `"~/Users/<YOU>/temp/framecapture.gputrace",
));
capture_descriptor.set_destination(metal::MTLCaptureDestination::GpuTraceDocument);
metal::CaptureManager::shared().start_capture(&capture_descriptor).unwrap();
capture_scope.begin_scope();
capture_scope
}
fn end_metal_debug(capture_scope: CaptureScope) {
capture_scope.end_scope()
}
fn main() -> Result<()> {
let scope = start_metal_debug();
let (model, mut tokenizer) = build_model_and_tokenizer()?;
let device = &model.device;
let sentence_count = 1000;
let pp = PaddingParams {
strategy: tokenizers::PaddingStrategy::Fixed(128),
..Default::default()
};
tokenizer.with_padding(Some(pp));
let sentences: Vec<&str> =
vec!["This is a sample sentence that should be short enough"; sentence_count];
let tokens = tokenizer.encode_batch(sentences, true).map_err(E::msg)?;
let mut token_ids = Vec::with_capacity(tokens.len());
for encoding in tokens {
token_ids.push(encoding.get_ids().to_vec());
}
let token_ids = Tensor::new(token_ids, device)?;
let token_type_ids = token_ids.zeros_like()?;
println!("running inference on batch {:?}", token_ids.shape());
let start = std::time::Instant::now();
let embeddings = model.forward(&token_ids, &token_type_ids)?;
println!("model forward timing {:?}", start.elapsed());
let start = std::time::Instant::now();
let old_device = embeddings.device();
let embeddings = embeddings.to_device(&Device::Cpu)?;
println!(
"moved embedding with shape {:?} from {:?} back to cpu in {:?}",
embeddings.shape(),
old_device,
start.elapsed()
);
end_metal_debug(scope);
Ok(())
}
When using> the
sentence-transformers/all-MiniLM-L6-v2
on m1 PRO mac (32GB ram) with --features metal on, attempting to generate embeddings for 1000 sentences and attempting to convert the final tensor to a Vec<Vecto_device
function of Tensor seems very slow.I believe to_device is important and must be performant in order to switch from Tensor to Vec and potentially Serialize the output from a model in an application
How to reproduce:
place the following code snipet (below) in candle-examples/examples/bert-metal/main.rs cargo run --example bert-metal --features metal
output
I added a sample move of a manually generated tensor at the end to show that the issue might not be related to the size of the tensor.
Investigation
The issue might be related to the
MetalStorage.to_cpu
, when debugging I added aself.device.wait_until_completed
statement at the top of the function and found that most of the time was spent there, could there be commands waiting to be flushed?Is there a way to generate a new Tensor.storage or Tensor.storage.device that would only contain the [1000, 384] floats for an existing tensor? with a fresh
command_queue
,command_buffer
andbuffers
from an existing Tensor to maybe bypass this issue?