huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0

Tensor::to_scalar very high latency #2239

Open RoggeOhta opened 4 weeks ago

RoggeOhta commented 4 weeks ago

Bug description: When converting a CUDA tensor that was indexed out of another large tensor into a CPU scalar, the conversion takes an unusually long time.

Minimal reproducible example:

use candle_core::{Device, IndexOp, Tensor};
use std::time::Instant;

fn time_it<T>(func: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = func();
    let duration = start.elapsed();
    println!("{:?}", duration);
    result
}

fn main() {
    let cuda_dev = Device::new_cuda(0).unwrap();
    let cpu_dev = Device::Cpu;

    let shape = [10000, 10000];
    let tensor = Tensor::rand(0f32, 1f32, &shape, &cpu_dev).unwrap();
    println!("Tensor on CPU: {:?}", tensor);

    let tensor = time_it(|| tensor.to_device(&cuda_dev).unwrap());
    println!("Tensor on GPU: {:?}", tensor.layout());

    // Index a single element; per the analysis below, this is a view that
    // shares the full 10000x10000 CUDA storage and only records an offset.
    let cell = time_it(|| tensor.i((42, 42)).unwrap());

    // Converting this single-element view to a CPU scalar is where the
    // unexpectedly long time (~300ms, comparable to to_device above) shows up.
    let scalar = time_it(|| cell.to_scalar::<f32>().unwrap());
    dbg!(scalar);
}

With the most recent crate version, v0.5.1, the line let scalar = time_it(|| cell.to_scalar::<f32>().unwrap()); takes essentially the same time as the line let tensor = time_it(|| tensor.to_device(&cuda_dev).unwrap());.

From my investigation, the problem is that candle tensors use a storage + layout approach to represent slices. This is effective most of the time, but it is inefficient when copying a small slice from CUDA memory to CPU memory: the full storage is copied across, and the layout is only applied afterwards to offset into the slice you want. This creates massive latency (~300ms before my attempted fix, ~17µs after).
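
A schematic sketch of that storage + layout point (not candle's actual code; SimpleLayout, SimpleTensor and the two to_scalar_* functions are made-up names used only to illustrate why reading one element of an indexed view can cost as much as moving the whole tensor):

use std::sync::Arc;

// Hypothetical, simplified model of a "storage + layout" tensor representation.
struct SimpleLayout {
    start_offset: usize, // element offset of the view into the shared storage
}

struct SimpleTensor {
    // The backing buffer is shared: indexing (like tensor.i((42, 42)) above)
    // produces a new tensor with the same storage and a different layout.
    storage: Arc<Vec<f32>>, // stands in for the 10000x10000 device buffer
    layout: SimpleLayout,
}

impl SimpleTensor {
    // Naive scalar extraction: bring the *entire* storage to the host, then
    // apply the layout offset. For a 10000x10000 f32 tensor that is a ~400 MB
    // copy just to read one value, which is where the ~300ms goes.
    fn to_scalar_naive(&self) -> f32 {
        let host_copy: Vec<f32> = self.storage.as_ref().clone(); // full copy
        host_copy[self.layout.start_offset]
    }

    // What a fix amounts to conceptually: copy only the range described by the
    // view's layout (here a single element) instead of the whole storage.
    fn to_scalar_narrow(&self) -> f32 {
        self.storage[self.layout.start_offset]
    }
}

fn main() {
    let storage = Arc::new(vec![0.5f32; 10_000 * 10_000]);
    let cell = SimpleTensor {
        storage,
        layout: SimpleLayout { start_offset: 42 * 10_000 + 42 },
    };
    assert_eq!(cell.to_scalar_naive(), cell.to_scalar_narrow());
}

Both functions return the same value; in the real CUDA case the difference is whether roughly 400 MB or 4 bytes crosses the device-to-host copy path.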

LaurentMazare commented 4 weeks ago

When computing the loss, I would expect the value not to be part of a larger tensor but rather to be computed as a sum over some other tensors, so I wouldn't expect the issue you mention to apply there. Instead, I think this is likely caused by CUDA being an asynchronous API, so all the time appears to be spent when retrieving the data rather than when running the ops themselves. Maybe you can try running with CUDA_LAUNCH_BLOCKING=1 on your tinyllama example to see if it still appears that the time is spent retrieving the loss?
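
A minimal sketch of that kind of timing check (the matmul / mean_all pair below only stands in for whatever the tinyllama training step actually does, and the cargo command in the comment is an assumption about how the example is built):

use candle_core::{Device, Tensor};
use std::time::Instant;

// Run with synchronous kernel launches so that each timed span reflects the
// work actually done in it rather than previously queued asynchronous work,
// e.g. (assumed invocation):
//   CUDA_LAUNCH_BLOCKING=1 cargo run --release --features cuda

fn main() -> candle_core::Result<()> {
    let dev = Device::new_cuda(0)?;

    // Stand-in for the real forward/backward pass: some GPU work producing a
    // single-element loss tensor.
    let x = Tensor::rand(0f32, 1f32, (4096, 4096), &dev)?;
    let start = Instant::now();
    let loss = x.matmul(&x)?.mean_all()?;
    println!("ops:       {:?}", start.elapsed());

    // With asynchronous launches, much of the cost of the ops above can show
    // up here instead; with CUDA_LAUNCH_BLOCKING=1 it stays attributed above.
    let start = Instant::now();
    let loss_f32 = loss.to_scalar::<f32>()?;
    println!("to_scalar: {:?} (loss = {loss_f32})", start.elapsed());
    Ok(())
}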

RoggeOhta commented 4 weeks ago

> When computing the loss, I would expect the value not to be part of a larger tensor but rather to be computed as a sum over some other tensors, so I wouldn't expect the issue you mention to apply there. Instead, I think this is likely caused by CUDA being an asynchronous API, so all the time appears to be spent when retrieving the data rather than when running the ops themselves. Maybe you can try running with CUDA_LAUNCH_BLOCKING=1 on your tinyllama example to see if it still appears that the time is spent retrieving the loss?

I think this is still an issue because the behaviour is counterintuitive. I want to convert a single-value tensor (from my point of view it is just a single value, and I may not know where that tensor originally came from), but in practice it copies all of the underlying storage to the CPU and then combines it with the layout to give me the answer. This is inefficient and causes real performance trouble.

LaurentMazare commented 4 weeks ago

I agree, but I think that might be a separate issue from what you describe on the loss; I'm just wondering whether the slowness on the loss has the same origin or a different cause.

RoggeOhta commented 4 weeks ago

> I agree, but I think that might be a separate issue from what you describe on the loss; I'm just wondering whether the slowness on the loss has the same origin or a different cause.

Oh I see, my description may have looked a bit hazy to others. I mentioned the loss because I found this issue while investigating why my training loop was so slow; it came down to the point where I convert the loss to an f32. Converting that single-value tensor to f32 was unbelievably taking ~300ms. I then spent a few days reading the candle source code and produced a fix.

LaurentMazare commented 4 weeks ago

Gotcha, thanks. I'll have a look at the PR.

RoggeOhta commented 2 weeks ago

Any updates?