huggingface / candle

Minimalist ML framework for Rust

broadcast_add is very slow #2499

Open okpatil4u opened 1 month ago

okpatil4u commented 1 month ago

Rust code

use candle_core::Tensor;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // (32, 630, 12, 32) broadcast-added with (32, 1, 1, 32): dims 1 and 2 of `b` are broadcast.
    let a = Tensor::rand(0f32, 1.0, (32, 630, 12, 32), &candle_core::Device::Cpu)?;
    let b = Tensor::rand(0f32, 1.0, (32, 1, 1, 32), &candle_core::Device::Cpu)?;

    let start = std::time::Instant::now();
    for _ in 0..100 {
        // The result is discarded; only the cost of the op itself is measured.
        let _ = a.broadcast_add(&b);
    }
    println!("broadcast add  : {:?}", std::time::Instant::now() - start);

    Ok(())
}

broadcast add  : 3.317046333s

Python Code

import torch
import time

a = torch.rand(32,630,12,32)
c = torch.rand(32, 1, 1, 32)

start = time.time()
for i in range(100):
    out = a+c
print("broadcast add : ",time.time() - start)

broadcast add :  0.06077408790588379

The Python version is around 55 times faster for a simple addition. Is there a path forward to solve this problem?
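One way to narrow down where the time goes is to hoist the broadcast out of the loop: materialize `b` at `a`'s shape once, then time a plain same-shape add. This is only a sketch, assuming candle's `broadcast_as`, `contiguous` and `add` APIs behave as in recent versions; it is not a verified fix, just a way to separate the cost of the strided broadcast iteration from the cost of the addition itself.

use candle_core::{Device, Tensor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let a = Tensor::rand(0f32, 1.0, (32, 630, 12, 32), &Device::Cpu)?;
    let b = Tensor::rand(0f32, 1.0, (32, 1, 1, 32), &Device::Cpu)?;

    // Expand `b` to `a`'s shape once and make it contiguous, so the loop
    // below measures a plain elementwise add with no broadcasting involved.
    let b_full = b.broadcast_as((32, 630, 12, 32))?.contiguous()?;

    let start = std::time::Instant::now();
    for _ in 0..100 {
        let _ = a.add(&b_full)?;
    }
    println!("pre-broadcast add: {:?}", start.elapsed());

    Ok(())
}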

xiaoniaoyouhuajiang commented 4 weeks ago

I ran a modified version of the code (with the loop count reduced to 10), together with the Cargo.toml profile settings and the flamegraph tool, to obtain the flame graph below. A significant portion of the time appears to be iterator overhead in the binary_map function, specifically alloc::vec::spec_from_iter::SpecFromIter<T,I>::from_iter.

use candle_core::{Result, Tensor};

fn main() -> Result<()> {
    let a = Tensor::rand(0f32, 1.0, (32, 630, 12, 32), &candle_core::Device::Cpu)?;
    let b = Tensor::rand(0f32, 1.0, (32, 1, 1, 32), &candle_core::Device::Cpu)?;

    let start = std::time::Instant::now();
    for _ in 0..10 {
        let _ = a.broadcast_add(&b);
    }
    println!("broadcast add  : {:?}", std::time::Instant::now() - start);

    Ok(())
}

broadcast add : 441.14726ms

[flame graph image; most samples fall under binary_map's iterator path, SpecFromIter::from_iter]
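To make the flame-graph observation concrete, here is a simplified, hypothetical sketch in plain Rust (not candle's actual binary_map) of the difference between a contiguous fast path and a strided broadcast path that computes one multi-dimensional index per output element and builds the result through an iterator collect; the latter is roughly the kind of work SpecFromIter::from_iter shows up for.

// Hypothetical illustration only, not candle's code: a contiguous fast path
// versus a slow path that re-linearizes one strided index per output element
// and builds the result via an iterator collect().
fn add_contiguous(lhs: &[f32], rhs: &[f32]) -> Vec<f32> {
    // Fast path: both operands are contiguous and the same length.
    lhs.iter().zip(rhs).map(|(a, b)| a + b).collect()
}

fn add_broadcast_strided(
    lhs: &[f32],
    rhs: &[f32],
    dims: &[usize],
    rhs_strides: &[usize], // broadcast dimensions carry stride 0
) -> Vec<f32> {
    let numel: usize = dims.iter().product();
    (0..numel)
        .map(|i| {
            // Decompose the linear output index into per-dimension coordinates
            // and re-linearize it with the rhs strides (0 for broadcast dims).
            let mut rem = i;
            let mut rhs_idx = 0usize;
            for (&dim, &stride) in dims.iter().zip(rhs_strides).rev() {
                rhs_idx += (rem % dim) * stride;
                rem /= dim;
            }
            lhs[i] + rhs[rhs_idx]
        })
        .collect()
}

fn main() {
    // Same element counts as the issue: lhs (32, 630, 12, 32), rhs (32, 1, 1, 32).
    let dims = [32usize, 630, 12, 32];
    let numel: usize = dims.iter().product();
    let lhs = vec![1.0f32; numel];
    let rhs_full = vec![2.0f32; numel];    // rhs already expanded to lhs's shape
    let rhs_small = vec![2.0f32; 32 * 32]; // actual rhs storage for (32, 1, 1, 32)
    let rhs_strides = [32usize, 0, 0, 1];  // strides of rhs viewed at the lhs shape

    let t = std::time::Instant::now();
    let fast = add_contiguous(&lhs, &rhs_full);
    println!("contiguous path: {:?}", t.elapsed());

    let t = std::time::Instant::now();
    let slow = add_broadcast_strided(&lhs, &rhs_small, &dims, &rhs_strides);
    println!("strided path:    {:?}", t.elapsed());

    assert_eq!(fast, slow);
}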

xiaoniaoyouhuajiang commented 4 weeks ago

I modified the loop count of broadcast_add in the code to 10 and tested both optimized and unoptimized builds, comparing the results with the performance of the Python code.

Experimental conditions (my PC):

OS:                   Ubuntu 20.04
Architecture:         x86_64
CPU op-mode(s):       32-bit, 64-bit
Address sizes:        43 bits physical, 48 bits virtual
Byte Order:           Little Endian
CPU(s):               12
On-line CPU(s) list:  0-11
Vendor ID:            AuthenticAMD
Model name:           AMD Ryzen 5 3600 6-Core Processor
CPU family:           23
Model:                113
Thread(s) per core:   2
Core(s) per socket:   6
Socket(s):            1
Stepping:             0
Frequency boost:      enabled

Compiler settings (Cargo.toml):

[profile.release]
opt-level = 2
lto = false
debug = true
panic = 'abort'

And these are my results:

optimized: broadcast add (10x loop): 447.87228ms
optimized: broadcast add (100x): 3.983712414s
Python code: broadcast add (10x): 0.04471111297607422s (44.71111297607422ms)
Python code: broadcast add (100x): 0.4770965576171875s (477.0965576171875ms)

So the Python code is roughly 10x faster than Candle here.

I used strace to trace the system calls of both implementations and found that the Python version of the broadcast appears to use multi-threading to distribute the tensor operation (48 threads in this case). I am not sure whether this mechanism is what accounts for the difference in efficiency. In contrast, the Rust code in this case used only one thread / logical core. I therefore tested with Rayon whether multi-threading could improve the code's efficiency in this scenario.


strace --follow-forks --summary-only python3 broadcast_add.py

strace: Process 66105 attached
strace: Process 66106 attached
strace: Process 66107 attached
strace: Process 66108 attached
strace: Process 66109 attached
strace: Process 66110 attached
strace: Process 66111 attached
strace: Process 66112 attached
strace: Process 66113 attached
strace: Process 66114 attached
...
// modified code
use std::sync::Arc;
use std::time::Instant;

use candle_core::{Result, Tensor};
use rayon::prelude::*;

fn main() -> Result<()> {
    let a = Arc::new(Tensor::rand(0f32, 1.0, (32, 630, 12, 32), &candle_core::Device::Cpu)?);
    let b = Arc::new(Tensor::rand(0f32, 1.0, (32, 1, 1, 32), &candle_core::Device::Cpu)?);

    let start = Instant::now();
    // Coarse-grained parallelism: the 100 independent broadcast_add calls are
    // spread over the Rayon thread pool; each call itself stays single-threaded.
    (0..100).into_par_iter().for_each(|_| {
        let a_clone = a.clone();
        let b_clone = b.clone();
        let _ = a_clone.broadcast_add(&b_clone);
    });
    println!("broadcast add with Rayon: {:?}", Instant::now() - start);

    Ok(())
}

Rayon version: broadcast add with Rayon (100x): 781.284023ms

However, such coarse-grained parallelism isn't practical. Could you tell me whether there is an API or another approach that would let a Candle Tensor reach the same efficiency as Torch in this scenario, and how Torch implements tensor.broadcast_add internally? Thank you very much. @LaurentMazare
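For comparison, what PyTorch-style CPU backends typically do is intra-op parallelism: a single elementwise kernel splits its output buffer across worker threads (roughly what torch.get_num_threads() reports), rather than running whole independent ops in parallel as in the Rayon loop above. Below is a rough, hypothetical sketch of that idea using rayon; it is not how either torch or candle is actually implemented, and the broadcast indexing from the earlier sketch would slot into the per-chunk inner loop in the same way.

use rayon::prelude::*;

// Hypothetical sketch of intra-op parallelism: a single elementwise add is
// split across threads by chunking the output buffer, so one op uses all
// cores instead of many independent ops running in parallel.
fn par_add(lhs: &[f32], rhs: &[f32]) -> Vec<f32> {
    assert_eq!(lhs.len(), rhs.len());
    let mut out = vec![0.0f32; lhs.len()];
    const CHUNK: usize = 64 * 1024;

    // Each worker thread receives disjoint slices of out, lhs and rhs.
    out.par_chunks_mut(CHUNK)
        .zip(lhs.par_chunks(CHUNK))
        .zip(rhs.par_chunks(CHUNK))
        .for_each(|((o, l), r)| {
            for ((dst, a), b) in o.iter_mut().zip(l).zip(r) {
                *dst = a + b;
            }
        });
    out
}

fn main() {
    let n = 32 * 630 * 12 * 32; // same element count as the tensors above
    let lhs = vec![1.0f32; n];
    let rhs = vec![2.0f32; n];

    let start = std::time::Instant::now();
    let out = par_add(&lhs, &rhs);
    println!("intra-op parallel add: {:?}", start.elapsed());
    assert!(out.iter().all(|&v| v == 3.0));
}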