huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0

Metal memory leak multiplying matrices #2271

Open ealmloff opened 2 weeks ago

ealmloff commented 2 weeks ago

Running this code, which multiplies a 784x100 matrix by a 100x10 matrix, seems to leak memory. Memory usage gradually increases to more than 5 GB when running with the metal feature enabled in release mode on commit 2b10aaa:

use anyhow::Result;
use candle_core::{
    backend::BackendDevice,
    Device, MetalDevice, Tensor,
};

fn main() -> Result<()> {
    let device = Device::Metal(MetalDevice::new(0)?);
    let first = Tensor::randn(0f32, 1.0, (784, 100), &device)?;
    let second = Tensor::randn(0f32, 1.0, (100, 10), &device)?;
    loop {
        first.matmul(&second)?;
    }
}

On the CPU, memory usage stays steady at ~2 MB. This also seems to affect quantized matrix multiplication.

In the full code this is minified from, I see memory usage increase from 15 GB to >100 GB when feeding the same-sized input into a BERT model many times in a row.

LaurentMazare commented 1 week ago

Thanks for reporting this and providing this short repro - I can reproduce the issue on my M2. I'll have a more in-depth look (though it will most likely have to wait for a week or two).

LaurentMazare commented 1 week ago

Looks like the issue is related to the autorelease pool not releasing memory fast enough. I'm not very familiar with this, but the following code seems to keep the memory usage under control (it might still be drifting, but a lot slower than before). Will have to think a bit more about how to handle this in candle.

use anyhow::Result;
use candle_core::{backend::BackendDevice, Device, MetalDevice, Tensor};

fn main() -> Result<()> {
    let device = Device::Metal(MetalDevice::new(0)?);
    let first = Tensor::randn(0f32, 1.0, (784, 100), &device)?;
    let second = Tensor::randn(0f32, 1.0, (100, 10), &device)?;
    for _ in 0.. {
        objc::rc::autoreleasepool(|| {
            first.matmul(&second).unwrap();
        })
    }
    Ok(())
}

[edit] slightly better example where we see the memory drifting then getting back to the proper range once the autorelease pool exits.

    loop {
        println!("here");
        objc::rc::autoreleasepool(|| {
            for _ in 0..1_000_000 {
                first.matmul(&second).unwrap();
            }
        })
    }

ealmloff commented 1 week ago

Thanks for the workaround. I can confirm adding an autoreleasepool for batches in my BERT code does fix the memory leak. Running my workload overnight, I don't see any meaningful memory usage increase.