huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0

question: what GPU can run the mixtral example? #1733

Open zwpaper opened 9 months ago

zwpaper commented 9 months ago

I found the mixtral example in this repo and tried to run it on an A100 80GB, but the default Mixtral-8x7B-v0.1 runs out of memory.

I was curious: what GPU can run it on a single card?

okpatil4u commented 9 months ago

I was able to run it on an M1 Max Mac (64GB RAM) laptop. 80GB should be more than sufficient to run it.

Did you run the quantized version?

LaurentMazare commented 9 months ago

I don't think 80GB is enough to run the non-quantized version: it has ~56B weights, so in bfloat16 these require more than 100GB of memory and won't fit on a single GPU. The quantized versions should be fine, although they don't use the GPU on cuda at the moment.
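
As a back-of-envelope check (a rough sketch, assuming ~56B parameters at 2 bytes each for bfloat16):

    // Rough weight-memory estimate for Mixtral-8x7B (assumed ~56B parameters).
    fn main() {
        let params = 56e9_f64;
        let bytes_per_param = 2.0; // bfloat16
        let gib = params * bytes_per_param / (1024.0 * 1024.0 * 1024.0);
        println!("~{gib:.0} GiB for the weights alone"); // ~104 GiB, before activations / KV cache
    }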

groovybits commented 9 months ago

Ah, this runs my 192GB M2 Ultra out of RAM :/ It seems crazy, loading 100+ GB until it breaks for me. How do we use the quantized version exactly? I didn't realize it was running the full model, but I was starting to suspect that. This sounds hopeful; I was worried it was broken, and I really love Mixtral :)

Update: I see the quantized example, which seems to handle all of the models quantized. Is that the path to take, with the mixtral example only being useful for running the full model?

Update: it fails to run quantized with mixtral, how do we do that? I get this odd error on Metal...

chris@earth candle % cargo run --example quantized --release --features metal -- --prompt 'how are you?' --model /Volumes/BrahmaSSD/LLM/models/GGUF/mixtral-8x7b-v0.1.Q5_0.gguf --which mixtral
    Finished release [optimized] target(s) in 0.15s
     Running `target/release/examples/quantized --prompt 'how are you?' --model /Volumes/BrahmaSSD/LLM/models/GGUF/mixtral-8x7b-v0.1.Q5_0.gguf --which mixtral`
avx: false, neon: true, simd128: false, f16c: false
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
loaded 995 tensors (32.23GB) in 0.07s
zsh: segmentation fault  cargo run --example quantized --release --features metal -- --prompt  --model
chris@earth candle % cargo run --example quantized --release --features metal -- --prompt 'how are you?'  --which mixtral
    Finished release [optimized] target(s) in 0.14s
     Running `target/release/examples/quantized --prompt 'how are you?' --which mixtral`
avx: false, neon: true, simd128: false, f16c: false
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
loaded 995 tensors (26.44GB) in 0.07s
model built
Error: device mismatch in matmul, lhs: Metal { gpu_id: 4294968481 }, rhs: Cpu
how are you?%
chris@earth candle %

Yet it works on CPU, it seems, without the Metal build, is that expected? It seems fast, but it is hitting my CPU instead of the GPU, unlike what llama.cpp does with mixtral. Also, how would one get the Dolphin version running, is that hard? It is really good at chat.

okpatil4u commented 9 months ago

It's in the quantized folder inside examples. Just choose mixtral as the model in the command line and it should work.

groovybits commented 9 months ago

It seems to have this issue on Metal?

With Metal failing

cargo run --example quantized --release --features metal -- --prompt 'how are you?'  --which mixtral-instruct --gqa 8 -n 300
    Finished release [optimized] target(s) in 0.22s
     Running `target/release/examples/quantized --prompt 'how are you?' --which mixtral-instruct --gqa 8 -n 300`
avx: false, neon: true, simd128: false, f16c: false
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
loaded 995 tensors (26.44GB) in 0.07s
model built
Error: device mismatch in matmul, lhs: Metal { gpu_id: 4294968481 }, rhs: Cpu
how are you?%

Working without Metal

chris@earth candle % cargo run --example quantized --release -- --prompt 'how are you?'  --which mixtral-instruct --gqa 8 -n 300
    Finished release [optimized] target(s) in 0.16s
     Running `target/release/examples/quantized --prompt 'how are you?' --which mixtral-instruct --gqa 8 -n 300`
avx: false, neon: true, simd128: false, f16c: false
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
Running on CPU, to run on GPU(metal), build this example with `--features metal`
loaded 995 tensors (26.44GB) in 0.06s
model built
how are you?

i am fine, thank you for asking.

The Error: device mismatch in matmul, lhs: Metal { gpu_id: 4294968481 }, rhs: Cpu

It seems to happen consistently for me when using Metal with any of the mixtral models, yet the mistral models work great for me on Metal.

LaurentMazare commented 9 months ago

Could you try running this with RUST_BACKTRACE=1? That would help pinpoint where the issue actually is.

cloneable commented 8 months ago

I tried the unquantized Mixtral example on my M3 Max 128GB, but also ran out of memory. MLX's Mixtral example works fine. Maybe candle allocates buffers twice (mmap + GPU) because Metal's new_buffer_with_bytes_no_copy is not used?

EDIT: Never mind. The Mixtral example converts BF16 to F32 for Metal, so it needs ~180GB of memory then:

    let dtype = if device.is_cuda() {
        DType::BF16
    } else {
        DType::F32
    };
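
A minimal sketch of what a lower-memory variant could look like (not the example's actual code; it assumes the ops the model uses all support F16 on Metal):

    let dtype = if device.is_cuda() {
        DType::BF16
    } else if device.is_metal() {
        // Hypothetical: keep the weights in 16 bits on Metal as well,
        // roughly halving memory compared to F32.
        DType::F16
    } else {
        DType::F32
    };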

kurtbuilds commented 8 months ago

I'm also encountering this. Running the Q4_K_M version of this model: https://huggingface.co/TheBloke/dolphin-2.7-mixtral-8x7b-GGUF/blob/main/dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf

Full stack trace:

thread 'main' panicked at inference_demo/src/main.rs:178:11: called Result::unwrap() on an Err value: device mismatch in matmul, lhs: Metal { gpu_id: 4294969767 }, rhs: Cpu
   0: std::backtrace_rs::backtrace::libunwind::trace
         at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
   1: std::backtrace_rs::backtrace::trace_unsynchronized
         at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2: std::backtrace::Backtrace::create
         at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/backtrace.rs:331:13
   3: candle_core::error::Error::bt
         at /Users/kurt/.cargo/registry/src/index.crates.io-6f17d22bba15001f/candle-core-0.4.1/src/error.rs:227:25
   4: candle_core::storage::Storage::same_device
         at /Users/kurt/.cargo/registry/src/index.crates.io-6f17d22bba15001f/candle-core-0.4.1/src/storage.rs:49:17
   5: candle_core::storage::Storage::matmul
         at /Users/kurt/.cargo/registry/src/index.crates.io-6f17d22bba15001f/candle-core-0.4.1/src/storage.rs:655:9
   6: candle_core::tensor::Tensor::matmul
         at /Users/kurt/.cargo/registry/src/index.crates.io-6f17d22bba15001f/candle-core-0.4.1/src/tensor.rs:1169:23
   7: ::forward
         at /Users/kurt/.cargo/registry/src/index.crates.io-6f17d22bba15001f/candle-core-0.4.1/src/quantized/mod.rs:487:17
   8: candle_transformers::models::quantized_llama::QMatMul::forward
         at /Users/kurt/.cargo/registry/src/index.crates.io-6f17d22bba15001f/candle-transformers-0.4.1/src/models/quantized_llama.rs:46:9
   9: ::forward
         at /Users/kurt/.cargo/registry/src/index.crates.io-6f17d22bba15001f/candle-transformers-0.4.1/src/models/quantized_llama.rs:86:37
  10: candle_transformers::models::quantized_llama::ModelWeights::forward
         at /Users/kurt/.cargo/registry/src/index.crates.io-6f17d22bba15001f/candle-transformers-0.4.1/src/models/quantized_llama.rs:510:21
  11: inference_demo::TextGeneration::run
         at ./src/main.rs:55:26
  12: inference_demo::main
         at ./src/main.rs:156:5

LaurentMazare commented 8 months ago

That's odd, I just ran the model you mentioned on CUDA without any issue (and CUDA / Metal should have similar behavior when it comes to such errors). Is it possible that your code in inference_demo is not moving the input tensors to the Metal device and keeps them on the CPU?

kurtbuilds commented 8 months ago

How would I check that?

LaurentMazare commented 8 months ago

There is no implicit device in candle, so you explicitly set the device each time you create a tensor, or the device is inferred from the arguments in the case of operations on tensors. You would want to check that the tensor inputs are appropriately sent to your device, as is done here for example. You would also want to check that you're loading the model on the same device. My guess is that somewhere in your code you may have a hardcoded Device::Cpu. You can also print the device of your tensors so that it's easier to narrow down where this is coming from.
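
For illustration, a minimal sketch (hypothetical helper, not taken from inference_demo) of building the prompt tensor on the same device as the model and printing its device to narrow things down:

    use candle_core::{Device, Result, Tensor};

    // Hypothetical helper: create the prompt tensor directly on the target device
    // (the one the model was loaded on) rather than on Device::Cpu.
    fn build_input(tokens: &[u32], device: &Device) -> Result<Tensor> {
        let input = Tensor::new(tokens, device)?.unsqueeze(0)?; // add a batch dimension
        println!("input device: {:?}", input.device()); // quick way to spot a mismatch
        Ok(input)
    }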

somethingelseentirely commented 5 months ago

The hardcoded Device::Cpu was actually in the quantized matmul implementation of candle-core v0.4.1, and it is fixed in the latest release, so at least that one is solved 😄