Open zwpaper opened 9 months ago
I was able to run it on an M1 Max Mac (64GB RAM) laptop, so 80GB should be more than sufficient to run it.
Did you run the quantized version?
I found the mixtral example in this repo and tried to run it on an A100 80GB, but the default Mixtral-8x7B-v0.1 runs out of memory.
I was curious which GPU can run it on a single card?
I don't think 80GB is enough to run the non-quantized version: it has ~56B weights, so even in bfloat16 these require more than 100GB of memory and won't fit on a single GPU. The quantized versions should be fine, although they don't use the GPU on cuda at the moment.
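The arithmetic behind that estimate is quick to check (a sketch using the rough ~56B figure from this comment; the true total is somewhat lower because attention layers are shared across experts):

```rust
fn main() {
    // Back-of-the-envelope memory footprint for the weights alone,
    // ignoring activations and KV cache.
    let params: f64 = 56e9; // ~56B weights (8 experts x 7B, as quoted above)
    let bf16_gb = params * 2.0 / 1e9; // bf16/f16: 2 bytes per weight
    let f32_gb = params * 4.0 / 1e9; // f32: 4 bytes per weight
    println!("bf16: ~{bf16_gb:.0} GB, f32: ~{f32_gb:.0} GB");
    // bf16 alone is ~112 GB, well past an A100's 80 GB.
}
```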
Ah, this runs my 192GB M2 Ultra out of RAM :/ It keeps loading past 100GB until it breaks for me. How do we use the quantized version exactly? I didn't realize it was running the full model, but I had started suspecting that. This sounds hopeful; I was worried it was broken, and I really love Mixtral :)
Update: I see the quantized example, which seems to cover all the quantized models. Is that the path, with the mixtral example only useful for full-model usage?
Update: it fails to run quantized mixtral. How do we do that? I get this odd error on metal...
chris@earth candle % cargo run --example quantized --release --features metal -- --prompt 'how are you?' --model /Volumes/BrahmaSSD/LLM/models/GGUF/mixtral-8x7b-v0.1.Q5_0.gguf --which mixtral
Finished release [optimized] target(s) in 0.15s
Running `target/release/examples/quantized --prompt 'how are you?' --model /Volumes/BrahmaSSD/LLM/models/GGUF/mixtral-8x7b-v0.1.Q5_0.gguf --which mixtral`
avx: false, neon: true, simd128: false, f16c: false
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
loaded 995 tensors (32.23GB) in 0.07s
zsh: segmentation fault cargo run --example quantized --release --features metal -- --prompt --model
chris@earth candle % cargo run --example quantized --release --features metal -- --prompt 'how are you?' --which mixtral
Finished release [optimized] target(s) in 0.14s
Running `target/release/examples/quantized --prompt 'how are you?' --which mixtral`
avx: false, neon: true, simd128: false, f16c: false
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
loaded 995 tensors (26.44GB) in 0.07s
model built
Error: device mismatch in matmul, lhs: Metal { gpu_id: 4294968481 }, rhs: Cpu
how are you?%
chris@earth candle %
Yet it seems to work on CPU without the metal build. Is that expected? It's fast, but it's hitting my CPU instead of my GPU, unlike llama.cpp with mixtral. Also, how would one get the Dolphin version running? Is that hard? It is really good at chat.
It's in the quantized folder inside examples. Just choose mixtral as the model on the command line and it should work.
It seems to have this issue on Metal?
With Metal failing
cargo run --example quantized --release --features metal -- --prompt 'how are you?' --which mixtral-instruct --gqa 8 -n 300
Finished release [optimized] target(s) in 0.22s
Running `target/release/examples/quantized --prompt 'how are you?' --which mixtral-instruct --gqa 8 -n 300`
avx: false, neon: true, simd128: false, f16c: false
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
loaded 995 tensors (26.44GB) in 0.07s
model built
Error: device mismatch in matmul, lhs: Metal { gpu_id: 4294968481 }, rhs: Cpu
how are you?%
Working without Metal
chris@earth candle % cargo run --example quantized --release -- --prompt 'how are you?' --which mixtral-instruct --gqa 8 -n 300
Finished release [optimized] target(s) in 0.16s
Running `target/release/examples/quantized --prompt 'how are you?' --which mixtral-instruct --gqa 8 -n 300`
avx: false, neon: true, simd128: false, f16c: false
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
Running on CPU, to run on GPU(metal), build this example with `--features metal`
loaded 995 tensors (26.44GB) in 0.06s
model built
how are you?
i am fine, thank you for asking.
The Error: device mismatch in matmul, lhs: Metal { gpu_id: 4294968481 }, rhs: Cpu
seems to happen consistently for me when using Metal with any of the mixtral models, yet the mistral models work great for me on Metal.
Could you try running this with RUST_BACKTRACE=1? That would help pinpoint where the issue actually is.
I tried the unquantized Mixtral example on my M3 Max 128G, but it also ran out of memory. MLX's Mixtral example works fine. Maybe candle allocates buffers twice (mmap + gpu) because Metal's new_buffer_with_bytes_no_copy is not used?
EDIT: Nevermind. The Mixtral example converts BF16 to F32 for Metal, so it needs about 180G of memory:
let dtype = if device.is_cuda() {
DType::BF16
} else {
DType::F32
};
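One could imagine keeping a 2-byte dtype on Metal instead of widening to F32, assuming the ops Mixtral needs have F16 Metal kernels. Here is a self-contained sketch of that selection logic, with mock enums standing in for candle's Device and DType (this is an illustration, not candle's actual API):

```rust
#![allow(dead_code)]

// Mock stand-ins for candle_core::{Device, DType}.
#[derive(Debug, PartialEq)]
enum DType {
    BF16,
    F16,
    F32,
}

enum Device {
    Cuda,
    Metal,
    Cpu,
}

// Prefer a 2-byte dtype wherever the backend supports it, falling back
// to F32 only on CPU. For ~56B weights this keeps the footprint near
// 112 GB instead of the ~224 GB an F32 load would need.
fn model_dtype(device: &Device) -> DType {
    match device {
        Device::Cuda => DType::BF16,
        Device::Metal => DType::F16, // assumes F16 Metal kernels exist for the ops used
        Device::Cpu => DType::F32,
    }
}

fn main() {
    assert_eq!(model_dtype(&Device::Metal), DType::F16);
    println!("metal dtype: {:?}", model_dtype(&Device::Metal));
}
```

Whether this actually fits on a 128G machine would still depend on activation and KV-cache overhead on top of the weights.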
I'm also encountering this. Running the Q4_K_M version of this model: https://huggingface.co/TheBloke/dolphin-2.7-mixtral-8x7b-GGUF/blob/main/dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf
Full stack trace:
thread 'main' panicked at inference_demo/src/main.rs:178:11:
called `Result::unwrap()` on an `Err` value: device mismatch in matmul, lhs: Metal { gpu_id: 4294969767 }, rhs: Cpu
0: std::backtrace_rs::backtrace::libunwind::trace
at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
1: std::backtrace_rs::backtrace::trace_unsynchronized
at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: std::backtrace::Backtrace::create
at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/backtrace.rs:331:13
3: candle_core::error::Error::bt
at /Users/kurt/.cargo/registry/src/index.crates.io-6f17d22bba15001f/candle-core-0.4.1/src/error.rs:227:25
4: candle_core::storage::Storage::same_device
at /Users/kurt/.cargo/registry/src/index.crates.io-6f17d22bba15001f/candle-core-0.4.1/src/storage.rs:49:17
5: candle_core::storage::Storage::matmul
at /Users/kurt/.cargo/registry/src/index.crates.io-6f17d22bba15001f/candle-core-0.4.1/src/storage.rs:655:9
6: candle_core::tensor::Tensor::matmul
at /Users/kurt/.cargo/registry/src/index.crates.io-6f17d22bba15001f/candle-core-0.4.1/src/tensor.rs:1169:23
7:
That's odd, I just ran the model you mentioned on cuda without any issue (and cuda / metal should have similar behavior when it comes to such errors).
Is it possible that your code in inference_demo is not moving the input tensors to the metal device and keeps them on cpu?
How would I check that?
There is no implicit device in candle, so you explicitly set the device each time you create a tensor, or the device is inferred from the arguments in the case of operations on tensors. You would want to check that the input tensors are appropriately sent to your device, as is done here for example. You would also want to check that you're loading the model on the same device. My guess is that somewhere in your code you may have a hardcoded Device::Cpu.
You can also print the device of your tensors so that it's easier to narrow down where this is coming from.
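To make the failure mode concrete, here is a self-contained sketch of the kind of same-device check that produces this error (simplified stand-ins for candle's types, not its actual Storage::same_device code):

```rust
// Mock stand-ins illustrating why matmul fails when the model lives on
// Metal but an input tensor was created on (or left on) the CPU.
#[derive(Debug, PartialEq, Clone)]
enum Device {
    Cpu,
    Metal { gpu_id: u64 },
}

struct Tensor {
    device: Device,
}

impl Tensor {
    // Before any binary op, both operands must live on the same device;
    // otherwise an error like the one in this thread is returned.
    fn matmul(&self, rhs: &Tensor) -> Result<(), String> {
        if self.device != rhs.device {
            return Err(format!(
                "device mismatch in matmul, lhs: {:?}, rhs: {:?}",
                self.device, rhs.device
            ));
        }
        Ok(())
    }
}

fn main() {
    let weights = Tensor { device: Device::Metal { gpu_id: 0 } };
    let input = Tensor { device: Device::Cpu }; // never moved to the GPU
    let err = weights.matmul(&input).unwrap_err();
    println!("{err}");
}
```

In real candle code the usual fix is to send the input to the model's device (e.g. with Tensor::to_device) before the forward pass.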
The hardcoded Device::Cpu was actually in the quantized matmul implementation of candle-core v0.4.1, which is fixed in the latest release, so at least that one is solved 😄