Shouldn't q8 work in 3060/12GB?

jikkuatwork commented 1 day ago

Due diligence

[X] I have done my due diligence in trying to find the answer myself.

Topic

The Rust implementation

Question

System Config

Ubuntu 22
Rust (1.80.1)

nvcc

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

nvidia-smi


Wed Sep 18 23:30:35 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.107.02             Driver Version: 550.107.02     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:0A:00.0  On |                  N/A |
|  0%   46C    P8             15W /  170W |     840MiB /  12288MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2039      G   /usr/lib/xorg/Xorg                            537MiB |
|    0   N/A  N/A      2269      G   /usr/bin/gnome-shell                           67MiB |
|    0   N/A  N/A      4224      G   ...9d0e33034f2368c6ed2015474b1d818a902        206MiB |
|    0   N/A  N/A      9191      G   alacritty                                       9MiB |
|    0   N/A  N/A     26191      G   /home/HOME/Apps/Telegram/Telegram               4MiB |
+-----------------------------------------------------------------------------------------+


## Observations

Tried: `cargo run --bin moshi-backend -r -- --config moshi-backend/config-q8.json standalone`

- The UI loads but the speed is [unacceptably slow](https://github.com/user-attachments/assets/b60f82d1-f78a-4c51-b789-a356f345b25e) and the voice is distorted
- `nvtop` shows that the model isn't loading

Tried: `cargo run --features cuda --bin moshi-backend -r -- --config moshi-backend/config-q8.json standalone`

- Loading model to GPU fails! (I thought 12GB was enough to load the 7GB GGUF? GPU hardly had 1GB used)

❮ cargo run --features cuda --bin moshi-backend -r -- --config moshi-backend/config-q8.json standalone
warning: profiles for the non root package will be ignored, specify profiles at the workspace root:
package:   /home/HOME/Projects/outside_projects/moshi/rust/moshi-core/Cargo.toml
workspace: /home/HOME/Projects/outside_projects/moshi/rust/Cargo.toml
warning: profiles for the non root package will be ignored, specify profiles at the workspace root:
package:   /home/HOME/Projects/outside_projects/moshi/rust/moshi-backend/Cargo.toml
workspace: /home/HOME/Projects/outside_projects/moshi/rust/Cargo.toml
warning: profiles for the non root package will be ignored, specify profiles at the workspace root:
package:   /home/HOME/Projects/outside_projects/moshi/rust/moshi-cli/Cargo.toml
workspace: /home/HOME/Projects/outside_projects/moshi/rust/Cargo.toml
    Finished release profile [optimized] target(s) in 0.23s
     Running target/release/moshi-backend --config moshi-backend/config-q8.json standalone
2024-09-18T18:20:02.612428Z INFO moshi_backend: build_info=BuildInfo { build_timestamp: "2024-09-18T16:57:00.763883182Z", build_date: "2024-09-18", git_branch: "main", git_timestamp: "2024-09-18T17:45:09.000000000+02:00", git_date: "2024-09-18", git_hash: "f3218c60a115b745b1848bb8297df5eb404a041a", git_describe: "f3218c6", rustc_host_triple: "x86_64-unknown-linux-gnu", rustc_version: "1.80.1", cargo_target_triple: "x86_64-unknown-linux-gnu" }
2024-09-18T18:20:02.612441Z INFO moshi_backend: starting process with pid 752759
2024-09-18T18:20:02.612457Z INFO hf_hub: Token file not found "/home/HOME/.cache/huggingface/token"
2024-09-18T18:20:02.682964Z INFO hf_hub: Token file not found "/home/HOME/.cache/huggingface/token"
2024-09-18T18:20:07.910280Z INFO moshi_backend::standalone: warming up the model
Error: DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")

moshi/rust on  main [?] is 📦 v0.2.0 via 🦀 v1.80.1 took 6s

adefossez commented 7 hours ago

That is a good question, would have to look more into it. Maybe @LaurentMazare would have an opinion on this?

jikkuatwork commented 7 hours ago

Thanks a lot! Appreciate your time!

LaurentMazare commented 17 minutes ago

I cannot really test this at the moment, but I think it's somewhat expected. The weights are ~8.17GB, and in q8 mode we pre-allocate a kv-cache for 4096 steps (~5 mins of conversation) in f32, which comes to ~4GB. We should aim at using bf16 instead, but that's likely to require some changes on the candle side. Activations and the mimi parts also have to be stored, but they should be pretty small. So overall we're a bit above 12GB here.

One thing you could try is tweaking this line to be something like 1000 and see if it helps. You'll only be able to have short sessions with moshi, but if it works we could consider making this configurable somehow.
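The arithmetic behind this estimate can be sketched as a small standalone program. It is only a back-of-the-envelope model, not code from the moshi or candle repositories: the constants are the approximate figures quoted in this thread, the function names are hypothetical, and it assumes the kv-cache scales linearly with both the number of steps and the element size. Activations and the mimi parts are ignored, and GB and GiB are treated loosely as the same unit.

```rust
// Approximate figures quoted in the thread (not measured here).
const WEIGHTS_GB: f64 = 8.17; // q8 weights
const KV_CACHE_GB_AT_BASELINE: f64 = 4.0; // kv-cache at 4096 steps in f32
const BASELINE_STEPS: f64 = 4096.0;
const BASELINE_BYTES_PER_ELEM: f64 = 4.0; // f32
const BASELINE_SESSION_SECS: f64 = 300.0; // "~5 mins of conversation"
const GPU_TOTAL_GB: f64 = 12.0; // RTX 3060, 12288MiB
const GPU_ALREADY_USED_GB: f64 = 840.0 / 1024.0; // desktop usage from nvidia-smi

/// Estimated kv-cache size, ASSUMING linear scaling in steps and element size.
fn kv_cache_gb(steps: u32, bytes_per_elem: u32) -> f64 {
    KV_CACHE_GB_AT_BASELINE
        * (steps as f64 / BASELINE_STEPS)
        * (bytes_per_elem as f64 / BASELINE_BYTES_PER_ELEM)
}

/// Weights plus kv-cache; activations and mimi are left out of this sketch.
fn total_gb(steps: u32, bytes_per_elem: u32) -> f64 {
    WEIGHTS_GB + kv_cache_gb(steps, bytes_per_elem)
}

/// Whether the estimate fits next to what the desktop already occupies.
fn fits(steps: u32, bytes_per_elem: u32) -> bool {
    total_gb(steps, bytes_per_elem) + GPU_ALREADY_USED_GB <= GPU_TOTAL_GB
}

/// Rough session length, scaled from the quoted "4096 steps is about 5 minutes".
fn session_secs(steps: u32) -> f64 {
    BASELINE_SESSION_SECS * (steps as f64 / BASELINE_STEPS)
}

fn main() {
    let scenarios = [
        ("4096 steps, f32 (current)", 4096u32, 4u32),
        ("1000 steps, f32 (suggested tweak)", 1000, 4),
        ("4096 steps, bf16 (possible future)", 4096, 2),
    ];
    for (label, steps, bytes) in scenarios {
        println!(
            "{label}: kv-cache ~{:.2}GB, total ~{:.2}GB, session ~{:.0}s, fits: {}",
            kv_cache_gb(steps, bytes),
            total_gb(steps, bytes),
            session_secs(steps),
            fits(steps, bytes),
        );
    }
}
```

Under these assumptions the default configuration lands just above 12GB even before counting what the desktop session already holds, while 1000 steps brings the kv-cache down to roughly 1GB at the cost of sessions a little over a minute long.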

kyutai-labs / moshi