Getting CUDA out of memory running the Rust q8 version on an RTX 4060 with 8 GB VRAM

dewrama commented 1 month ago

Due diligence

[X] I have done my due diligence in trying to find the answer myself.

Topic

The Rust implementation

Question

I am getting a CUDA out-of-memory error. I am running the q8 version on WSL (Ubuntu) with an RTX 4060 that has 8 GB of VRAM. I thought this hardware could run the quantized version. Am I doing something wrong? Please help. (I also tried lower values of CUDA_COMPUTE_CAP and got the same error.)

CUDA_COMPUTE_CAP=86 cargo run --features cuda --bin moshi-backend -r -- --config moshi-backend/config-q8.json standalone

Finished release profile [optimized + debuginfo] target(s) in 1m 15s
Running target/release/moshi-backend --config moshi-backend/config-q8.json standalone
2024-09-29T20:03:12.168129Z INFO moshi_backend: build_info=BuildInfo { build_timestamp: "2024-09-22T23:05:21.856959080Z", build_date: "2024-09-22", git_branch: "main", git_timestamp: "2024-09-21T17:30:23.000000000+02:00", git_date: "2024-09-21", git_hash: "3e3e573b28a1d1d6be084185e1a2e6e550c1ddcf", git_describe: "3e3e573", rustc_host_triple: "x86_64-unknown-linux-gnu", rustc_version: "1.81.0", cargo_target_triple: "x86_64-unknown-linux-gnu" }
2024-09-29T20:03:12.168212Z INFO moshi_backend: starting process with pid 30709

Error: DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")

LaurentMazare commented 1 month ago

The current version is too large for an 8 GB GPU; see the FAQ.

dewrama commented 1 month ago

Thanks for the reply. I saw the recently updated FAQ. Out of curiosity, is there any way to work around this VRAM limitation, such as using NVIDIA unified memory? I also read that the new Intel Ultra CPUs can use Arc to offload to memory. A lot of us have limited hardware (or hardware that can almost run it), and it would be great if we could all use a scaled-down version. Thanks!

LaurentMazare commented 1 month ago

I cannot think of a very easy way to get around this. We have a q4 quantized version that can work on 12 GB or even 8 GB, but I find its quality to be noticeably worse than the original, so I wouldn't recommend going this way. It would certainly be great if alternative implementations emerged in the community and improved on memory requirements etc.
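
For readers who want a rough sense of why q8 is tight on 8 GB while q4 can fit, here is a back-of-the-envelope sketch in Rust. It is not taken from the moshi code base. The parameter count (about 7 billion) and the 2 GiB overhead for caches, activations and the CUDA context are assumptions for illustration only, and the function names are hypothetical. Real quantized formats also store per-block scale factors, which this ignores.

```rust
/// Assumed parameter count, for illustration only (not stated in this thread).
pub const ASSUMED_PARAMS: u64 = 7_000_000_000;

/// Assumed non-weight overhead in GiB (caches, activations, CUDA context).
/// This is a guess; measure the real figure on your own machine.
pub const ASSUMED_OVERHEAD_GIB: f64 = 2.0;

/// Approximate size of the model weights in GiB for a given quantized width.
/// Ignores per-block scale factors used by real quantization formats.
pub fn weight_gib(params: u64, bits_per_weight: f64) -> f64 {
    let bytes = params as f64 * bits_per_weight / 8.0;
    bytes / (1024.0 * 1024.0 * 1024.0)
}

/// Returns true if the weights plus a fixed overhead fit in `vram_gib`.
pub fn fits(params: u64, bits_per_weight: f64, overhead_gib: f64, vram_gib: f64) -> bool {
    weight_gib(params, bits_per_weight) + overhead_gib <= vram_gib
}
```

Under these assumptions, `weight_gib(ASSUMED_PARAMS, 8.0)` comes to about 6.5 GiB, so `fits(ASSUMED_PARAMS, 8.0, ASSUMED_OVERHEAD_GIB, 8.0)` is false, while the same call with `4.0` bits per weight is true. On WSL the Windows desktop may also hold some VRAM, which would shrink the usable budget further.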

kyutai-labs / moshi