oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

0% utilization on 2nd NVIDIA GPU (GGUF), only 2 tokens per second #6054

Open Kaszebe opened 4 months ago

Kaszebe commented 4 months ago

I have the following rig for AI: an RTX 4090, a Gigabyte RTX 4080, a Ryzen 7 7800X3D, and 64GB of DDR5.

I bought two PCIe 4.0 riser cables and connected my 4090 and the Gigabyte 4080 to the motherboard. The 4080 is powered by an external Corsair 850W PSU.

My issue is that I'm only getting 2 tokens per second running Meta-Llama-3-70B-Instruct-Q5_K_M.gguf in Oobabooga, with 55/81 layers offloaded, n_ctx 7936, n_batch 297, and both no-mmap and mlock checked.

I'm also seeing 0% utilization on the 4080. The 4090 gets utilized, just not the 4080.

I have tensor_split set to 24,17 (which results in ~23 GB of VRAM used on the 4090 and ~14.5GB of VRAM used on the 4080).
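
For context, the webui loads GGUF models through llama-cpp-python, so the settings above roughly correspond to a constructor call like the following. This is just a minimal sketch using the values from this issue; the model path is assumed, not taken from my actual setup:

```python
# Minimal llama-cpp-python sketch mirroring the settings above (model path,
# layer count, and split ratio are taken from this issue, not a verified setup).
from llama_cpp import Llama

llm = Llama(
    model_path="models/Meta-Llama-3-70B-Instruct-Q5_K_M.gguf",
    n_gpu_layers=55,        # 55 of 81 layers offloaded to the GPUs
    tensor_split=[24, 17],  # proportional split between the 4090 and the 4080
    n_ctx=7936,
    n_batch=297,
    use_mmap=False,         # "no-mmap" checked
    use_mlock=True,         # "mlock" checked
)
```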

Yes, I know this is a Q5 quant and that I should be using a Q4 at most (if that), given my measly 40GB of VRAM and 64GB of DDR5.

However, I need the model to be as intelligent as possible. I write copy for a living (corporate web pages) and need an AI that can follow along intelligently as we continually modify a 13-word value prop over the course of an hour.

Meta-Llama-3-70B-Instruct-Q5_K_M.gguf is working very well for me in terms of intelligence.

I have searched online for hours and come up with no solutions. I originally went into the BIOS and set the PCIe lanes to x8/x8, but that didn't seem to do anything, so I changed it back to "Automatic". I also tested PCIe 3.0, and that didn't do anything either.

I'm using the PCIe 5.0 x16 (From CPU) slot for the 4090 and the PCIe 4.0 x4 (From CPU) slot for the 4080.

Just wondering: is 2 tokens per second all I should expect given everything I've described so far? And is it normal for the 2nd GPU to show 0% utilization?

RodriMora commented 3 months ago

What is the power usage of each card during inference? Are you running Windows? What driver version?

Kaszebe commented 3 months ago

> What is the power usage of each card during inference? Are you running Windows? What driver version?

Yes, I'm running Windows 11 (fully updated) and the latest NVIDIA driver. For power usage, I can't tell which GPU is which. I loaded up Afterburner and it's showing GPU1, GPU2, and GPU3.

I'm assuming GPU1 and GPU2 are the 4090 and 4080 respectively, and GPU3 is the onboard graphics of the 7800X3D.

There appears to be some power draw (I'm not sure I'm even looking at the right metric in Afterburner) for GPU1 (4090?) and GPU2 (4080?). GPU1 draws the most power when I load the LLM in Ooba and ask it a long question. However, in Windows Task Manager the 4080 is labeled "GPU 1" and shows no utilization.
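
One way to avoid guessing from Afterburner's labels is to poll nvidia-smi, which reports each card by index and name along with its utilization and power draw, making it obvious whether the second card ever gets work during generation. A minimal sketch, assuming nvidia-smi is on the PATH (the NVIDIA driver installs it on both Windows and Linux):

```python
# Poll per-GPU name, utilization, power draw, and memory use once per second.
import subprocess
import time

QUERY = "index,name,utilization.gpu,power.draw,memory.used"

while True:
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout, end="", flush=True)
    time.sleep(1)
```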

Would it be easier to run/diagnose if I were running some flavor of Linux?

I'm willing to install Linux on my machine if that would be easier/faster to run vs. Windows.

Edit: I base my Linux comment on this thread: https://www.reddit.com/r/LocalLLaMA/comments/137xn93/update_yeah_linux_is_a_lot_faster/

Kaszebe commented 3 months ago

Update: I got bored and decided to install Ubuntu as a dual boot alongside Windows 11. It took a while, but I finally got everything working (I think?) and Oobabooga recognized both GPUs. I downloaded the same Q5 quant as above and I'm now getting 3 tokens per second, with about 1-2 seconds to first token.

No clue why Oobabooga is faster on Ubuntu than on Windows 11, but here we are.

I think I installed all the necessary CUDA stuff (I followed an online guide). No clue, but everything seems to work.
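
If it helps to confirm the CUDA install, a quick sanity check from the webui's Python environment (assuming PyTorch is installed there, which the webui's installer normally handles) is to list the devices CUDA can see:

```python
# List every GPU visible to CUDA; both the 4090 and the 4080 should appear.
import torch

print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```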