3Simplex opened this issue 3 months ago
Experiencing same issue.
Same issue here: GPU RX 6800 -> 1.0t/s, 8 core CPU -> 7.5t/s.
When I installed Llama 3.1 8B Instruct 128k it was very slow on my RX 7600 GPU too, less than 3t/s. Hours to a day later I get 11 to 13t/s on it, but my CPU, a 7700X, gets 12t/s. I recently deleted the config and data folders and reinstalled through Flatpak, so there are no old cache files.
Just for comparison, with the previous model, Llama 3 8B Instruct, I get 41t/s on the GPU and 12t/s on the CPU.
I also have the same issue. I get 7 tokens/sec with Vulkan; if I switch to CUDA, I get around 80 tokens/sec. My CPU is an Intel 14900K, my GPU an Nvidia 4090, with 128 GB of RAM.
I am experiencing the same issue on an RX 6800:
Llama 3 8B -> around 50 tokens/sec
Llama 3.1 8B Instruct 128k -> around 5 tokens/sec
I can report the same problem, even with version 3.2.0 (it was already present in 3.1.1). My system has an Intel 13500, an AMD RX 7800 XT, and 32 GB of DDR4 RAM.
For anyone who has not tried this, I added a workaround to the main post.
[!NOTE] Until this is fixed, the workaround is to use the CPU instead.
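For reference, in the desktop app this means selecting CPU as the device in the application settings. If you are instead using the gpt4all Python bindings, the sketch below shows the same workaround; the model filename is a placeholder for whatever GGUF you actually downloaded.

```python
# Minimal sketch of the CPU workaround via the gpt4all Python bindings.
# The filename below is a placeholder, not the exact name GPT4All uses.
from gpt4all import GPT4All

model = GPT4All(
    "Meta-Llama-3.1-8B-Instruct-128k-Q4_0.gguf",  # placeholder filename
    device="cpu",  # force CPU inference instead of the Vulkan GPU backend
)

with model.chat_session():
    print(model.generate("Explain quantization in one sentence.", max_tokens=64))
```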
Confirming very slow token generation for Llama 3.1 128k instruct on nvidia Ada 2000 (i7 13th gen Intel cpu).
@3Simplex @birrozza Which quantization did you use with Vulkan? Not all quantizations are supported, as far as I remember.
Hi, the model I used is the one that can be downloaded via the application, so the quantization is Q4_0. Is it possible to download this model with a different quantization? Thanks
@birrozza I assume you mean Meta Llama.
Then please try all models with Q4 Quantization: https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF
I mean download: Q4_K_L, Q4_K_M, Q4_K_S, IQ4_XS
Try them out and tell me if they work. If they don't, then yes, this is 100% an AMD Vulkan problem.
This is unnecessary; we are aware of the issue. I posted this thread because this was discovered while we were testing the model and needed validation. It has been validated. What you are suggesting will not test Vulkan, because Vulkan only uses Q4_0, Q4_1, and f16. We already know CUDA works as expected with this model; CUDA will use any quantization, as does the CPU.
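To make the point concrete: since the Vulkan (Kompute) backend only handles Q4_0, Q4_1, and f16 (per the comment above), the Q4_K / IQ4 files from the bartowski repo would never run on it. The hypothetical helper below just encodes that rule with a crude filename check; it is an illustration, not anything GPT4All actually does.

```python
# Hypothetical helper encoding the rule stated above: GPT4All's Vulkan
# (Kompute) backend only runs Q4_0, Q4_1 and f16 weights, so testing
# K-quants or IQ-quants says nothing about Vulkan.
VULKAN_QUANTS = {"Q4_0", "Q4_1", "F16"}

def runs_on_vulkan(gguf_filename: str) -> bool:
    # Crude check based on the quant tag at the end of the filename,
    # e.g. "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf".
    stem = gguf_filename.rsplit(".", 1)[0]
    quant = stem.split("-")[-1].upper()
    return quant in VULKAN_QUANTS

for name in ("Meta-Llama-3.1-8B-Instruct-Q4_0.gguf",
             "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
             "Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf"):
    print(name, "->", "Vulkan" if runs_on_vulkan(name) else "not Vulkan")
```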
Expected Behavior
Using the model with llama.cpp directly reports over 60t/s.
Using the model with GPT4All before 3.1.1 I could get about 30t/s.
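One rough way to reproduce a tokens/sec comparison like the one above is with the gpt4all Python bindings, timing a streamed generation on each device. The model filename and prompt below are placeholders, and the streamed chunks are used as an approximate token count, so treat the numbers as ballpark figures only.

```python
# Rough tokens/sec measurement via the gpt4all Python bindings.
# Placeholder filename/prompt; streamed chunks approximate the token count.
import time
from gpt4all import GPT4All

def tokens_per_second(device: str, n_tokens: int = 128) -> float:
    model = GPT4All("Meta-Llama-3.1-8B-Instruct-128k-Q4_0.gguf", device=device)
    start = time.perf_counter()
    chunks = list(model.generate("Write a short story about a robot.",
                                 max_tokens=n_tokens, streaming=True))
    elapsed = time.perf_counter() - start
    return len(chunks) / elapsed

for dev in ("gpu", "cpu"):  # "gpu" selects the Vulkan/Kompute device here
    print(dev, f"{tokens_per_second(dev):.1f} t/s")
```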
Your Environment