3Simplex opened this issue 3 months ago
Experiencing same issue.
Same issue here: GPU RX 6800 -> 1.0t/s, 8 core CPU -> 7.5t/s.
When I installed Llama 3.1 8B Instruct 128k it was very slow on my RX 7600 GPU too, less than 3t/s. Hours to a day later I get 11 to 13t/s on it, but my CPU, a 7700X, gets 12t/s. I recently deleted the config and data folders and reinstalled through Flatpak, so there are no old cache files.
Just for comparison, with the previous model, Llama 3 8B Instruct, I get 41t/s on the GPU and 12t/s on the CPU.
I also have the same issue. I get 7 tokens/sec with Vulkan; if I switch to CUDA, I get around 80 tokens/sec. My CPU is an Intel 14900K, my GPU an Nvidia 4090, with 128 GB of RAM.
I am experiencing the same issue on an RX 6800:
Llama 3 8B -> around 50 tokens/sec
Llama 3.1 8B Instruct 128k -> around 5 tokens/sec
I can report the same problem, even with version 3.2.0 (it was already present in 3.1.1). My system has an Intel 13500, an AMD RX 7800 XT, and 32 GB of DDR4 RAM.
For anyone who has not tried this, I added a workaround to the main post.
[!NOTE] Until this is fixed, the workaround is to use the CPU instead.
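For reference, in the desktop app this means selecting CPU as the device in the application settings. If you are instead using the gpt4all Python bindings, the sketch below shows the same workaround; the model filename is a placeholder for whatever GGUF you actually downloaded.

```python
# Minimal sketch of the CPU workaround via the gpt4all Python bindings.
# The filename below is a placeholder, not the exact name GPT4All uses.
from gpt4all import GPT4All

model = GPT4All(
    "Meta-Llama-3.1-8B-Instruct-128k-Q4_0.gguf",  # placeholder filename
    device="cpu",  # force CPU inference instead of the Vulkan GPU backend
)

with model.chat_session():
    print(model.generate("Explain quantization in one sentence.", max_tokens=64))
```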
Confirming very slow token generation for Llama 3.1 128k instruct on nvidia Ada 2000 (i7 13th gen Intel cpu).
@3Simplex @birrozza Which quantization did you use with Vulkan? Not all quantizations are supported, as far as I remember.
Hi, the model I used is the one that can be downloaded via the application, so the quantization is Q4_0. Is it possible to download this model with a different quantization? Thanks
@birrozza I assume you mean Meta Llama.
Then please try all models with Q4 Quantization: https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF
I mean download: Q4_K_L, Q4_K_M, Q4_K_S, IQ4_XS
Try them out and tell me if they work. If they don't, then yes, this is 100% an AMD Vulkan problem.
This is unnecessary; we are aware of the issue. I posted this thread because this was discovered while we were testing the model and needed validation. It has been validated. What you are suggesting will not test Vulkan, because Vulkan only uses Q4_0, Q4_1, and f16. We already know CUDA works as expected with this model; CUDA will use any quantization, as does the CPU.
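To make the point concrete: since the Vulkan (Kompute) backend only handles Q4_0, Q4_1, and f16 (per the comment above), the Q4_K / IQ4 files from the bartowski repo would never run on it. The hypothetical helper below just encodes that rule with a crude filename check; it is an illustration, not anything GPT4All actually does.

```python
# Hypothetical helper encoding the rule stated above: GPT4All's Vulkan
# (Kompute) backend only runs Q4_0, Q4_1 and f16 weights, so testing
# K-quants or IQ-quants says nothing about Vulkan.
VULKAN_QUANTS = {"Q4_0", "Q4_1", "F16"}

def runs_on_vulkan(gguf_filename: str) -> bool:
    # Crude check based on the quant tag at the end of the filename,
    # e.g. "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf".
    stem = gguf_filename.rsplit(".", 1)[0]
    quant = stem.split("-")[-1].upper()
    return quant in VULKAN_QUANTS

for name in ("Meta-Llama-3.1-8B-Instruct-Q4_0.gguf",
             "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
             "Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf"):
    print(name, "->", "Vulkan" if runs_on_vulkan(name) else "not Vulkan")
```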
Expected Behavior
Using the model with llama.cpp directly reports over 60t/s.
Using the model with GPT4All before 3.1.1 I could get about 30t/s.
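One rough way to reproduce a tokens/sec comparison like the one above is with the gpt4all Python bindings, timing a streamed generation on each device. The model filename and prompt below are placeholders, and the streamed chunks are used as an approximate token count, so treat the numbers as ballpark figures only.

```python
# Rough tokens/sec measurement via the gpt4all Python bindings.
# Placeholder filename/prompt; streamed chunks approximate the token count.
import time
from gpt4all import GPT4All

def tokens_per_second(device: str, n_tokens: int = 128) -> float:
    model = GPT4All("Meta-Llama-3.1-8B-Instruct-128k-Q4_0.gguf", device=device)
    start = time.perf_counter()
    chunks = list(model.generate("Write a short story about a robot.",
                                 max_tokens=n_tokens, streaming=True))
    elapsed = time.perf_counter() - start
    return len(chunks) / elapsed

for dev in ("gpu", "cpu"):  # "gpu" selects the Vulkan/Kompute device here
    print(dev, f"{tokens_per_second(dev):.1f} t/s")
```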
Your Environment