I'm running a Pygmalion 2 7B Q5-quantized GGUF model on my 14" MacBook Pro M1 with 16 GB RAM and a 512 GB SSD. The model performs well on short prompts, but processing slows down significantly once the input reaches roughly 512+ tokens, and it can even crash in that scenario. However, Koboldcpp (which is based on llama.cpp) generates results quickly on the same hardware. Is there a way to improve the speed, such as utilizing BLAS or adjusting the GPU/CPU settings? Any suggestions?
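For reference, this is roughly the kind of configuration I'm asking about. A minimal sketch, assuming the library here is llama-cpp-python; the filename and the exact values below are illustrative, not something I've confirmed works:

```python
# Hypothetical load settings for llama-cpp-python on an M1 Mac.
# The model filename is a placeholder; n_gpu_layers=-1 asks the
# library to offload every layer to the Metal GPU backend.
settings = {
    "model_path": "pygmalion-2-7b.Q5_K_M.gguf",  # hypothetical filename
    "n_gpu_layers": -1,  # offload all layers to Metal (Apple Silicon GPU)
    "n_batch": 512,      # batch size used for prompt processing
    "n_ctx": 2048,       # context window large enough for 512+ token prompts
}

# Actual load (requires the model file on disk and a Metal-enabled build):
# from llama_cpp import Llama
# llm = Llama(**settings)
```

If the library was installed without Metal support compiled in, these options would silently fall back to CPU, which might explain the gap versus Koboldcpp.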
I was starting to get frustrated trying to figure out why it wasn't working, but if it's happening to other people as well, then perhaps the problem is in the library itself 😂😮💨