guinmoon / LLMFarm

Run llama and other large language models offline on iOS and macOS using the GGML library.
https://llmfarm.site
MIT License
1.05k stars · 62 forks

It's so slow #49

Closed · KittenYang closed 3 months ago

KittenYang commented 3 months ago

LLMFarm speed:

[Screenshot 2024-03-13 01 37 29]

Jan speed:

[Screenshot 2024-03-13 01 37 38]

guinmoon commented 3 months ago

Can you tell me what your parameters are for the test? Model, device, prompt?

KittenYang commented 3 months ago

Model: https://huggingface.co/Qwen/Qwen1.5-7B-Chat-GGUF
Prompt: Who are you?
Device: M1 Mac
Settings:

[Screenshot 2024-03-13 01 50 31]

guinmoon commented 3 months ago

Very strange results. I have an old Intel Xeon that cost $30, and even on it, without Metal, I get this result. It should be much faster with Metal.

[Screenshot 2024-03-12 at 21 37 38]
[Screenshot 2024-03-12 at 21 37 03]
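For reference, one way to compare CPU-only and Metal-offloaded generation is llama.cpp's own CLI. This is just a sketch: the model filename and token count are placeholders, and the flag names match llama.cpp builds from around this time.

```sh
# CPU only: offload zero layers to the GPU
./main -m qwen1_5-7b-chat-q4_0.gguf -p "Who are you?" -n 128 -ngl 0

# Metal: offload all layers to the GPU, then compare the tokens/s
# reported in the timing summary at the end of the run
./main -m qwen1_5-7b-chat-q4_0.gguf -p "Who are you?" -n 128 -ngl 99
```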

KittenYang commented 3 months ago

I figured out why: when I disable BOS in the Prompt format section, it works like a charm. What do BOS and EOS mean? BTW, the same model running on an iPhone 14 Pro (iOS 16.6.1) gets 0.56 token/s; is that normal?

[Screenshot 2024-03-13 10 54 34]

guinmoon commented 3 months ago

BOS adds the beginning-of-sequence token to the start of the message; EOS adds the end-of-sequence token to the end of it. It used to be necessary to add these tokens for some models, like LLaMA and Alpaca; now things have changed a bit.

7B models are too big for iPhones below the 15 Pro because there is not enough RAM, so they will be very slow. You can try running q2_k and q3_k_s quantizations with a small context size.
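Conceptually, the two toggles just edit the token sequence that is fed to the model. A minimal sketch of the idea (not LLMFarm's actual code; the real token IDs are model-specific and come from the GGUF metadata, so the IDs below are placeholders):

```swift
// Hypothetical special token IDs; real values are read from the model file.
let bosTokenID: Int32 = 1  // beginning-of-sequence
let eosTokenID: Int32 = 2  // end-of-sequence

func applySpecialTokens(to tokens: [Int32], addBOS: Bool, addEOS: Bool) -> [Int32] {
    var result = tokens
    if addBOS { result.insert(bosTokenID, at: 0) }  // prepend BOS to the message
    if addEOS { result.append(eosTokenID) }         // append EOS to the message
    return result
}

// Placeholder IDs standing in for a tokenized "Who are you?"
let prompt: [Int32] = [5618, 526, 366]
print(applySpecialTokens(to: prompt, addBOS: true, addEOS: false))
// -> [1, 5618, 526, 366]
```

Some chat models already include their own special tokens in the prompt template, so prepending an extra BOS on top of that can confuse them, which may be why turning it off helped here.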

KittenYang commented 3 months ago

Thanks man, LLMFarm is really good!