Closed KittenYang closed 3 months ago
Can you tell me what your parameters are for the test? Model, device, prompt?
Model: https://huggingface.co/Qwen/Qwen1.5-7B-Chat-GGUF
Prompt: Who are you?
Device: M1 Mac
Settings:
Very strange results. I have an old Intel Xeon that cost $30, and without Metal I get this result on it. It should be much faster with Metal.
I figured out why. When I disable BOS in the Prompt format section, it works like a charm. What do BOS and EOS mean?
BTW, the same model running on an iPhone 14 Pro (iOS 16.6.1) gets 0.56 tokens/s. Is that normal?
BOS - adds the beginning-of-sequence token to the beginning of the message.
EOS - adds the end-of-sequence token to the end of the message.
It used to be necessary to add these tokens for some models such as LLaMA and Alpaca, but things have changed a bit since then.
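To make the BOS/EOS options concrete, here is a minimal sketch of what such switches do to a tokenized message. The token IDs and the `wrap_tokens` helper are hypothetical and chosen only for illustration; this is not LLMFarm's actual code, and real IDs depend on the model's tokenizer.

```python
# Hypothetical special-token IDs; real values come from the model's tokenizer.
BOS_ID = 1
EOS_ID = 2


def wrap_tokens(token_ids, add_bos=True, add_eos=False):
    """Return a copy of token_ids with optional BOS/EOS markers added."""
    out = list(token_ids)
    if add_bos:
        out.insert(0, BOS_ID)  # marker placed before the message
    if add_eos:
        out.append(EOS_ID)  # marker placed after the message
    return out


# Stand-in IDs for a tokenized prompt such as "Who are you?"
prompt_ids = [101, 102, 103]

print(wrap_tokens(prompt_ids, add_bos=True, add_eos=True))
print(wrap_tokens(prompt_ids, add_bos=False, add_eos=False))
```

If a model's prompt template does not expect an extra BOS token, adding one changes the input the model sees, which would be consistent with the behavior reported above when BOS was disabled.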
7B models are too big for iPhones below the 15 Pro because there is not enough RAM, so they will be very slow.
You can try running Q2_K or Q3_K_S quantizations with a small context size.
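A rough back-of-the-envelope sketch of why lower-bit quantizations help: the memory needed for the weights scales with the number of bits per weight. The parameter count (a round 7e9) and the bits-per-weight values below are illustrative assumptions, not measured figures for any specific GGUF file, and the estimate ignores the context (KV cache) and runtime overhead.

```python
def estimate_weight_gib(n_params, bits_per_weight):
    """Approximate size of the model weights in GiB (weights only)."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


N_PARAMS = 7e9  # assumed round figure for a "7B" model

# Assumed average bits per weight, for illustration only.
for bits in (2.5, 3.5, 4.5, 8.0, 16.0):
    size = estimate_weight_gib(N_PARAMS, bits)
    print(f"{bits:>4} bits/weight -> about {size:.1f} GiB of weights")
```

Comparing these numbers with the memory actually available to an app on the device gives a quick sense of whether a given quantization has a chance of running at a usable speed.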
Thanks man, LLMFarm is really good!
LLMFarm speed:
Jan speed: