Open cpietsch opened 1 year ago
I believe it's because this is the unquantized version; if you compress it you will get better performance.
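To see why quantization matters here, a back-of-the-envelope estimate of weight memory (these byte counts are generic assumptions about precision, not measurements of this particular CoreML export):

```python
# Rough estimate of weight storage for a ~7B-parameter model at different
# precisions. Illustrates why an unquantized model strains a 16 GB machine.

def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given precision."""
    return n_params * bits_per_weight / 8 / (1024 ** 3)

n = 7e9  # ~7 billion parameters (Llama-2-7b)
print(f"float16: {weight_gib(n, 16):.1f} GiB")  # ~13 GiB: close to all of 16 GB RAM
print(f"4-bit:   {weight_gib(n, 4):.1f} GiB")   # ~3.3 GiB: plenty of headroom
```

With float16 weights alone near 13 GiB, a 16 GB machine will likely swap, which would explain the low GPU/CPU activity reported below.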
Hi @cpietsch! It sounds to me as if the model was running on CPU only. Could you maybe try to run it again with the "GPU History" window of Activity Monitor open at the same time? It should show very clear GPU activity if it's in use.
Also, what computer are you using?
Hi @pcuenca, I am running it on an Apple M1 Pro with 16 GB and macOS 13.4.1. I checked the performance history and it actually does not show significant activity on either the GPU or the CPU.
Interesting. Remember that Activity Monitor does not show Neural Engine usage; perhaps https://github.com/tlkh/asitop could provide more insight.
My suspicion is that the computer is swapping because of memory pressure.
Same experience. I have a Macbook Pro M1 Max with 32GB of RAM, and I get 0.39 tokens/s. It's even worse with Falcon 7b.
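For reference, throughput figures like 0.39 tokens/s are just generated tokens divided by wall-clock time. A minimal sketch of how one might measure it (the `generate` callable is a placeholder, not SwiftChat's actual API):

```python
import time

def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput as quoted in this thread: generated tokens / wall-clock seconds."""
    return n_tokens / elapsed_s

def time_generation(generate, prompt: str):
    """Time any generation callable that returns a list of tokens.

    `generate` is a hypothetical stand-in for whatever inference call
    you are benchmarking; swap in your own.
    """
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return tokens, tokens_per_second(len(tokens), elapsed)
```

For example, 39 tokens generated in 100 seconds gives the 0.39 t/s reported above.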
Maybe we need to convert the model ourselves. But 0.39 t/s is not that bad...
Whelp, just closing all other apps, restarting, and running the SwiftChat build without Xcode has resulted in 4.96 tokens/s. Woohoo!
So @pcuenca was right about the memory pressure.
Here are some profiling images which show a low workload:
It seems that others have the same problem.
I have the same problem. One thing I don't understand is that I was able to get fast responses using [ollama](https://ollama.ai). Any idea why? I can see that the default model used in ollama is the 7b model 🤔
Nice, ollama worked for me too right out of the box.
I tried to convert Llama 2 for swift-chat myself with `python -m exporters.coreml -m=./Llama-2-7b-hf --quantize=float16 --compute_units=cpu_and_gpu ll`, but it always crashes without an error after around 15 minutes. 🤔
I am experiencing the same issue on a 16-inch MBP. Do you have any updates?
I am encountering a similar issue on a MacBook M2 with 32 GB of RAM. It appears that the system may be swapping due to elevated memory pressure. I would greatly appreciate any recommendations for reducing the memory footprint in this context.
Hi, any update on this? I'm hitting the same issue on my M3 MBP.
I downloaded the https://huggingface.co/coreml-projects/Llama-2-7b-chat-coreml model and compiled the chat with Xcode. When running the example prompt it takes around 15 minutes to complete. I am not sure what I did wrong, but the performance should be better, right?
2023-08-09 12:01:55.346753+0200 SwiftChat[27414:583595] Metal API Validation Enabled