huggingface / swift-chat

Mac app to demonstrate swift-transformers
Apache License 2.0
470 stars, 35 forks

a bit slow on my mbp 16 m1 #3

Open cpietsch opened 1 year ago

cpietsch commented 1 year ago

I downloaded the https://huggingface.co/coreml-projects/Llama-2-7b-chat-coreml model and compiled the chat app with Xcode. When running the example prompt, it takes around 15 minutes to complete. I am not sure what I did wrong, but the performance should be better, right?

2023-08-09 12:01:55.346753+0200 SwiftChat[27414:583595] Metal API Validation Enabled

jsj commented 1 year ago

I believe it's because this is the unquantized version; if you compress it you will get better perf.

pcuenca commented 1 year ago

Hi @cpietsch! It sounds to me as if the model was running on CPU only. Could you maybe try to run it again with the "GPU History" window of Activity Monitor open at the same time? It should show very clear GPU activity if it's in use.

Also, what computer are you using?

cpietsch commented 1 year ago

Hi @pcuenca, I am running it on an Apple M1 Pro with 16 GB and macOS 13.4.1. I checked the perf history and it actually does not show significant activity on either the GPU or the CPU.

[Image: Screenshot 2023-08-09 at 17 21 59]

jsj commented 1 year ago

Interesting. Remember that Activity Monitor does not show Neural Engine activity; perhaps https://github.com/tlkh/asitop could provide more insight.

pcuenca commented 1 year ago

My suspicion is that the computer is swapping because of memory pressure.
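One way to check this suspicion is to look at swap usage before and during generation. The sketch below parses the output of the macOS `sysctl vm.swapusage` command. The helper names are made up for illustration, and the output format (`total = …M  used = …M  free = …M`) is an assumption based on what the command typically prints, so treat the parser as a starting point rather than a robust tool.

```python
import re
import subprocess

# Assumed format of `sysctl vm.swapusage` on macOS, e.g.:
#   vm.swapusage: total = 2048.00M  used = 1098.75M  free = 949.25M  (encrypted)
_SWAP_RE = re.compile(r"(total|used|free) = ([0-9.]+)([KMG])")
_UNIT_TO_MIB = {"K": 1 / 1024, "M": 1.0, "G": 1024.0}


def parse_swapusage(line: str) -> dict:
    """Return swap totals in MiB parsed from a `vm.swapusage` line."""
    return {
        key: float(value) * _UNIT_TO_MIB[unit]
        for key, value, unit in _SWAP_RE.findall(line)
    }


def read_swapusage() -> dict:
    """Run `sysctl vm.swapusage` (macOS only) and parse its output."""
    result = subprocess.run(
        ["sysctl", "vm.swapusage"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_swapusage(result.stdout)
```

On a Mac, calling `read_swapusage()` once before starting the app and again while it is generating should show whether the `used` value grows sharply, which would be consistent with swapping.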

awmartin commented 1 year ago

Same experience. I have a Macbook Pro M1 Max with 32GB of RAM, and I get 0.39 tokens/s. It's even worse with Falcon 7b.

[Image: swift-chat-llama-2-slow]

cpietsch commented 1 year ago

> I believe it's because this is the unquantized version; if you compress it you will get better perf.

Maybe we need to convert the model ourselves. But 0.39 t/s is not that bad...

awmartin commented 1 year ago

Whelp, just closing all other apps, restarting, and running the SwiftChat build without Xcode has resulted in 4.96 tokens/s. Woohoo!
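For anyone comparing numbers in this thread, tokens/s is simply the number of generated tokens divided by wall-clock time. A minimal sketch of that measurement follows; `fake_generate` is a hypothetical stand-in for a real token stream and is not part of swift-chat or swift-transformers.

```python
import time
from typing import Iterable, Iterator


def tokens_per_second(token_stream: Iterable[str]) -> float:
    """Consume a token stream and return tokens per wall-clock second."""
    start = time.perf_counter()
    count = 0
    for _ in token_stream:
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


def fake_generate(n_tokens: int, delay_s: float) -> Iterator[str]:
    """Hypothetical generator that emits one token every `delay_s` seconds."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


def speedup(before_tps: float, after_tps: float) -> float:
    """Ratio between two throughput measurements."""
    return after_tps / before_tps


if __name__ == "__main__":
    print(f"{tokens_per_second(fake_generate(20, 0.05)):.2f} tokens/s")
    # The two figures reported above: 0.39 tokens/s vs 4.96 tokens/s.
    print(f"speedup: {speedup(0.39, 4.96):.1f}x")
```

With the two figures reported above, the ratio works out to roughly a 12.7x improvement from freeing memory and running outside Xcode.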

cpietsch commented 1 year ago

So @pcuenca was right about the memory pressure.

cpietsch commented 1 year ago

Here are some profiling images which show a low workload:

[Images: Screenshot 2023-08-21 at 12 47 24, Screenshot 2023-08-21 at 12 34 28]

It seems that others have the same problem.

longseespace commented 1 year ago

I have the same problem. One thing I don't understand is that I was able to get fast responses using [ollama](https://ollama.ai). Any idea why? I can see that the default model used in ollama is the 7b model 🤔

cpietsch commented 1 year ago

Nice, ollama worked for me too right out of the box. I tried to convert llama2 for swift-chat myself with `python -m exporters.coreml -m=./Llama-2-7b-hf --quantize=float16 --compute_units=cpu_and_gpu ll`, but it always crashes without an error after around 15 minutes. 🤔

markwitt1 commented 1 year ago

I am experiencing the same issue on a 16-inch MBP. Do you have any updates?

matiasvillaverde commented 1 year ago

I am encountering a similar issue while utilizing a MacBook M2 with 32GB of RAM. It appears that the system may be engaging in swapping due to elevated memory pressure. I would greatly appreciate any insights or recommendations you might have for optimizing and mitigating the memory footprint in this context.
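As a rough back-of-the-envelope estimate of why memory is tight, the weights alone can be sized as parameter count times bits per weight. The sketch below assumes roughly 7 billion parameters for a "7b" model and counts weights only; it ignores the KV cache, activations, the OS, and other running apps, and it says nothing about which bit widths a given conversion tool actually supports.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    n_params = 7e9  # assumption: "7b" taken as roughly 7 billion parameters
    for bits in (16, 8, 4):
        size = weight_footprint_gib(n_params, bits)
        print(f"{bits:>2}-bit weights: ~{size:.1f} GiB")
```

Under these assumptions, 16-bit weights come to about 13 GiB, which leaves very little headroom on a 16 GB machine and is still a large share of 32 GB once Xcode and other apps are open. Lower bit widths shrink this proportionally, which is in line with the earlier suggestion that a compressed model should perform better.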

AndreaChiChengdu commented 7 months ago

Hi guys, any update on this issue? I am seeing the same problem on my M3 MBP.