jankais3r / LLaMA_MPS

Run LLaMA (and Stanford-Alpaca) inference on Apple Silicon GPUs.
GNU General Public License v3.0

Why is fp16 MPS performance worse than CPU? #15

Open · FdyCN opened 1 year ago

FdyCN commented 1 year ago

In your conclusion, MPS performance is worse than llama.cpp's CPU performance at the same fp16 precision. Why? Is there any kernel that MPS doesn't support, which falls back to the CPU (and therefore hurts performance)?
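
(For context on the fallback question: in stock PyTorch, an op without an MPS kernel raises `NotImplementedError` rather than silently falling back, unless you opt in via the `PYTORCH_ENABLE_MPS_FALLBACK` environment variable. A minimal sketch of that opt-in, not taken from this repo:)

```python
import os

# Opt in to CPU fallback for ops that have no MPS kernel. Without this,
# PyTorch raises NotImplementedError instead of silently running on the CPU.
# Silent fallbacks (and the host<->device syncs they imply) can hurt
# MPS performance. Must be set before PyTorch initializes the MPS backend.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch

if torch.backends.mps.is_available():
    x = torch.randn(1024, 1024, device="mps", dtype=torch.float16)
    y = x @ x            # matmul has a native MPS kernel; runs on the GPU
    print(y.device)      # mps:0
```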

You said this: [image]

I figure you mean that the MPS shaders are compiled just-in-time, so performance is worse than the ahead-of-time compiled CPU code? Am I wrong?

jankais3r commented 1 year ago

Hi, I wish I could give you a definitive answer, but unfortunately I am not familiar enough with PyTorch's MPS implementation to be able to confirm or deny your theory...

philipturner commented 1 year ago

It's bandwidth. The model is bottlenecked by how quickly the processor can fetch weights from RAM. FP16 consumes 4x as many bits per weight as Int4, and is thus roughly 4x slower.
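
(A rough back-of-the-envelope check of this argument, with my own assumed numbers rather than anything from the repo: ~400 GB/s unified-memory bandwidth for an M1 Max, and LLaMA-7B's ~6.7B parameters. Each generated token streams the full weight set from RAM, so tokens/sec is capped at bandwidth divided by model size:)

```python
# Bandwidth-bound upper limit on decode speed: per token, every weight is
# read from RAM once, so tokens/s <= memory_bandwidth / model_bytes.
GB = 1e9
bandwidth = 400 * GB   # assumed: M1 Max unified-memory bandwidth (~400 GB/s)
params = 6.7e9         # LLaMA-7B parameter count

for name, bytes_per_weight in [("fp16", 2.0), ("int4", 0.5)]:
    model_bytes = params * bytes_per_weight
    print(f"{name}: <= {bandwidth / model_bytes:.0f} tokens/s "
          f"({model_bytes / GB:.1f} GB of weights per token)")

# fp16: <= 30 tokens/s (13.4 GB per token)
# int4: <= 119 tokens/s (3.4 GB per token)  -> the ~4x gap
```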