Open FdyCN opened 1 year ago
Hi, I wish I could give you a definitive answer, but unfortunately I am not familiar enough with PyTorch's MPS implementation to be able to confirm or deny your theory...
It's bandwidth. The model is bottlenecked by how quickly the processor can fetch weights from RAM. FP16 uses 4x as many bits per weight as Int4, and is thus roughly 4x slower.
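The arithmetic behind this can be sketched as a back-of-the-envelope estimate: during decoding, every weight is read once per generated token, so the floor on per-token latency is (weight bytes) / (memory bandwidth). The 7B parameter count and 100 GB/s figure below are hypothetical placeholders, not numbers from this thread:

```python
def tokens_per_second(n_params: float, bits_per_weight: int, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when streaming weights from RAM is the bottleneck."""
    weight_bytes = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes

# Hypothetical example: a 7B-parameter model on 100 GB/s of memory bandwidth.
fp16 = tokens_per_second(7e9, 16, 100)  # 16 bits per weight
int4 = tokens_per_second(7e9, 4, 100)   # 4 bits per weight

print(f"FP16: {fp16:.1f} tok/s, Int4: {int4:.1f} tok/s, ratio: {int4 / fp16:.1f}x")
```

Halving the bits per weight doubles the bound, so Int4 comes out at exactly 4x the FP16 ceiling regardless of the model size or bandwidth chosen.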
In your conclusion, MPS performance is worse than llama.cpp CPU performance at the same FP16 precision. Why? Is there some kernel that MPS doesn't support, which falls back to the CPU (and so hurts performance)?
You said this:
I figure you mean that the MPS shaders are compiled just-in-time, so their performance is worse than ahead-of-time compiled CPU code? Am I wrong?