akashicMarga opened this issue 6 months ago
Hi, can you pull the latest main branch and let me know if it's still happening? It seems like it isn't compiling with Metal for you.
Same here, it's not compiling for Metal.
https://github.com/jafioti/luminal/assets/18519731/676d2f7d-3eeb-4605-964c-6f2c597b2e1e
Would you be able to set the number of tokens generated to 1 and call execute_debug in the decoding loop? My guess is there is still some op taking 90% of the time. The debug printout will show the shape of each op and how long it took.
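For reference, a rough sketch of what that change to the example's decoding loop might look like (the `cx` graph handle, the token-count constant, and the sampling step are assumptions based on the luminal examples, not exact code from the repo):

```rust
// Hypothetical sketch: generate a single token and use the debug executor
// so per-op shapes and timings get printed.
const TOKENS_TO_GENERATE: usize = 1; // reduced from the example's default

for _ in 0..TOKENS_TO_GENERATE {
    // `cx` is the luminal Graph from the example; execute_debug() runs the
    // graph and prints each op along with its timing, which should reveal
    // the op eating most of the time.
    cx.execute_debug();

    // ... sample the next token and feed it back in, as in the original loop ...
}
```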
On Discord you mentioned this is for the M3. Is it an M3 or an M2 Pro?
As I mentioned there, it's a MacBook Pro with just the base M2.
@akashicMarga What tool do you use to get the GPU diagnostics and memory usage shown on the right in your screenshots?
@akashicMarga I got my hands on a 16GB machine and tested it out. It's weird, but it turns out the memory usage isn't being reported properly. Phi worked, but llama did not, and memory usage was already above 9 GB before running luminal. So I think the issue is still that memory runs out and the model gets kicked to swap, it just isn't reported correctly.
Did you say you got candle or llama.cpp running with Q8 llama on your machine?
The llama-3 example is running slow and not utilising the Metal GPU. GPU usage is mostly at 0%, with occasional spikes to 20 or 35%.