akashicMarga opened this issue 6 months ago
Hi, can you pull the latest main branch and let me know if it's still happening? It seems like it isn't compiling with Metal for you.
Same here, it's not compiling for Metal.
https://github.com/jafioti/luminal/assets/18519731/676d2f7d-3eeb-4605-964c-6f2c597b2e1e
Would you be able to set the number of tokens generated to 1 and call execute_debug in the decoding loop? My guess is there is still some op taking 90% of the time. The debug printout will show the shape of each op and how long it took.
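For reference, a rough sketch of what that change to the example's decoding loop might look like (the `cx` graph handle, the token-count constant, and the sampling step are assumptions based on the luminal examples, not exact code from the repo):

```rust
// Hypothetical sketch: generate a single token and use the debug executor
// so per-op shapes and timings get printed.
const TOKENS_TO_GENERATE: usize = 1; // reduced from the example's default

for _ in 0..TOKENS_TO_GENERATE {
    // `cx` is the luminal Graph from the example; execute_debug() runs the
    // graph and prints each op along with its timing, which should reveal
    // the op eating most of the time.
    cx.execute_debug();

    // ... sample the next token and feed it back in, as in the original loop ...
}
```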
On Discord you mentioned this is for the M3. Is it an M3 or an M2 Pro?
As I mentioned there, it's a MacBook Pro with just the base M2.
@akashicMarga What tool do you use to get the GPU diagnostics and memory usage shown on the right in your screenshots?
@akashicMarga I got my hands on a 16GB machine and tested it out. It's weird, but it turns out the memory usage isn't being reported properly. Phi worked, but llama did not, and memory usage was already above 9 GB before running luminal. So I think the issue is still that memory runs out and the model gets kicked to swap, it just isn't reported correctly.
Did you say you got candle or llama.cpp running with Q8 llama on your machine?
The llama-3 example is running slow and not utilising the Metal GPU. GPU usage is mostly at 0%, with occasional spikes to 20 or 35%.