zackshen opened this issue 8 months ago
Hey there! Thanks for reporting this and providing lots of detail :)
The issue here is that the version of GGML we use doesn’t support a specific operation required for feeding more than one token at a time with Metal (i.e. this works fine with CUDA, not Metal). See also #403.
This has been fixed in upstream GGML/llama.cpp, but we haven’t integrated that fix yet. The work has started in #428 and that should hopefully be finished within the next week (I’m out of town but I hope to get back to it soon).
Hope that helps clarify the state of affairs!
I'm very happy to hear this news and am looking forward to the merged version. Thank you for your work.
Can I keep this issue open until after the release?
Hello @philpax, has there been any recent movement on this?
I started working on it, but realised that it would end up being quite a large task. Still working on it, but it'll take some time.
thanks
llm is indeed a fantastic library and very easy to use. However, after using llm for a few days, I noticed that feed_prompt is always very slow: it consumes a significant amount of CPU and doesn't use the GPU at all (the hardware acceleration documentation notes that feed_prompt currently doesn't use GPU resources). As a result, if I add some context during a conversation, it takes a long time for feed_prompt to complete, which is not ideal for the actual user experience. I used TheBloke/Llama-2-7B-Chat-GGML/llama-2-7b-chat.ggmlv3.q2_K.bin for testing.

Using the same model and prompt, I also tested with llama.cpp, and its time to first token is very fast. I'm not sure what the difference is in the feed_prompt process between llm and llama.cpp. Judging by the CPU and GPU history, llama.cpp appears to be fully utilizing the GPU for inference. Can you please help me identify what's wrong?
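For context, the code path in question looks roughly like the minimal sketch below, modelled on the llm crate's README-style example; exact parameter names and signatures differ between llm versions, so treat the specifics (model path, prompt, tokenizer source, default parameters) as assumptions rather than the actual sample code from this report.

```rust
use std::{convert::Infallible, io::Write, path::Path};

fn main() {
    // Load a GGML model from disk. The path is a placeholder; any GPU-related
    // options would go in the ModelParameters argument (left as defaults here).
    let llama = llm::load::<llm::models::Llama>(
        Path::new("llama-2-7b-chat.ggmlv3.q2_K.bin"),
        llm::TokenizerSource::Embedded,
        Default::default(), // llm::ModelParameters
        llm::load_progress_callback_stdout,
    )
    .unwrap_or_else(|err| panic!("failed to load model: {err}"));

    // Start a session and run inference. The prompt is ingested first
    // (the feed_prompt step discussed above) before any new tokens are
    // sampled, so a long prompt dominates the time to first token.
    let mut session = llama.start_session(Default::default());
    let res = session.infer::<Infallible>(
        &llama,
        &mut rand::thread_rng(),
        &llm::InferenceRequest {
            prompt: "Some long conversational context goes here...".into(),
            parameters: &llm::InferenceParameters::default(),
            play_back_previous_tokens: false,
            maximum_token_count: Some(128),
        },
        &mut Default::default(), // llm::OutputRequest
        |r| match r {
            // Stream generated tokens to stdout as they are sampled.
            llm::InferenceResponse::InferredToken(t) => {
                print!("{t}");
                std::io::stdout().flush().unwrap();
                Ok(llm::InferenceFeedback::Continue)
            }
            _ => Ok(llm::InferenceFeedback::Continue),
        },
    );

    match res {
        Ok(stats) => println!("\n\n{stats}"),
        Err(err) => eprintln!("\n{err}"),
    }
}
```

As I understand it, infer runs the feed_prompt step over all prompt tokens before sampling begins, which is why prompt ingestion, rather than per-token generation, is the slow part reported here.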
Model:
System:
llama.cpp command:
llama.cpp Result:
llm sample code:
llm sample code result: