Hey @ibehnam -- do you mind pointing me to where you got the 4-bit GGUF from? A helpful test would be to see whether llama-cpp-python can load the file, with something like the following code:
```python
from llama_cpp import Llama

llm = Llama(
    model_path="<placeholder>/LLM/models/phi-3-medium-4k-instruct.Q4_0.gguf",
    logits_all=True,
    n_gpu_layers=128,
    n_ctx=4096,
)
```
And perhaps a quick test of a generation:
```python
output = llm(
    "Q: Name the planets in the solar system? A: ",  # Prompt
    max_tokens=32,       # Generate up to 32 tokens; set to None to generate up to the end of the context window
    stop=["Q:", "\n"],   # Stop generating just before the model would generate a new question
    echo=True,           # Echo the prompt back in the output
)  # Generate a completion; can also call create_completion
print(output)
```
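
If both of those work, it would also help to compare against the Guidance-side load. A minimal sketch, assuming the same placeholder path and that `models.LlamaCpp` forwards extra keyword arguments to `llama_cpp.Llama`:

```python
from guidance import models, gen

# Load the same GGUF through Guidance's LlamaCpp wrapper; extra keyword
# arguments are assumed to be forwarded to the underlying llama_cpp.Llama.
lm = models.LlamaCpp(
    "<placeholder>/LLM/models/phi-3-medium-4k-instruct.Q4_0.gguf",
    n_gpu_layers=128,
    n_ctx=4096,
)

# Quick generation check, mirroring the llama-cpp-python test above.
lm = lm + "Q: Name the planets in the solar system? A: " + gen(max_tokens=32, stop="\n")
print(lm)
```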
A look at your stack trace suggests that the issue may be coming from the upstream repo we depend on to interface with llama.cpp (https://github.com/abetlen/llama-cpp-python), but I'm happy to try to debug on our side too.
@Harsha-Nori Thanks so much for your response. I did what you suggested and got the same error using llama-cpp-python directly. I'll dig more and try to find a workaround. I know llama.cpp itself can handle the new models (ollama runs Phi-3-medium just fine), so it'll probably boil down to manually compiling llama.cpp for the llama-cpp-python package.
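
Before recompiling, I'll at least confirm which binding version I have installed and let llama.cpp print its own loader log. A minimal check (path is a placeholder):

```python
import llama_cpp
from llama_cpp import Llama

# Phi-3-medium support depends on the llama.cpp revision bundled with
# the installed llama-cpp-python release, so note the version first.
print(llama_cpp.__version__)

# verbose=True lets llama.cpp print its loader log, which should show
# whether the GGUF's architecture is recognized or rejected.
llm = Llama(
    model_path="<placeholder>/LLM/models/phi-3-medium-4k-instruct.Q4_0.gguf",
    verbose=True,
)
```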
I can use Guidance with Phi-3-mini, which was announced a while ago, but with the new ones (the Phi-3-medium class) I get: