cphoward opened 11 months ago
After playing around with the model, I've found that the following does not crash, provided I also change the PREPROMPT to something relatively short:

```python
import numpy as np

# `tokenizer` and `client` (the model server gRPC client) are created elsewhere in the script.
def prepare_preprompt_kv_cache(preprompt):
    inputs = tokenizer(preprompt, return_tensors="np", add_special_tokens=False)
    model_inputs = {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
    }
    # Generate position ids based on the length of the input
    seq_length = inputs["input_ids"].shape[1]
    model_inputs["position_ids"] = np.arange(seq_length)[None, :]
    # Initialize empty past key/value tensors for each of the 32 layers
    for i in range(32):
        model_inputs[f"past_key_values.{i}.key"] = np.zeros((1, 32, 0, 128), dtype=np.float32)
        model_inputs[f"past_key_values.{i}.value"] = np.zeros((1, 32, 0, 128), dtype=np.float32)
    return client.predict(inputs=model_inputs, model_name='llama')
```

It crashes when attempting to run with the default PREPROMPT:
```
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.INTERNAL
	details = "Internal inference error"
	debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:9000 {created_time:"2023-12-21T22:42:05.540244646+00:00", grpc_status:13, grpc_message:"Internal inference error"}"
```
The server logs give:
```
[2023-12-21 22:25:15.297][62][serving][error][modelinstance.cpp:1168] Async caught an exception Internal inference error: Exception from src/inference/src/infer_request.cpp:256:
Exception from src/inference/src/dev/converter_utils.cpp:707:
[ GENERAL_ERROR ] Shape inference of Multiply node with name __module.model.layers.0.self_attn/aten::mul/Multiply failed: Exception from src/plugins/intel_cpu/src/shape_inference/custom/eltwise.cpp:47:
```
I have confirmed with a custom script that the model can run inference, but the output is mostly gibberish and only a few characters long.
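For completeness, a minimal sketch of how `tokenizer` and `client` in the snippet above could be set up and how the helper might be called (the tokenizer name, address, and use of the `ovmsclient` gRPC client are assumptions; adjust to the actual environment):

```python
# Minimal sketch, assuming the Hugging Face tokenizer for the source model and
# the ovmsclient gRPC client; the model name and address are illustrative only.
from transformers import AutoTokenizer
from ovmsclient import make_grpc_client

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
client = make_grpc_client("localhost:9000")

# A short preprompt that does not trigger the crash described above.
PREPROMPT = "You are a helpful assistant."
outputs = prepare_preprompt_kv_cache(PREPROMPT)
```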
Hello @cphoward
We have recently removed the demo you refer to. However, please check the new version, which uses the new MediaPipe Python calculator feature that makes it easier to serve LLaMA: https://github.com/openvinotoolkit/model_server/tree/main/demos/python_demos/llm_text_generation
Please let us know if you have any feedback.
**Describe the bug**
I am attempting to run the LLaMA2 demo at https://github.com/openvinotoolkit/model_server/blob/main/demos/llama_chat/python/README.md. When I run:
I get
**To Reproduce**
Steps to reproduce the behavior:
**Expected behavior**
I expected results similar to the demo documentation.
**Configuration**
**Additional context**
I did install `nncf` for int8 compression. Is there a way to configure the example to use int4 compression?
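A rough sketch of the kind of int4 weight compression I have in mind (the model path is a placeholder and `CompressWeightsMode.INT4_SYM` availability depends on the installed NNCF version, so this is an assumption rather than a confirmed recipe):

```python
# Hedged sketch: applying int4 weight compression to the exported OpenVINO IR
# with NNCF before serving. The paths are placeholders; INT4 modes require a
# sufficiently recent NNCF release (nncf.compress_weights defaults to int8).
import nncf
import openvino as ov

core = ov.Core()
ir_model = core.read_model("llama-2-7b-chat/openvino_model.xml")  # placeholder path

# mode defaults to int8; INT4_SYM requests 4-bit symmetric weight compression.
compressed = nncf.compress_weights(ir_model, mode=nncf.CompressWeightsMode.INT4_SYM)

ov.save_model(compressed, "llama-2-7b-chat-int4/openvino_model.xml")
```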
**Update**: It seems the missing argument is `position_ids`.
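The prefill helper earlier in this thread already builds `position_ids` for the whole preprompt; for the incremental decoding steps I would expect the inputs to look roughly like the following (illustrative names only, not the demo's actual code):

```python
# Sketch: inputs for a single decoding step once `past_length` tokens are
# already in the KV cache. Names are illustrative, not from the demo script.
import numpy as np

def next_step_inputs(next_token_id: int, past_length: int) -> dict:
    return {
        # only the newly generated token is fed in
        "input_ids": np.array([[next_token_id]], dtype=np.int64),
        # attention mask covers the cached tokens plus the new one
        "attention_mask": np.ones((1, past_length + 1), dtype=np.int64),
        # the new token's position continues from the cached sequence
        "position_ids": np.array([[past_length]], dtype=np.int64),
    }
```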