openvinotoolkit / model_server

A scalable inference server for models optimized with OpenVINO™
https://docs.openvino.ai/2024/ovms_what_is_openvino_model_server.html
Apache License 2.0

LLaMA2 Model Serving Chat Demo Errors on Invalid number of inputs #2218

Open cphoward opened 11 months ago

cphoward commented 11 months ago

Describe the bug

I am attempting to run the LLaMA2 demo at https://github.com/openvinotoolkit/model_server/blob/main/demos/llama_chat/python/README.md. When I run:

python client.py --url localhost:9000 --question "Write python function to sum 3 numbers." --seed 1332 --actor python-programmer

I get

raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.INVALID_ARGUMENT
    details = "Invalid number of inputs - Expected: 67; Actual: 66"
    debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Invalid number of inputs - Expected: 67; Actual: 66", grpc_status:3, created_time:"2023-12-20T17:45:16.689999237+00:00"}"

To Reproduce

Steps to reproduce the behavior:

  1. Follow the demo steps.

Expected behavior

I expected results similar to the demo documentation.

Configuration

docker run -d --rm -p 9000:9000 -v $(pwd)/models/llama-2-7b-hf:/model:ro openvino/model_server \
    --port 9000 \
    --model_name llama \
    --model_path /model \
    --plugin_config '{"PERFORMANCE_HINT":"LATENCY","NUM_STREAMS":1}'

Additional context

I did install NNCF for int8 compression. Is there a way to configure the example to use int4 compression?
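(For reference, with a recent NNCF release something like the following looks like the int4 route. This is only a sketch I have not verified against this demo; the paths and the group_size/ratio values are illustrative, not taken from the demo.)

import nncf
import openvino as ov

core = ov.Core()
# Path to the IR produced by the demo's conversion step (illustrative).
model = core.read_model("models/llama-2-7b-hf/1/openvino_model.xml")

# int4 weight compression; group_size and ratio are typical starting points.
compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    group_size=128,
    ratio=0.8,
)

ov.save_model(compressed, "models/llama-2-7b-hf-int4/1/openvino_model.xml")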

Update: It seems the missing model input is position_ids.
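A quick way to list the inputs the served model actually expects (a sketch, assuming the ovmsclient package the demo's client.py uses; the exact return format may vary by version):

from ovmsclient import make_grpc_client

client = make_grpc_client("localhost:9000")
metadata = client.get_model_metadata(model_name="llama")

# Print every expected input name with its shape/dtype; position_ids shows up
# here even though the demo client never sends it.
for name, info in metadata["inputs"].items():
    print(name, info)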

cphoward commented 11 months ago

After playing around with the model, I've found that something like the following won't crash, as long as I also change the PREPROMPT to something relatively short:

import numpy as np

def prepare_preprompt_kv_cache(preprompt):
    # Tokenize the pre-prompt; the demo's tokenizer and ovmsclient client
    # are assumed to be created elsewhere in client.py.
    inputs = tokenizer(preprompt, return_tensors="np", add_special_tokens=False)
    model_inputs = {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"]
    }

    # Generate position ids based on the length of the input
    seq_length = inputs["input_ids"].shape[1]
    model_inputs["position_ids"] = np.arange(seq_length)[None, :]

    # Initialize an empty KV cache for each of the 32 decoder layers
    # (llama-2-7b: 32 layers, 32 heads, head size 128; sequence axis starts at 0)
    for i in range(32):
        model_inputs[f"past_key_values.{i}.key"] = np.zeros((1, 32, 0, 128), dtype=np.float32)
        model_inputs[f"past_key_values.{i}.value"] = np.zeros((1, 32, 0, 128), dtype=np.float32)

    return client.predict(inputs=model_inputs, model_name='llama')

It still crashes when run with the default PREPROMPT:

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.INTERNAL
    details = "Internal inference error"
    debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:9000 {created_time:"2023-12-21T22:42:05.540244646+00:00", grpc_status:13, grpc_message:"Internal inference error"}"

The server logs show:

[2023-12-21 22:25:15.297][62][serving][error][modelinstance.cpp:1168] Async caught an exception Internal inference error: Exception from src/inference/src/infer_request.cpp:256:
Exception from src/inference/src/dev/converter_utils.cpp:707:
[ GENERAL_ERROR ] Shape inference of Multiply node with name __module.model.layers.0.self_attn/aten::mul/Multiply failed: Exception from src/plugins/intel_cpu/src/shape_inference/custom/eltwise.cpp:47:

I have confirmed with a custom script that the model can run inference, but the output is mostly gibberish and only a few characters long.
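For completeness, a greedy decode loop for this stateless setup would look roughly like the following. This is only a sketch: the logits and present.{i}.key / present.{i}.value output names are assumptions based on a typical optimum-intel export (not confirmed from the demo), prepare_preprompt_kv_cache is the helper from above, and client / tokenizer come from the demo's client.py.

import numpy as np

def greedy_decode(prompt, max_new_tokens=64):
    # First pass: full prompt with an empty KV cache.
    response = prepare_preprompt_kv_cache(prompt)

    tokens = []
    for _ in range(max_new_tokens):
        logits = response["logits"]              # assumed output name
        next_token = int(np.argmax(logits[0, -1]))
        if next_token == tokenizer.eos_token_id:
            break
        tokens.append(next_token)

        # Next pass: feed only the new token, reuse the returned KV cache,
        # and advance attention_mask / position_ids by one.
        past_len = response["present.0.key"].shape[2]   # assumed output name
        model_inputs = {
            "input_ids": np.array([[next_token]], dtype=np.int64),
            "attention_mask": np.ones((1, past_len + 1), dtype=np.int64),
            "position_ids": np.array([[past_len]], dtype=np.int64),
        }
        for i in range(32):
            model_inputs[f"past_key_values.{i}.key"] = response[f"present.{i}.key"]
            model_inputs[f"past_key_values.{i}.value"] = response[f"present.{i}.value"]
        response = client.predict(inputs=model_inputs, model_name="llama")

    return tokenizer.decode(tokens)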

dkalinowski commented 10 months ago

Hello @cphoward

We have recently removed the demo you refer to. However, please check the new version, which uses the new MediaPipe Python calculator feature that makes it easier to serve Llama: https://github.com/openvinotoolkit/model_server/tree/main/demos/python_demos/llm_text_generation

Please let us know if you have any feedback.