openvinotoolkit / model_server

A scalable inference server for models optimized with OpenVINO™
https://docs.openvino.ai/2024/ovms_what_is_openvino_model_server.html
Apache License 2.0

Demo: LLM models with Continuous Batching via OpenAI API not working #2704

Closed paguilomanas closed 1 month ago

paguilomanas commented 1 month ago

Describe the bug:

Hi everyone, I wanted to serve an optimized LLM model using OVMS. I have followed the continuous batching demo, but when I run the container and check the model status it stays in the LOADING state with errors indicating that the deployment failed, so the model endpoint never becomes ready and I cannot make client requests.

To Reproduce: I have followed the demo steps:

  1. I have pulled the latest OVMS image:

docker pull openvino/model_server:latest
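To confirm which server build was actually pulled (an optional check I am adding here; it assumes the image entrypoint accepts --version as in recent releases, and the string it prints matches the one listed in the Configuration section below):

# Print the version of the pulled server image
docker run --rm openvino/model_server:latest --version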

  2. I have installed the required Python dependencies in my environment (Python 3.10):
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu https://storage.openvinotoolkit.org/simple/wheels/pre-release"
pip3 install --pre "optimum-intel[nncf,openvino]"@git+https://github.com/huggingface/optimum-intel.git  openvino_tokenizers==2024.4.* openvino==2024.4.*
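As an optional sanity check (my own addition), the resolved package versions can be listed to confirm the 2024.4 pre-release wheels were actually installed:

# Optional: confirm the versions of the key packages picked up by pip
pip3 show openvino openvino-tokenizers optimum-intel | grep -E '^(Name|Version):'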
  3. I have moved to the required folder and run optimum-cli to export the model and tokenizer to IR format (.bin and .xml) with dtype=FP16:
cd demos/continuous_batching
convert_tokenizer -o Meta-Llama-3-8B-Instruct --utf8_replace_mode replace --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens meta-llama/Meta-Llama-3-8B-Instruct
optimum-cli export openvino --disable-convert-tokenizer --model meta-llama/Meta-Llama-3-8B-Instruct --weight-format fp16 Meta-Llama-3-8B-Instruct 

An important detail to point out: since I already had the HF model downloaded in a specific path, I set the environment variable export HF_HOME="/mnt/shared_models/huggingface/cache".
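As an extra check after the export (my own addition, paths as in the demo layout), the model folder can be listed to confirm the IR and tokenizer/detokenizer files were produced:

# Optional: verify the exported files are in place before copying the graph
ls Meta-Llama-3-8B-Instruct/
# expected among others: openvino_model.xml/.bin, openvino_tokenizer.xml/.bin, openvino_detokenizer.xml/.bin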

  4. I have copied the graph file into the optimized model folder without changing any field:

cp graph.pbtxt Meta-Llama-3-8B-Instruct/graph.pbtxt

I have all the expected files inside the model folder (the full listing is shown in the Configuration section below).
  5. I have prepared the config.json file as provided (its contents are shown in the Configuration section below).
  6. And finally I ran the container:

docker run -d --rm -p 8000:8000 -v $(pwd)/:/workspace:ro openvino/model_server:latest --port 9000 --rest_port 8000 --config_path /workspace/config.json
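To see why loading fails without exec-ing into the container, the server log of the detached container can also be inspected directly (a generic Docker check I am adding; <container_id> is whatever docker run printed):

# Inspect the server log of the running container for loading errors
docker logs <container_id> 2>&1 | grep -iE 'error|failed'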

When I run curl http://localhost:8000/v1/config to check the served model status, I get this output:

{
    "meta-llama/Meta-Llama-3-8B-Instruct": {
        "model_version_status": [
            {
                "version": "1",
                "state": "LOADING",
                "status": {
                    "error_code": "FAILED_PRECONDITION",
                    "error_message": "FAILED_PRECONDITION"
                }
            }
        ]
    }
}

I was expecting this output instead:

{
    "meta-llama/Meta-Llama-3-8B-Instruct": {
        "model_version_status": [
            {
                "version": "1",
                "state": "AVAILABLE",
                "status": {
                    "error_code": "OK",
                    "error_message": "OK"
                }
            }
        ]
    }
}
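For reference, once the servable reports AVAILABLE, the demo's OpenAI-compatible endpoint can be exercised like this (a sketch based on the continuous batching demo; recent releases expose the chat API under /v3/chat/completions, adjust if your version differs):

# Example chat completion request against the OpenAI-compatible endpoint
curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_tokens": 30, "messages": [{"role": "user", "content": "What is OpenVINO?"}]}'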

Logs: To debug, I tried running the container and then starting the REST server from inside it. First I ran the container without the --rest_port flag:

docker run -d --rm -p 8000:8000 -v $(pwd)/:/workspace:ro openvino/model_server:latest --port 9000 --config_path /workspace/config.json

Then I got inside the container with docker exec -it <container_id> bash and tried to start the server myself from inside with ovms/bin/ovms --rest_port 8000 --config_path /workspace/config.json --log_level DEBUG

I get this output trace. The most relevant error is: Error parsing text-format mediapipe.CalculatorGraphConfig: 20:26: Expected string, got: { followed by [2024-09-20 16:48:23.593][460][modelmanager][error][mediapipegraphdefinition.cpp:95] Trying to parse mediapipe graph definition: meta-llama/Meta-Llama-3-8B-Instruct failed. I show only the relevant part of the trace so it does not get too long:

[2024-09-20 16:37:31.737][60][serving][info][grpcservermodule.cpp:122] GRPCServerModule starting
[2024-09-20 16:37:31.737][60][serving][debug][grpcservermodule.cpp:146] setting grpc channel argument grpc.max_concurrent_streams: 64
[2024-09-20 16:37:31.737][60][serving][debug][grpcservermodule.cpp:159] setting grpc MaxThreads ResourceQuota 512
[2024-09-20 16:37:31.737][60][serving][debug][grpcservermodule.cpp:163] setting grpc Memory ResourceQuota 2147483648
[2024-09-20 16:37:31.737][60][serving][debug][grpcservermodule.cpp:170] Starting gRPC servers: 1
[2024-09-20 16:37:31.738][60][serving][info][grpcservermodule.cpp:191] GRPCServerModule started
[2024-09-20 16:37:31.738][60][serving][info][grpcservermodule.cpp:192] Started gRPC server on port 9178
[2024-09-20 16:37:31.738][60][serving][info][httpservermodule.cpp:33] HTTPServerModule starting
[2024-09-20 16:37:31.738][60][serving][info][httpservermodule.cpp:37] Will start 256 REST workers
[2024-09-20 16:37:31.757][60][serving][info][http_server.cpp:269] REST server listening on port 8000 with 256 threads
[2024-09-20 16:37:31.757][60][serving][info][httpservermodule.cpp:47] HTTPServerModule started
[2024-09-20 16:37:31.757][60][serving][info][httpservermodule.cpp:48] Started REST server at 0.0.0.0:8000
[2024-09-20 16:37:31.757][60][serving][info][servablemanagermodule.cpp:51] ServableManagerModule starting
[2024-09-20 16:37:31.757][60][modelmanager][debug][modelmanager.cpp:874] Loading configuration from /workspace/config.json for: 1 time
[evhttp_server.cc : 253] NET_LOG: Entering the event loop ...
[2024-09-20 16:37:31.757][60][modelmanager][debug][modelmanager.cpp:678] Configuration file doesn't have monitoring property.
[2024-09-20 16:37:31.757][60][modelmanager][debug][modelmanager.cpp:926] Reading metric config only once per server start.
[2024-09-20 16:37:31.757][60][serving][debug][mediapipegraphconfig.cpp:102] graph_path not defined in config so it will be set to default based on base_path and graph name: /workspace/Meta-Llama-3-8B-Instruct/graph.pbtxt
[2024-09-20 16:37:31.757][60][serving][debug][mediapipegraphconfig.cpp:110] No subconfig path was provided for graph: meta-llama/Meta-Llama-3-8B-Instruct so default subconfig file: /workspace/Meta-Llama-3-8B-Instruct/subconfig.json will be loaded.
[2024-09-20 16:37:31.757][60][modelmanager][debug][modelmanager.cpp:783] Subconfig path: /workspace/Meta-Llama-3-8B-Instruct/subconfig.json provided for graph: meta-llama/Meta-Llama-3-8B-Instruct does not exist. Loading subconfig models will be skipped.
[2024-09-20 16:37:31.757][60][modelmanager][info][modelmanager.cpp:536] Configuration file doesn't have custom node libraries property.
[2024-09-20 16:37:31.757][60][modelmanager][info][modelmanager.cpp:579] Configuration file doesn't have pipelines property.
[2024-09-20 16:37:31.757][60][modelmanager][debug][modelmanager.cpp:368] Mediapipe graph:meta-llama/Meta-Llama-3-8B-Instruct was not loaded so far. Triggering load
[2024-09-20 16:37:31.757][60][modelmanager][debug][mediapipegraphdefinition.cpp:120] Started validation of mediapipe: meta-llama/Meta-Llama-3-8B-Instruct
[libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format mediapipe.CalculatorGraphConfig: 20:26: Expected string, got: {
[2024-09-20 16:37:31.758][60][modelmanager][error][mediapipegraphdefinition.cpp:95] Trying to parse mediapipe graph definition: meta-llama/Meta-Llama-3-8B-Instruct failed
[2024-09-20 16:37:31.758][60][modelmanager][debug][pipelinedefinitionstatus.hpp:50] Mediapipe: meta-llama/Meta-Llama-3-8B-Instruct state: BEGIN handling: ValidationFailedEvent: 
[2024-09-20 16:37:31.758][60][modelmanager][info][pipelinedefinitionstatus.hpp:59] Mediapipe: meta-llama/Meta-Llama-3-8B-Instruct state changed to: LOADING_PRECONDITION_FAILED after handling: ValidationFailedEvent: 
[2024-09-20 16:37:31.758][362][modelmanager][info][modelmanager.cpp:1068] Started model manager thread
[2024-09-20 16:37:31.758][60][serving][info][servablemanagermodule.cpp:55] ServableManagerModule started
[2024-09-20 16:37:31.758][363][modelmanager][info][modelmanager.cpp:1087] Started cleaner thread
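Since the protobuf parser points at line 20, column 26 of the graph file, inspecting that region narrows down which field the server's text-format parser rejects (a simple debugging step I am adding; path as in the demo layout):

# Show the lines around the reported parse location (20:26) in the graph file
sed -n '15,25p' Meta-Llama-3-8B-Instruct/graph.pbtxt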

Configuration

  1. OVMS version
    OpenVINO Model Server 2024.4.28219825c
    OpenVINO backend c3152d32c9c7
    Bazel build flags: --strip=always --define MEDIAPIPE_DISABLE=0 --define PYTHON_DISABLE=0 --//:distro=ubuntu
  2. OVMS config.json file
    {
      "model_config_list": [],
      "mediapipe_config_list": [
        {
          "name": "meta-llama/Meta-Llama-3-8B-Instruct",
          "base_path": "Meta-Llama-3-8B-Instruct"
        }
      ]
    }
  3. CPU, accelerator's versions if applicable
    Architecture:            x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Address sizes:         46 bits physical, 57 bits virtual
    Byte Order:            Little Endian
    CPU(s):                  64
    On-line CPU(s) list:   0-63
    Vendor ID:               GenuineIntel
    Model name:            Intel(R) Xeon(R) Gold 6426Y
    CPU family:          6
    Model:               143
    Thread(s) per core:  2
    Core(s) per socket:  16
    Socket(s):           2
  4. Model repository directory structure
    workspace
    ├── config.json
    └── Meta-Llama-3-8B-Instruct
        ├── config.json
        ├── generation_config.json
        ├── graph.pbtxt
        ├── openvino_detokenizer.bin
        ├── openvino_detokenizer.xml
        ├── openvino_model.bin
        ├── openvino_model.xml
        ├── openvino_tokenizer.bin
        ├── openvino_tokenizer.xml
        ├── special_tokens_map.json
        ├── tokenizer_config.json
        └── tokenizer.json
  5. Model or publicly available similar model that reproduces the issue meta-llama/Meta-Llama-3-8B-Instruct

Additional context: I have also tried changing the model repository structure by putting the model files inside workspace/Meta-Llama-3-8B-Instruct/1/ but that hasn't worked either.

paguilomanas commented 1 month ago

I found an updated version of graph.pbtxt in #2688 which works:

input_stream: "HTTP_REQUEST_PAYLOAD:input"
output_stream: "HTTP_RESPONSE_PAYLOAD:output"

node: {
  name: "LLMExecutor"
  calculator: "HttpLLMCalculator"
  input_stream: "LOOPBACK:loopback"
  input_stream: "HTTP_REQUEST_PAYLOAD:input"
  input_side_packet: "LLM_NODE_RESOURCES:llm"
  output_stream: "LOOPBACK:loopback"
  output_stream: "HTTP_RESPONSE_PAYLOAD:output"
  input_stream_info: {
    tag_index: "LOOPBACK:0",
    back_edge: true
  }
  node_options: {
      [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
          models_path: "./",
          plugin_config: '{"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"}',
          enable_prefix_caching: false
          cache_size: 10
      }
  }
  input_stream_handler {
    input_stream_handler: "SyncSetInputStreamHandler",
    options {
      [mediapipe.SyncSetInputStreamHandlerOptions.ext] {
        sync_set {
          tag_index: "LOOPBACK:0"
        }
      }
    }
  }
}
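After swapping in this graph.pbtxt, restarting the container with the same command as above and re-querying the config endpoint should report the servable as AVAILABLE:

# Restart the server with the updated graph and re-check the servable state
docker run -d --rm -p 8000:8000 -v $(pwd)/:/workspace:ro openvino/model_server:latest --port 9000 --rest_port 8000 --config_path /workspace/config.json
curl http://localhost:8000/v1/config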