openvinotoolkit / model_server

A scalable inference server for models optimized with OpenVINO™
https://docs.openvino.ai/2024/ovms_what_is_openvino_model_server.html
Apache License 2.0

Demo: LLM models with Continuous Batching via OpenAI API not working #2704

Closed paguilomanas closed 1 month ago

paguilomanas commented 1 month ago

Describe the bug:

Hi everyone, I wanted to serve an optimized LLM model using OVMS. I have followed the continuous batching demo, but when I run the container and check the model status it stays in the LOADING state with errors indicating that the deployment failed, so the model endpoint never becomes ready and I cannot make client requests.

To Reproduce: I have followed the demo steps:

  1. I have pulled the latest OVMS image:

docker pull openvino/model_server:latest
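To confirm which server build was actually pulled (an optional check I am adding here; it assumes the image entrypoint accepts --version as in recent releases, and the string it prints matches the one listed in the Configuration section below):

# Print the version of the pulled server image
docker run --rm openvino/model_server:latest --version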

  2. I have installed the required Python dependencies in my environment (Python 3.10):
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu https://storage.openvinotoolkit.org/simple/wheels/pre-release"
pip3 install --pre "optimum-intel[nncf,openvino]"@git+https://github.com/huggingface/optimum-intel.git  openvino_tokenizers==2024.4.* openvino==2024.4.*
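As an optional sanity check (my own addition), the resolved package versions can be listed to confirm the 2024.4 pre-release wheels were actually installed:

# Optional: confirm the versions of the key packages picked up by pip
pip3 show openvino openvino-tokenizers optimum-intel | grep -E '^(Name|Version):'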
  3. I have moved to the required folder and run optimum-cli to export the model and tokenizer to IR format (.bin and .xml) with dtype=FP16:
cd demos/continuous_batching
convert_tokenizer -o Meta-Llama-3-8B-Instruct --utf8_replace_mode replace --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens meta-llama/Meta-Llama-3-8B-Instruct
optimum-cli export openvino --disable-convert-tokenizer --model meta-llama/Meta-Llama-3-8B-Instruct --weight-format fp16 Meta-Llama-3-8B-Instruct 

An important detail to point out: since I already had the HF model downloaded in a specific path, I set the environment variable export HF_HOME="/mnt/shared_models/huggingface/cache".
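As an extra check after the export (my own addition, paths as in the demo layout), the model folder can be listed to confirm the IR and tokenizer/detokenizer files were produced:

# Optional: verify the exported files are in place before copying the graph
ls Meta-Llama-3-8B-Instruct/
# expected among others: openvino_model.xml/.bin, openvino_tokenizer.xml/.bin, openvino_detokenizer.xml/.bin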

  4. I have copied the graph file into the optimized model folder without changing any field:

cp graph.pbtxt Meta-Llama-3-8B-Instruct/graph.pbtxt

I have all the expected files inside the model folder (the full listing is shown in the Configuration section below).
  5. I have prepared the config.json file as provided (its contents are shown in the Configuration section below).
  6. And finally I ran the container:

docker run -d --rm -p 8000:8000 -v $(pwd)/:/workspace:ro openvino/model_server:latest --port 9000 --rest_port 8000 --config_path /workspace/config.json
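To see why loading fails without exec-ing into the container, the server log of the detached container can also be inspected directly (a generic Docker check I am adding; <container_id> is whatever docker run printed):

# Inspect the server log of the running container for loading errors
docker logs <container_id> 2>&1 | grep -iE 'error|failed'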

When I run curl http://localhost:8000/v1/config to check the served model status, I get this output:

{
    "meta-llama/Meta-Llama-3-8B-Instruct": {
        "model_version_status": [
            {
                "version": "1",
                "state": "LOADING",
                "status": {
                    "error_code": "FAILED_PRECONDITION",
                    "error_message": "FAILED_PRECONDITION"
                }
            }
        ]
    }
}

I was expecting this output instead:

{
    "meta-llama/Meta-Llama-3-8B-Instruct": {
        "model_version_status": [
            {
                "version": "1",
                "state": "AVAILABLE",
                "status": {
                    "error_code": "OK",
                    "error_message": "OK"
                }
            }
        ]
    }
}
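For reference, once the servable reports AVAILABLE, the demo's OpenAI-compatible endpoint can be exercised like this (a sketch based on the continuous batching demo; recent releases expose the chat API under /v3/chat/completions, adjust if your version differs):

# Example chat completion request against the OpenAI-compatible endpoint
curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_tokens": 30, "messages": [{"role": "user", "content": "What is OpenVINO?"}]}'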

Logs: To debug, I tried running the container and then starting the REST server from inside it. First I ran the container without the --rest_port flag:

docker run -d --rm -p 8000:8000 -v $(pwd)/:/workspace:ro openvino/model_server:latest --port 9000 --config_path /workspace/config.json

Then I got inside the container with docker exec -it <container_id> bash and tried to start the server myself from inside with ovms/bin/ovms --rest_port 8000 --config_path /workspace/config.json --log_level DEBUG

I get this output trace. The most relevant error is: Error parsing text-format mediapipe.CalculatorGraphConfig: 20:26: Expected string, got: { followed by [2024-09-20 16:48:23.593][460][modelmanager][error][mediapipegraphdefinition.cpp:95] Trying to parse mediapipe graph definition: meta-llama/Meta-Llama-3-8B-Instruct failed. I show only the relevant part of the trace so it does not get too long:

[2024-09-20 16:37:31.737][60][serving][info][grpcservermodule.cpp:122] GRPCServerModule starting
[2024-09-20 16:37:31.737][60][serving][debug][grpcservermodule.cpp:146] setting grpc channel argument grpc.max_concurrent_streams: 64
[2024-09-20 16:37:31.737][60][serving][debug][grpcservermodule.cpp:159] setting grpc MaxThreads ResourceQuota 512
[2024-09-20 16:37:31.737][60][serving][debug][grpcservermodule.cpp:163] setting grpc Memory ResourceQuota 2147483648
[2024-09-20 16:37:31.737][60][serving][debug][grpcservermodule.cpp:170] Starting gRPC servers: 1
[2024-09-20 16:37:31.738][60][serving][info][grpcservermodule.cpp:191] GRPCServerModule started
[2024-09-20 16:37:31.738][60][serving][info][grpcservermodule.cpp:192] Started gRPC server on port 9178
[2024-09-20 16:37:31.738][60][serving][info][httpservermodule.cpp:33] HTTPServerModule starting
[2024-09-20 16:37:31.738][60][serving][info][httpservermodule.cpp:37] Will start 256 REST workers
[2024-09-20 16:37:31.757][60][serving][info][http_server.cpp:269] REST server listening on port 8000 with 256 threads
[2024-09-20 16:37:31.757][60][serving][info][httpservermodule.cpp:47] HTTPServerModule started
[2024-09-20 16:37:31.757][60][serving][info][httpservermodule.cpp:48] Started REST server at 0.0.0.0:8000
[2024-09-20 16:37:31.757][60][serving][info][servablemanagermodule.cpp:51] ServableManagerModule starting
[2024-09-20 16:37:31.757][60][modelmanager][debug][modelmanager.cpp:874] Loading configuration from /workspace/config.json for: 1 time
[evhttp_server.cc : 253] NET_LOG: Entering the event loop ...
[2024-09-20 16:37:31.757][60][modelmanager][debug][modelmanager.cpp:678] Configuration file doesn't have monitoring property.
[2024-09-20 16:37:31.757][60][modelmanager][debug][modelmanager.cpp:926] Reading metric config only once per server start.
[2024-09-20 16:37:31.757][60][serving][debug][mediapipegraphconfig.cpp:102] graph_path not defined in config so it will be set to default based on base_path and graph name: /workspace/Meta-Llama-3-8B-Instruct/graph.pbtxt
[2024-09-20 16:37:31.757][60][serving][debug][mediapipegraphconfig.cpp:110] No subconfig path was provided for graph: meta-llama/Meta-Llama-3-8B-Instruct so default subconfig file: /workspace/Meta-Llama-3-8B-Instruct/subconfig.json will be loaded.
[2024-09-20 16:37:31.757][60][modelmanager][debug][modelmanager.cpp:783] Subconfig path: /workspace/Meta-Llama-3-8B-Instruct/subconfig.json provided for graph: meta-llama/Meta-Llama-3-8B-Instruct does not exist. Loading subconfig models will be skipped.
[2024-09-20 16:37:31.757][60][modelmanager][info][modelmanager.cpp:536] Configuration file doesn't have custom node libraries property.
[2024-09-20 16:37:31.757][60][modelmanager][info][modelmanager.cpp:579] Configuration file doesn't have pipelines property.
[2024-09-20 16:37:31.757][60][modelmanager][debug][modelmanager.cpp:368] Mediapipe graph:meta-llama/Meta-Llama-3-8B-Instruct was not loaded so far. Triggering load
[2024-09-20 16:37:31.757][60][modelmanager][debug][mediapipegraphdefinition.cpp:120] Started validation of mediapipe: meta-llama/Meta-Llama-3-8B-Instruct
[libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format mediapipe.CalculatorGraphConfig: 20:26: Expected string, got: {
[2024-09-20 16:37:31.758][60][modelmanager][error][mediapipegraphdefinition.cpp:95] Trying to parse mediapipe graph definition: meta-llama/Meta-Llama-3-8B-Instruct failed
[2024-09-20 16:37:31.758][60][modelmanager][debug][pipelinedefinitionstatus.hpp:50] Mediapipe: meta-llama/Meta-Llama-3-8B-Instruct state: BEGIN handling: ValidationFailedEvent: 
[2024-09-20 16:37:31.758][60][modelmanager][info][pipelinedefinitionstatus.hpp:59] Mediapipe: meta-llama/Meta-Llama-3-8B-Instruct state changed to: LOADING_PRECONDITION_FAILED after handling: ValidationFailedEvent: 
[2024-09-20 16:37:31.758][362][modelmanager][info][modelmanager.cpp:1068] Started model manager thread
[2024-09-20 16:37:31.758][60][serving][info][servablemanagermodule.cpp:55] ServableManagerModule started
[2024-09-20 16:37:31.758][363][modelmanager][info][modelmanager.cpp:1087] Started cleaner thread
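Since the protobuf parser points at line 20, column 26 of the graph file, inspecting that region narrows down which field the server's text-format parser rejects (a simple debugging step I am adding; path as in the demo layout):

# Show the lines around the reported parse location (20:26) in the graph file
sed -n '15,25p' Meta-Llama-3-8B-Instruct/graph.pbtxt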

Configuration

  1. OVMS version
    OpenVINO Model Server 2024.4.28219825c
    OpenVINO backend c3152d32c9c7
    Bazel build flags: --strip=always --define MEDIAPIPE_DISABLE=0 --define PYTHON_DISABLE=0 --//:distro=ubuntu
  2. OVMS config.json file
    {
      "model_config_list": [],
      "mediapipe_config_list": [
        {
          "name": "meta-llama/Meta-Llama-3-8B-Instruct",
          "base_path": "Meta-Llama-3-8B-Instruct"
        }
      ]
    }
  3. CPU, accelerator's versions if applicable
    Architecture:            x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Address sizes:         46 bits physical, 57 bits virtual
    Byte Order:            Little Endian
    CPU(s):                  64
    On-line CPU(s) list:   0-63
    Vendor ID:               GenuineIntel
    Model name:            Intel(R) Xeon(R) Gold 6426Y
    CPU family:          6
    Model:               143
    Thread(s) per core:  2
    Core(s) per socket:  16
    Socket(s):           2
  4. Model repository directory structure
    workspace
    ├── config.json
    └── Meta-Llama-3-8B-Instruct
        ├── config.json
        ├── generation_config.json
        ├── graph.pbtxt
        ├── openvino_detokenizer.bin
        ├── openvino_detokenizer.xml
        ├── openvino_model.bin
        ├── openvino_model.xml
        ├── openvino_tokenizer.bin
        ├── openvino_tokenizer.xml
        ├── special_tokens_map.json
        ├── tokenizer_config.json
        └── tokenizer.json
  5. Model or publicly available similar model that reproduces the issue meta-llama/Meta-Llama-3-8B-Instruct

Additional context: I have also tried changing the model repository structure by putting the model files inside workspace/Meta-Llama-3-8B-Instruct/1/ but that hasn't worked either.

paguilomanas commented 1 month ago

I found an updated version of graph.pbtxt in #2688 which works:

input_stream: "HTTP_REQUEST_PAYLOAD:input"
output_stream: "HTTP_RESPONSE_PAYLOAD:output"

node: {
  name: "LLMExecutor"
  calculator: "HttpLLMCalculator"
  input_stream: "LOOPBACK:loopback"
  input_stream: "HTTP_REQUEST_PAYLOAD:input"
  input_side_packet: "LLM_NODE_RESOURCES:llm"
  output_stream: "LOOPBACK:loopback"
  output_stream: "HTTP_RESPONSE_PAYLOAD:output"
  input_stream_info: {
    tag_index: "LOOPBACK:0",
    back_edge: true
  }
  node_options: {
      [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
          models_path: "./",
          plugin_config: '{"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"}',
          enable_prefix_caching: false
          cache_size: 10
      }
  }
  input_stream_handler {
    input_stream_handler: "SyncSetInputStreamHandler",
    options {
      [mediapipe.SyncSetInputStreamHandlerOptions.ext] {
        sync_set {
          tag_index: "LOOPBACK:0"
        }
      }
    }
  }
}
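After swapping in this graph.pbtxt, restarting the container with the same command as above and re-querying the config endpoint should report the servable as AVAILABLE:

# Restart the server with the updated graph and re-check the servable state
docker run -d --rm -p 8000:8000 -v $(pwd)/:/workspace:ro openvino/model_server:latest --port 9000 --rest_port 8000 --config_path /workspace/config.json
curl http://localhost:8000/v1/config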