
analysis: Introduction to KServe #22

Open sungsoo opened 2 years ago

sungsoo commented 2 years ago

KServe: A Robust and Extensible Cloud Native Model Server

Related Issues

Article Source

If you are familiar with Kubeflow, you know KFServing as the platform’s model server and inference engine. In September last year, the KFServing project went through a transformation to become KServe.

Apart from the name change, KServe is now an independent component that has graduated from the Kubeflow project. The separation allows KServe to evolve as a separate, cloud native inference engine deployed as a standalone model server. Of course, it will continue to have tight integration with Kubeflow, but the two will be treated and maintained as independent open source projects.

For a brief overview of the model server, refer to one of my previous articles at The New Stack.

KServe is collaboratively developed by Google, IBM, Bloomberg, Nvidia, and Seldon as an open source, cloud native model server for Kubernetes. The most recent version, 0.8, squarely focused on transforming the model server into a standalone component with changes to the taxonomy and nomenclature.

Let’s understand the core capabilities of KServe.

A model server is to machine learning models what an application server is to code binaries. Both provide the runtime and execution context for deployments. KServe, as a model server, provides the foundation for serving machine learning and deep learning models at scale.

KServe can be deployed as a traditional Kubernetes deployment or as a serverless deployment with support for scale-to-zero. In serverless mode, it takes advantage of Knative Serving, which brings automatic scale-up and scale-down capabilities. Istio is used as an ingress to expose the service endpoints to API consumers. The combination of Istio and Knative Serving enables exciting scenarios such as blue/green and canary deployments of models.

KServe architecture diagram

The RawDeployment Mode, which lets you use KServe without Knative Serving, supports traditional scaling techniques such as Horizontal Pod Autoscaler (HPA) but lacks support for scale-to-zero.
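
To make the two modes concrete, here is a minimal sketch that creates an InferenceService with the kserve Python SDK, roughly as documented for the 0.8-era releases. The namespace, service name, and the public example model URI are illustrative assumptions; omitting the `serving.kserve.io/deploymentMode` annotation leaves the service in the default serverless (Knative) mode.

```python
# Sketch: creating an InferenceService with the kserve Python SDK.
# The namespace, service name, and model URI below are illustrative;
# class names follow the KServe 0.8-era documentation.
from kubernetes import client
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1SKLearnSpec,
    constants,
)

isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_GROUP + "/v1beta1",
    kind=constants.KSERVE_KIND,
    metadata=client.V1ObjectMeta(
        name="sklearn-iris",
        namespace="kserve-test",
        # Omit this annotation to use the default serverless (Knative) mode;
        # RawDeployment trades scale-to-zero for plain Deployments plus HPA.
        annotations={"serving.kserve.io/deploymentMode": "RawDeployment"},
    ),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            sklearn=V1beta1SKLearnSpec(
                storage_uri="gs://kfserving-examples/models/sklearn/1.0/model"
            )
        )
    ),
)

KServeClient().create(isvc)
```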

KServe Architecture

The KServe model server has a control plane and a data plane. The control plane manages and reconciles the custom resources responsible for inference. In serverless mode, it coordinates with Knative resources to manage autoscaling.

KServe control plane

At the heart of the KServe control plane is the KServe Controller, which manages the lifecycle of an inference service. It is responsible for creating the service and ingress resources, the model server container, and the model agent container that handles request/response logging, batching, and pulling models from the model store. The model store is a repository of models registered with the model server, typically an object storage service such as Amazon S3, Google Cloud Storage, Azure Storage, or MinIO.
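
Assuming the hypothetical `sklearn-iris` service from the sketch above, the same SDK client can wait for the controller to finish reconciling and then read the endpoint URL the controller writes into the InferenceService status. The `watch`/`timeout_seconds` usage follows the kserve SDK docs; treat it as a sketch rather than a guaranteed API.

```python
# Sketch: waiting for the KServe controller to reconcile the InferenceService
# and report a ready endpoint (name/namespace from the example above).
from kserve import KServeClient

kserve_client = KServeClient()

# Block until the controller reports the service as ready (prints progress).
kserve_client.get("sklearn-iris", namespace="kserve-test",
                  watch=True, timeout_seconds=180)

# Fetch the reconciled object and read the URL surfaced in its status.
isvc = kserve_client.get("sklearn-iris", namespace="kserve-test")
print(isvc["status"]["url"])
```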

The data plane manages the request/response cycle targeting a specific model. It consists of predictor, transformer, and explainer components.

An AI application sends a REST or gRPC request to the predictor endpoint. The predictor acts as an inference pipeline that invokes the transformer component, which can perform pre-processing of the inbound data (request) and post-processing of the outbound data (response). Optionally, an explainer component can bring AI explainability to the hosted models. KServe encourages the use of the V2 inference protocol, which is interoperable and extensible.

The data plane also has endpoints to check the readiness and health of models, and it exposes APIs for retrieving model metadata.
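
The following sketch exercises those data plane endpoints with plain `requests`, using the V2 (Open Inference) protocol paths. The ingress address, hostname, model name, and tensor shape are placeholders; when traffic goes through the Istio ingress gateway, the InferenceService hostname is typically passed in the `Host` header.

```python
# Sketch: calling the V2 inference protocol endpoints of a KServe data plane.
# INGRESS, HOST, and MODEL are placeholders for your cluster's values.
import requests

INGRESS = "http://<ingress-gateway-ip>"          # Istio ingress address
HOST = "sklearn-iris.kserve-test.example.com"    # InferenceService hostname
MODEL = "sklearn-iris"
headers = {"Host": HOST}

# Health and readiness checks exposed by the data plane.
print(requests.get(f"{INGRESS}/v2/health/ready", headers=headers).status_code)
print(requests.get(f"{INGRESS}/v2/models/{MODEL}/ready", headers=headers).status_code)

# Model metadata (inputs, outputs, platform).
print(requests.get(f"{INGRESS}/v2/models/{MODEL}", headers=headers).json())

# Inference request in the V2 format: named tensors with shape, datatype, data.
payload = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [[6.8, 2.8, 4.8, 1.4]],
        }
    ]
}
response = requests.post(
    f"{INGRESS}/v2/models/{MODEL}/infer", headers=headers, json=payload
)
print(response.json()["outputs"])
```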

Supported Frameworks and Runtimes

KServe supports a wide range of machine learning and deep learning frameworks. For deep learning, it works with existing serving infrastructure such as TensorFlow Serving, TorchServe, and Triton Inference Server. Through Triton, KServe can host TensorFlow, ONNX, PyTorch, and TensorRT runtimes.

For classical machine learning models based on SKLearn, XGBoost, Spark MLlib, and LightGBM, KServe relies on Seldon’s MLServer.

The extensible framework of KServe makes it possible to plug in any runtime that adheres to the V2 inference protocol.
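
As an illustration of that extensibility, the sketch below wraps a trivial, invented `EchoModel` in the `kserve` Python package’s `Model`/`ModelServer` classes, which is the usual way to build a custom runtime. The exact `predict` signature has varied slightly across kserve releases, so treat this as a sketch rather than a definitive implementation.

```python
# Sketch: a custom runtime built on the kserve Python package.
# "EchoModel" and its predict logic are purely illustrative.
from typing import Dict

import kserve


class EchoModel(kserve.Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.ready = False
        self.load()

    def load(self):
        # Load weights/artifacts here; mark the model ready when done.
        self.ready = True

    def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        # Echo the V2-style inputs back as "predictions".
        return {"predictions": payload.get("inputs", [])}


if __name__ == "__main__":
    kserve.ModelServer().start([EchoModel("echo-model")])
```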

Multimodel Serving with ModelMesh

KServe deploys one model per InferenceService, limiting the platform’s scalability to the available CPUs and GPUs. This limitation becomes obvious when running inference on GPUs, which are expensive and scarce compute resources.

With multi-model serving, we can overcome the limitations of the infrastructure: compute resources, maximum pods, and maximum IP addresses.

ModelMesh Serving, developed by IBM, is a Kubernetes-based platform for real-time serving of ML/DL models, optimized for high-volume and high-density use cases. Similar to an operating system that manages processes to optimally utilize the available resources, ModelMesh optimizes the deployed models to run efficiently within the cluster.

ModelMesh Serving diagram

Through intelligent management of in-memory model data across clusters of deployed pods, and the usage of those models over time, the system maximizes the use of available cluster resources.

ModelMesh Serving is based on the KServe V2 data plane API for inferencing, which makes it possible to deploy it as a runtime similar to NVIDIA Triton Inference Server. When a request hits the KServe data plane, it is simply delegated to ModelMesh Serving.

The integration of ModelMesh Serving with KServe is currently in alpha. As both projects mature, there will be tighter integration, making it possible to mix and match the features and capabilities of both platforms.
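
Under the alpha integration, routing an InferenceService to ModelMesh is done through the `serving.kserve.io/deploymentMode: ModelMesh` annotation together with the newer model-format predictor spec. The sketch below assumes recent kserve SDK class names (`V1beta1ModelSpec`, `V1beta1ModelFormat`) and uses placeholder names and storage locations; details differ between releases.

```python
# Sketch: an InferenceService targeted at ModelMesh Serving instead of the
# standalone KServe deployment path. Names and the storage URI are placeholders.
from kubernetes import client
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1ModelFormat,
    V1beta1ModelSpec,
    V1beta1PredictorSpec,
    constants,
)

isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_GROUP + "/v1beta1",
    kind=constants.KSERVE_KIND,
    metadata=client.V1ObjectMeta(
        name="mnist-svm",
        namespace="modelmesh-serving",
        # This annotation routes the InferenceService to ModelMesh.
        annotations={"serving.kserve.io/deploymentMode": "ModelMesh"},
    ),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            model=V1beta1ModelSpec(
                model_format=V1beta1ModelFormat(name="sklearn"),
                storage_uri="s3://models/sklearn/mnist-svm.joblib",
            )
        )
    ),
)

KServeClient().create(isvc)
```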

With model serving becoming the core building block of MLOps, open source projects such as KServe become important. The extensibility of KServe to use existing and upcoming runtimes makes it a unique model serving platform.

In the upcoming articles, I will walk you through the steps of deploying KServe on a GPU-based Kubernetes cluster to perform inference on a TensorFlow model. Stay tuned.

sungsoo commented 2 years ago

Model Server: The Critical Building Block of MLOps

Article Source

When we think of machine learning, what comes to mind are the datasets, algorithms, deep learning frameworks, and training the neural networks. While they play an important role in the lifecycle of a model, there is more to it. The most crucial step in a typical machine learning operations (MLOps) implementation is deploying and monitoring models, which is often an afterthought.

A common misconception is that deploying models is as simple as wrapping them in a Flask or Django API layer and exposing them through a REST endpoint. Unfortunately, this is not the most scalable or efficient approach to operationalizing ML models. We need a robust infrastructure for managing the deployments and the inference of the models.

With containers becoming the de facto standard for deploying modern applications, the infrastructure for serving models should integrate well with the cloud native platforms such as Kubernetes and Prometheus.

What Is a Model Server?

If you have consumed cloud-based AI services such as Amazon Rekognition, Azure Cognitive Services, and Google Cloud AI Services, you appreciate those APIs’ simplicity and convenience. Simply put, a model server lets you build a similar platform to deliver inference as a service.

A model server is to machine learning models what an application server is to binaries. Just like an application server provides the runtime and deployment services for WAR/JAR files, DLLs, and executables, a model server provides the runtime context for machine learning and deep learning models. It then exposes the deployed models as REST/gRPC endpoints.

Since a model server effectively decouples the inference code from the model artifact, it scales better than a self-hosted Flask or Django web API. This decoupling enables MLOps engineers to deploy new versions of a model without changing the client inference code.
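
As one concrete example of that decoupling, the sketch below calls TensorFlow Serving’s REST API; the host, model name, and feature values are placeholders. Promoting a new model version behind the default endpoint requires no change to this client code.

```python
# Sketch: a client calling a model server (TensorFlow Serving's REST API here).
# Host, model name, and input values are placeholders.
import requests

BASE = "http://model-server:8501/v1/models/churn"

# The client only knows the endpoint and the request schema.
payload = {"instances": [[0.2, 1.4, 3.1, 0.7]]}
print(requests.post(f"{BASE}:predict", json=payload).json())

# Pinning a specific version is possible, but when the server promotes a new
# version behind the default endpoint, the client code above does not change.
print(requests.post(f"{BASE}/versions/2:predict", json=payload).json())
```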

TensorFlow Serving, TorchServe, Multi Model Server, OpenVINO Model Server, Triton Inference Server, BentoML, Seldon Core, and KServe are some of the most popular model servers. Though each is designed around a specific framework or runtime, their architectures are extensible enough to support multiple machine learning and deep learning frameworks.

Model Server Architecture

A typical model server loads the model artifacts and dependencies from a centralized location which could be a shared filesystem or an object storage bucket. It then associates the model with the corresponding runtime environment such as TensorFlow or PyTorch before exposing it as a REST/gRPC endpoint. The model server also captures the metrics related to API invocation and inference output. These metrics are useful for monitoring the performance of each model and also the health of the overall model serving infrastructure.

Let’s take a look at each of the components of a model server:

Client

The client is a web, desktop, or mobile application that consumes the model exposed by the model server through APIs. Any client capable of making an HTTP request can interact with the model server. For performance and scalability, clients can use the gRPC endpoint instead of REST. Model servers also publish client SDKs that simplify the integration of ML APIs with applications.
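
For instance, a hedged sketch using NVIDIA’s `tritonclient` gRPC SDK instead of raw REST calls might look like the following; the model name and the `input__0`/`output__0` tensor names are placeholders.

```python
# Sketch: using a gRPC client SDK (tritonclient) instead of raw REST calls.
# The model name and tensor names ("input__0", "output__0") are placeholders.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

inputs = [grpcclient.InferInput("input__0", [1, 4], "FP32")]
inputs[0].set_data_from_numpy(np.array([[0.2, 1.4, 3.1, 0.7]], dtype=np.float32))

result = client.infer(model_name="churn", inputs=inputs)
print(result.as_numpy("output__0"))
```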

Model Server

The model server is responsible for loading the models, reading the associated metadata, then instantiating the endpoints. It routes the client requests to an appropriate version of the model. The most important function of a model server is to efficiently manage the compute resources by dynamically mapping and unmapping the active models. For example, the model server may load and unload a model from the GPU depending on the request queue length and the frequency of invocation. This technique makes it possible to utilize the same GPU for multiple models without locking the resources.
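
Triton Inference Server exposes this kind of control explicitly through its model repository extension, which is one concrete way load/unload management surfaces in practice. The sketch below assumes the server runs with explicit model control enabled and uses placeholder host and model names.

```python
# Sketch: explicit model load/unload via Triton's repository extension
# (the server must run with --model-control-mode=explicit).
# Host and model name are placeholders.
import requests

BASE = "http://triton:8000"

# List models known to the repository along with their load state.
print(requests.post(f"{BASE}/v2/repository/index").json())

# Load a model into memory (e.g. onto a GPU) and later release it.
requests.post(f"{BASE}/v2/repository/models/churn/load")
requests.post(f"{BASE}/v2/repository/models/churn/unload")
```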

Runtimes/Backends

A model server may support one or more frameworks and runtimes. Its pluggable, extensible architecture makes it possible to bring new frameworks and runtimes into the stack. For example, Nvidia’s Triton Inference Server supports multiple frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, XGBoost, and Scikit-learn.

Model Registry

The model registry is a centralized persistent layer to store model artifacts and binaries. It is accessed by the model server to load a specific version of the model requested by a client. A model registry may store multiple models and multiple versions of the same model. Each model also contains additional metadata describing the runtime requirements, input parameters, data types, and output parameters. It may optionally include a text/JSON file with the labels that can be used to associate the inference output with a meaningful label.

Though the model registry could be a directory on a filesystem, an object storage bucket is preferred. When multiple model server instances need access to the registry, an object storage layer serves better than a shared filesystem.

For a detailed explanation and a step-by-step tutorial, refer to my guide on using MinIO as the model store for Nvidia Triton Inference Server running on Kubernetes.
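
As a rough sketch of publishing to an object-storage-backed registry, the example below uploads a versioned model layout to a MinIO bucket with the `minio` Python client; the endpoint, credentials, bucket, and file paths are placeholders, and the `<model>/<version>/<artifact>` layout mirrors what Triton-style servers expect.

```python
# Sketch: publishing a model version to an object-storage-backed registry
# using the minio client. Endpoint, credentials, bucket, and paths are placeholders.
from minio import Minio

client = Minio(
    "minio.example.com:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    secure=False,
)

bucket = "models"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

# A <model>/<version>/<artifact> layout keeps multiple versions side by side.
client.fput_object(bucket, "churn/1/model.savedmodel/saved_model.pb",
                   "export/saved_model.pb")
client.fput_object(bucket, "churn/config.pbtxt", "export/config.pbtxt")
```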

Metrics

The model server exposes a metrics endpoint that can be scraped by a metrics server such as Prometheus. Apart from monitoring the health of the model serving infrastructure, the metrics service can be used to track API metrics such as the number of concurrent requests, the current request queue length, and latency.
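
A minimal sketch of reading such an endpoint directly, assuming Triton’s default metrics port and `nv_inference_*` metric names (other servers expose a similar Prometheus-format `/metrics` endpoint on their own port):

```python
# Sketch: reading a model server's Prometheus-format metrics endpoint.
# Port 8002 and the nv_inference_* prefix match Triton's defaults; other
# servers expose a similar /metrics endpoint on their own port.
import requests

metrics = requests.get("http://triton:8002/metrics").text
for line in metrics.splitlines():
    if line.startswith("nv_inference_request"):  # request counts, latency, etc.
        print(line)
```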

sungsoo commented 2 years ago

Istio References