kserve / kserve

Standardized Serverless ML Inference Platform on Kubernetes
https://kserve.github.io/website/
Apache License 2.0

Improve KServe model server observability with metrics and distributed tracing #2668

Open yungParrot opened 1 year ago

yungParrot commented 1 year ago

/kind feature

This issue was specified in the 2023 Roadmap. I wanted to add OpenTelemetry support to the KServe Python SDK so it could be integrated with tools like Jaeger or Zipkin.

I'd like to work on this issue. Could you please add it to the KServe 0.11 board?

:cowboy_hat_face:

yuzisun commented 1 year ago

@yungParrot Added!! We would really love to see the OpenTelemetry integration. You can also join our biweekly working group meeting to discuss this.

yungParrot commented 1 year ago

@yuzisun when and where are those meetings held?

I'll need to know which metrics and traces you want to track. For example, adding something like this:

from opentelemetry import trace

class KServeClient(object):

    def __init__(self, config_file=None, context=None,  # pylint: disable=too-many-arguments
                 client_configuration=None, persist_config=True):
        """
        KServe client constructor
        :param config_file: kubeconfig file, defaults to ~/.kube/config
        :param context: kubernetes context
        :param client_configuration: kubernetes configuration object
        :param persist_config:
        """
        tracer = trace.get_tracer(__name__)
        with tracer.start_as_current_span("KServeClient") as span:
            span.set_attribute("created", str(self))

...could then be used by the user like this:

from kserve import KServeClient
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource

resource = Resource(attributes={
    SERVICE_NAME: "your-service-name"
})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

kserve_client = KServeClient(config_file='./kserve/test/kubeconfig')

...and would produce output like this:

{
    "name": "KServeClient",
    "context": {
        "trace_id": "0xdd22c93c595f9c3d8087d46c416d48cc",
        "span_id": "0x6dd718d4f7f56a99",
        "trace_state": "[]"
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": null,
    "start_time": "2023-02-05T18:12:20.175541Z",
    "end_time": "2023-02-05T18:12:20.175585Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "created": "<kserve.api.kserve_client.KServeClient object at 0x7efdee1dfa30>"
    },
    "events": [],
    "links": [],
    "resource": {
        "attributes": {
            "service.name": "your-service-name"
        },
        "schema_url": ""
    }
}

The OpenTelemetry exporters (Jaeger/Zipkin/OTLP Collector) would then be configured by the user. What do you think? :cowboy_hat_face:

yungParrot commented 1 year ago

> when and where are those meetings held?

nvm - I've found the meeting calendar :point_left:

yuzisun commented 1 year ago

@yungParrot Are you still working on this? I think your idea sounds good, and the exporter should be configured by the user, the way Knative does it. Just note that KServeClient is the Kubernetes client, not the HTTP client. I think we would want to trace the HTTP requests, for example from transformer to predictor, or add spans for preprocess, predict and postprocess.
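For the transformer-to-predictor case, the trace context has to travel with the HTTP request. OpenTelemetry's default propagator uses the W3C Trace Context `traceparent` header for this. The sketch below hand-rolls that header with only the standard library to show what goes over the wire. It is an illustration, not KServe code: the helper names (`build_traceparent`, `parse_traceparent`) are hypothetical, only version `00` is handled, and in a real integration the OpenTelemetry propagators would do this work. The ids are the ones from the sample span output above.

```python
import re
import secrets

# W3C Trace Context value: version-traceid-parentid-flags, lowercase hex.
# Only version "00" is handled in this sketch.
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a traceparent header value for an outgoing request."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(value: str):
    """Return (trace_id, parent_span_id, sampled), or None if malformed."""
    match = _TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero ids are invalid according to the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


# Hypothetical transformer side: attach the current span to the outgoing headers.
trace_id = "dd22c93c595f9c3d8087d46c416d48cc"
transformer_span_id = "6dd718d4f7f56a99"
headers = {"traceparent": build_traceparent(trace_id, transformer_span_id)}

# Hypothetical predictor side: continue the same trace under a new child span id.
parsed = parse_traceparent(headers["traceparent"])
if parsed is not None:
    incoming_trace_id, parent_span_id, sampled = parsed
    predictor_span_id = secrets.token_hex(8)  # 8 random bytes, 16 hex chars
    print(f"trace={incoming_trace_id} parent={parent_span_id} span={predictor_span_id}")
```

Because both sides share the same trace id, a backend such as Jaeger or Zipkin can stitch the transformer and predictor spans into one trace.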

yungParrot commented 1 year ago

> Are you still working on this?

@yuzisun yes, but I didn't have much time last month. I wanted to create a Design Doc to describe what I'm planning to do 🤔 I'll update you once I have something to show you 🤠

yuzisun commented 1 year ago

@yungParrot sounds great! Looking forward to the design doc

yungParrot commented 1 year ago

@yuzisun my design doc is available here - please let me know if there is anything I should improve 🤔

yuzisun commented 1 year ago

> @yuzisun my design doc is available here - please let me know if there is anything I should improve 🤔

@yungParrot thanks! Do you want to present at the KServe community meeting next Wednesday?

yungParrot commented 1 year ago

> @yungParrot thanks! Do you want to present at the KServe community meeting next Wednesday?

@yuzisun I'm not sure if I will be able to attend that meeting

yuzisun commented 1 year ago

@yungParrot We are planning this feature for KServe 0.12. Are you still interested in working on this?

yungParrot commented 1 year ago

@yuzisun yes

sivanantha321 commented 1 year ago

Hey @yungParrot have you started working on this?

yungParrot commented 1 year ago

@sivanantha321 yes, although I've run into some problems and I'm not sure about the next steps. I understand how the Python code can be instrumented with OpenTelemetry, but I'm not sure how to create e2e tests for it or how OpenTelemetry should be configured on the cluster itself. There are multiple ways of installing OpenTelemetry operators and collectors on a k8s cluster, and I'm not sure which approach to take.

sivanantha321 commented 1 year ago

@yungParrot I can help you with e2e tests. We can connect on Slack.

andyi2it commented 7 months ago

/assign @andyi2it