opendatahub-io / odh-model-controller

Apache License 2.0
2 stars 48 forks source link

bug: modelmesh container has error logs when kserve runtime is running #125

Open Jooho opened 10 months ago

Jooho commented 10 months ago

When KServe and ModelMesh are running in the same namespace, the modelmesh container shows these errors:

{"instant":{"epochSecond":1701291690,"nanoOfSecond":966215444},"thread":"ll-elg-thread-2","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Returning READY to readiness probe (did not find any other pods in terminating state)","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":35,"threadPriority":5}
Nov 29, 2023 9:01:37 PM io.grpc.netty.NettyServerTransport notifyTerminated
INFO: Transport failed
io.netty.handler.codec.http2.Http2Exception: Unexpected HTTP/1.x request: GET /stats/prometheus
at io.netty.handler.codec.http2.Http2Exception.connectionError(Http2Exception.java:109)
at io.netty.handler.codec.http2.Http2ConnectionHandler$PrefaceDecoder.readClientPrefaceString(Http2ConnectionHandler.java:317)
at io.netty.handler.codec.http2.Http2ConnectionHandler$PrefaceDecoder.decode(Http2ConnectionHandler.java:247)
at io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:453)
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468)

There are 3 NetworkPolicies in the namespace:

If the allow-from-openshift-monitoring-ns NetworkPolicy is deleted, the error message no longer shows up, so this NetworkPolicy looks like the culprit of this issue. However, this is not 100% certain, so it needs more debugging.
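For context, an allow-from-openshift-monitoring-ns NetworkPolicy typically admits ingress from the openshift-monitoring namespace, which is what lets the Prometheus scraper reach the pod in the first place. The exact spec in this cluster is not shown in the thread; the following is only a hedged sketch of what such a policy commonly looks like:

```yaml
# Hypothetical sketch -- the real policy in the cluster may differ.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-openshift-monitoring-ns
spec:
  podSelector: {}                  # applies to all pods in the namespace
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: openshift-monitoring
```

Deleting such a policy blocks the scrape traffic, which is consistent with the error disappearing when the policy is removed.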

Reference: https://github.com/orgs/opendatahub-io/projects/42?pane=issue&itemId=40292089

skonto commented 9 months ago

By setting the label istio-prometheus-ignore="true" on the modelmesh pod, you can avoid the scraping on port 15020 that happens on the modelmesh pod. See:

Name:         istio-proxies-monitor
Namespace:    kserve-demo
... 
Spec:
  Namespace Selector:
  Pod Metrics Endpoints:
    Bearer Token Secret:
      Key:     
    Interval:  30s
    Path:      /stats/prometheus
  Selector:
    Match Expressions:
      Key:       istio-prometheus-ignore
      Operator:  DoesNotExist
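Applying the suggested label to the ModelMesh workload would look roughly like the following. This is a sketch under assumptions: the Deployment name, selector labels, and image are placeholders, not taken from the thread; only the istio-prometheus-ignore label itself comes from the comment above.

```yaml
# Hypothetical sketch: label the modelmesh pod template so the PodMonitor's
# "istio-prometheus-ignore DoesNotExist" selector no longer matches it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: modelmesh-serving             # placeholder name, not from the thread
spec:
  selector:
    matchLabels:
      app: modelmesh                  # placeholder selector
  template:
    metadata:
      labels:
        app: modelmesh                # placeholder selector label
        istio-prometheus-ignore: "true"   # the label suggested above
    spec:
      containers:
      - name: modelmesh
        image: example/modelmesh:latest   # placeholder image
```

Alternatively, `kubectl label pod <modelmesh-pod> istio-prometheus-ignore=true` applies the label to a running pod, though a label set on the pod template survives restarts.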
vaibhavjainwiz commented 9 months ago

Analysis: As part of setting up OMW (OpenShift Monitoring Workflow), a PodMonitor (istio-proxies-monitor) has been created which directly scrapes metrics from all pods in the KServe runtime namespace on the /stats/prometheus endpoint over plain HTTP.

ModelMesh already has a ServiceMonitor resource for its pods, which allows metric scraping through a secure port. istio-proxies-monitor should not monitor the ModelMesh pod.

Solution: istio-proxies-monitor (PodMonitor) and istiod-monitor (ServiceMonitor) are supposed to monitor Istio components, not KServe. I verified that equivalent Istio PodMonitor and ServiceMonitor resources are already created in the istio-system namespace, so I think we could safely remove istio-proxies-monitor (PodMonitor) and istiod-monitor (ServiceMonitor) from the KServe namespace.

(Attached diagram: model-serving-Page-16.drawio)

Jooho commented 9 months ago

> So I think we could safely remove the istio-proxies-monitor (PodMonitor) and istiod-monitor (ServiceMonitor) from the KServe namespace.

Do you know who created these two objects?

vaibhavjainwiz commented 9 months ago

> So I think we could safely remove the istio-proxies-monitor (PodMonitor) and istiod-monitor (ServiceMonitor) from the KServe namespace.
>
> Do you know who created these two objects?

Today I did some more research around this and found the article below. According to point 7.1, these resources are intentionally added there, so we should not remove istio-proxies-monitor (PodMonitor) and istiod-monitor (ServiceMonitor) from the KServe namespace. https://docs.openshift.com/container-platform/4.14/service_mesh/v2x/ossm-observability.html#ossm-integrating-with-user-workload-monitoring_observability

vaibhavjainwiz commented 9 months ago

After discussing with @skonto @bartoszmajsak, we came to the conclusion that we need to add an extra label to the istio-proxies-monitor PodMonitor to skip ModelMesh pod monitoring.
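The agreed direction could be sketched as a change to the PodMonitor's selector. This is an assumption-laden sketch: the exact label key used to exclude ModelMesh pods is not stated in the thread, so `modelmesh-service` is used here only as an illustrative key, and the endpoint fields are copied from the `kubectl describe` output earlier in the thread.

```yaml
# Hypothetical sketch of the proposed change: extend the selector so pods
# carrying a ModelMesh label are excluded from scraping.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: istio-proxies-monitor
  namespace: kserve-demo            # namespace from the earlier describe output
spec:
  podMetricsEndpoints:
  - path: /stats/prometheus
    interval: 30s
  selector:
    matchExpressions:
    - key: istio-prometheus-ignore
      operator: DoesNotExist
    - key: modelmesh-service        # illustrative key for ModelMesh pods, not confirmed in the thread
      operator: DoesNotExist
```

With both matchExpressions in place, a pod is scraped only if it carries neither label, so ModelMesh pods would be skipped without touching the Istio sidecar pods the monitor is meant for.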

Discussion thread : https://redhat-internal.slack.com/archives/C065ARTVA80/p1702293019814919?thread_ts=1701693652.733169&cid=C065ARTVA80