opendatahub-io / opendatahub-community


Rethink Metrics monitoring Stack in ODH #115

Open shalberd opened 1 year ago

shalberd commented 1 year ago

Is your feature request related to a problem? Please describe. Since OpenShift 4.6, I believe, there has been a feature in both OKD and OCP called "Monitoring for User Defined Projects". Enabling it on a cluster leads to all non-kube and non-openshift namespaces being monitored by a separate Prometheus in the openshift-user-workload-monitoring namespace. At the same time, application metrics time series from ServiceMonitors and PodMonitors, as well as kube-state-metrics container, pod, and PVC metrics, are available per namespace, nicely separated and each with its own RBAC.

The only metrics that cannot be retrieved this way are node-exporter node-level metrics.

The thing is, Red Hat does not recommend mixing your own Prometheus operator deployments (which we did on OCP 3.11 and pre-OCP-4.6 in the past) with Monitoring for User Defined Projects:

"In OpenShift Container Platform 4.10 you must remove any custom Prometheus instances before enabling monitoring for user-defined projects".

https://docs.openshift.com/container-platform/4.10/monitoring/enabling-monitoring-for-user-defined-projects.html
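For reference, enabling it is a one-line change in the cluster-monitoring-config ConfigMap, roughly like this (sketch based on the linked docs; on a given cluster the ConfigMap may already exist with other settings):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # enables the separate Prometheus in openshift-user-workload-monitoring
    enableUserWorkload: true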

Describe the solution you'd like

[Screenshot: 2023-03-20 15:10]

Q: Could you make the federation ServiceMonitor, the Prometheus instance, and the Prometheus operator optional via an overlay in model-mesh? That way, all the metrics gathering would still be in there, while users who have Monitoring for User Defined Projects enabled could skip the Prometheus part and the cluster-metrics federation part.

With OpenShift Monitoring for User Defined Projects, the bringing-in / federation of cluster-level metrics from the kube-state-metrics exporter (pod container restarts, OOM, all that stuff) happens automatically at the namespace level. The only thing not accessible is node-level (node-exporter) metrics. Meaning I get things such as kube_pod_container_status_restarts_total without an explicit federation ServiceMonitor.

The ClusterRoles that can be used in namespace-level RoleBindings are described here:

https://docs.openshift.com/container-platform/4.10/monitoring/enabling-monitoring-for-user-defined-projects.html#granting-users-permission-to-monitor-user-defined-projects_enabling-monitoring-for-user-defined-projects
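As a sketch, granting one of those ClusterRoles for a single namespace is just a namespaced RoleBinding (user name and namespace below are placeholders):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: monitoring-rules-view
  namespace: odh                   # target project
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: example-user               # placeholder user
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: monitoring-rules-view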

An Observe section is also available in the web console for all users who have at least the view ClusterRole on a project, as well as one of the ClusterRoles mentioned in the link above.

See screenshots of the per-namespace query window and alerts window here: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.9/html/building_applications/odc-monitoring-project-and-application-metrics-using-developer-perspective

By the way, the metrics from the odh-model-controller ServiceMonitor work with Monitoring for User Defined Projects, too.

That is, the custom monitoring implementation for model mesh could be removed from odh-core, since the same result is achieved with Monitoring for User Defined Projects.


VannTen commented 1 year ago

Q: Could you make the federation ServiceMonitor, the Prometheus instance, and the Prometheus operator optional via an overlay in model-mesh? That way, all the metrics gathering would still be in there, while users who have Monitoring for User Defined Projects enabled could skip the Prometheus part and the cluster-metrics federation part.

I agree with the general idea. In particular, overlays would indeed be a fine choice to make the metrics stack more granular and usable in different contexts.

Some notes:

This would leave two overlays: ServiceMonitor and Prometheus.

(Additionally, a third overlay could patch resources to use Prometheus directly, to support environments not using the prometheus-operator. That's in an ideal world though.)
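To illustrate (purely hypothetical file names, not the actual odh-manifests layout), the Prometheus overlay would then just layer the Prometheus and federation resources on top of a ServiceMonitor base:

# overlays/prometheus/kustomization.yaml (hypothetical)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base                       # ServiceMonitors only
- prometheus.yaml                  # Prometheus CR consuming them
- federation-servicemonitor.yaml   # /federate scrape of cluster monitoring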

anishasthana commented 1 year ago

I'm in agreement on ditching the prometheus-operator completely. @VedantMahabaleshwarkar fyi^^

shalberd commented 1 year ago

Container and PVC metrics

i.e. what is currently handled by the /federate endpoint:

          - '{__name__= "container_cpu_cfs_throttled_seconds_total"}'
          - '{__name__= "container_cpu_usage_seconds_total"}'
          - '{__name__= "container_memory_working_set_bytes"}'
          - '{__name__= "container_memory_rss"}'
          - '{__name__= "kubelet_volume_stats_used_bytes"}'
          - '{__name__= "kubelet_volume_stats_capacity_bytes"}'
          - '{__name__= "kube_pod_container_status_restarts_total"}'
          - '{__name__= "kube_pod_container_status_terminated_reason"}'
          - '{__name__= "kube_pod_container_status_waiting_reason"}'
          - '{__name__= "kube_pod_container_resource_limits"}'
          - '{__name__= "kube_pod_container_resource_requests"}'
          - '{__name__= "kube_pod_container_resource_requests_cpu_cores"}'
          - '{__name__= "kube_pod_container_resource_limits_cpu_cores"}'
          - '{__name__= "kube_pod_container_resource_requests_memory_bytes"}'
          - '{__name__= "kube_pod_container_resource_limits_memory_bytes"}'
          - '{__name__= "kube_pod_status_phase"}'
          - '{__name__= "kube_pod_labels"}'

kube-state-metrics (container restarts, container out-of-memory, all that stuff) and kubelet metrics (PVC storage usage and so on) are available via Monitoring for User Defined Projects / User Workload Monitoring out of the box.

Thanos Querier exposes port 9092 for user-defined projects; with a namespace query parameter in the URL, it allows access with a Bearer token from an application service account in the odh namespace.

e.g. the value of

oc serviceaccounts get-token odh-dashboard -n odh

the output of which, prefixed with Bearer (e.g. Bearer dkdfskvskckskdkskcke392makaaks), can be set as the Authorization header in a request to

https://thanos-querier.openshift-monitoring.svc.cluster.local:9092?namespace=odh

That endpoint gets you all the metrics listed above.

Note: Port 9091, in contrast, is the cluster-wide way to access metrics for all namespaces, but that would need far more privileges.

We only get access to kube-state-metrics (container restarts and so on) and kubelet metrics (PVC usage) from our odh namespace that way. But, elegantly, we do not need any cluster-level privileges.

Router Metrics

For router metrics with the labels job/service="router-internal-default" and exported_namespace="odh", additional steps are necessary: https://access.redhat.com/solutions/4998261

OpenShift router metrics, e.g. haproxy_backend_http_responses_total, should be available via a ServiceMonitor with basic auth pointing to the service https://router-internal-default.openshift-ingress.svc.cluster.local:1936/metrics in openshift-ingress. statsUsername and statsPassword are in the secret router-stats-default in the namespace openshift-ingress. You could let users pass these via KfDef parameters; I believe Vaishnavi Hire once worked on decoding secret values or something like that, in case putting those parameters in verbatim is not an option.

Also, a NetworkPolicy should allow traffic on port 1936 between the odh namespace and openshift-ingress if deny-by-default is enabled. That should be covered by the allow-from-openshift-ingress NetworkPolicy in the odh namespace. https://docs.openshift.com/container-platform/4.10/networking/network_policy/about-network-policy.html#nw-networkpolicy-about_about-network-policy
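For reference, that policy from the linked docs looks roughly like this (applied in the odh namespace):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-openshift-ingress
spec:
  podSelector: {}                  # all pods in the namespace
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          network.openshift.io/policy-group: ingress
  policyTypes:
  - Ingress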

That /federate endpoint, including all the roles and rolebindings from openshift-monitoring, should not be necessary, and neither should a dedicated operator in the odh namespace.

If you want to get cluster-level metrics (from namespace openshift-ingress) related to routes and, optionally, node memory and CPU utilization, you really do not need Prometheus federation of metrics from central cluster monitoring into your odh namespace. Just use what is possible with OpenShift Monitoring for User Defined Projects / User Workload Monitoring, plus access to the OpenShift router metrics via a simple secret as described in https://access.redhat.com/solutions/4998261.

What do you think? I think the big plus is that no additional Prometheus instances are needed and no cluster-level privileges to openshift-monitoring for custom service accounts. The current approach, with a dedicated Prometheus operator ingesting the /federate endpoint, is not recommended for OpenShift > 4.5, and especially not for the very likely scenario where both cluster monitoring and Monitoring for User Defined Projects are enabled.

VannTen commented 1 year ago

I'm fine with that kind of solution.

BUT

we can't depend exclusively on User Workload Monitoring, because when handling ODH as a managed service (in the case of RHODS), the party managing ODH and the party managing User Workload Monitoring are not the same. And other workloads can cause User Workload Monitoring instability/unavailability (by cardinality explosion of the metrics produced by one of the user workloads, for example).

So what we could do is use the approach from opendatahub-io/odh-manifests#792 to have both approaches (add an overlay which enables user-workload monitoring to the three I've put so far); we would have something like this:

graph LR
   S[servicemonitors] --->|needed for|U[user-workload monitoring] & P[Prometheuses]
   P -->|needed for|O[prometheus-operator]

anishasthana commented 1 year ago

This approach makes sense! In the diagram above, prometheus-operator refers to a Prometheus deployed by ODH, right?

VannTen commented 1 year ago

Prometheuses = one or more Prometheus instances deployed by a prometheus-operator (not necessarily the one deployed by ODH)
prometheus-operator = the prometheus-operator deployed by ODH

=> Which means we can either rely on an existing prometheus-operator or deploy our own. (Not sure we should, but I'm currently just translating the current state of things into a more modular setup with overlays.)

shalberd commented 1 year ago

@andrewballantyne

I can confirm that the haproxy / router / ingress type metrics for ODH dashboard model inference, basically anything with haproxy* in front of it, can be retrieved from the router-internal-default service as described in https://access.redhat.com/solutions/4998261. Do you have access on RHODS to the secret router-stats-default in namespace openshift-ingress? Here are my results:

oc get secret router-stats-default -n openshift-ingress -o go-template="{{.data.statsUsername}}" | base64 -d
Output: blablablauser

oc get secret router-stats-default -n openshift-ingress -o go-template="{{.data.statsPassword}}" | base64 -d
Output: blablablapassword

Tested connectivity from the odh namespace to the service in openshift-ingress with a puzzle/ose3-diag container and nmap:

sh-4.2$ nmap -P0 -p 1936 router-internal-default.openshift-ingress.svc.cluster.local

Starting Nmap 6.40 ( http://nmap.org ) at 2023-04-18 10:39 UTC
Nmap scan report for router-internal-default.openshift-ingress.svc.cluster.local (172.30.163.197)
Host is up (0.00059s latency).
PORT     STATE SERVICE
1936/tcp open  unknown

Called the metrics endpoint. Important difference: in contrast to central cluster monitoring, which exposes the route namespace as "exported_namespace", it returns the actual "namespace" label.

sh-4.2$ curl -kv -u blablauser:blablapassword https://router-internal-default.openshift-ingress.svc.cluster.local:1936/metrics
* About to connect() to router-internal-default.openshift-ingress.svc.cluster.local port 1936 (#0)
*   Trying 172.30.163.197...
* Connected to router-internal-default.openshift-ingress.svc.cluster.local (172.30.163.197) port 1936 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* NSS: client certificate not found (nickname not specified)
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
*       subject: CN=router-internal-default.openshift-ingress.svc
*       start date: Apr 18 06:43:30 2023 GMT
*       expire date: Apr 17 06:43:31 2025 GMT
*       common name: router-internal-default.openshift-ingress.svc
*       issuer: CN=openshift-service-serving-signer@1681800175
* Server auth using Basic with user 'dXNlcmZidHp0'
> GET /metrics HTTP/1.1
> Authorization: Basic ZFhObGNtWmlkSHAwOmNHRnpjMlE1YW5keQ==
> User-Agent: curl/7.29.0
> Host: router-internal-default.openshift-ingress.svc.cluster.local:1936
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: text/plain; version=0.0.4; charset=utf-8
< Date: Tue, 18 Apr 2023 10:40:54 GMT
< Transfer-Encoding: chunked
< 
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
...
# HELP haproxy_backend_http_responses_total Total of HTTP responses.
# TYPE haproxy_backend_http_responses_total gauge
haproxy_backend_http_responses_total{backend="https",code="1xx",namespace="openshift-console",route="console"} 36
haproxy_backend_http_responses_total{backend="https",code="1xx",namespace="openshift-monitoring",route="alertmanager-main"} 0
haproxy_backend_http_responses_total{backend="https",code="1xx",namespace="openshift-monitoring",route="grafana"} 0
haproxy_backend_http_responses_total{backend="https",code="1xx",namespace="openshift-monitoring",route="prometheus-k8s"} 0
haproxy_backend_http_responses_total{backend="https",code="1xx",namespace="openshift-monitoring",route="thanos-querier"} 0
haproxy_backend_http_responses_total{backend="https",code="2xx",namespace="openshift-console",route="console"} 627
haproxy_backend_http_responses_total{backend="https",code="2xx",namespace="openshift-monitoring",route="alertmanager-main"} 0
haproxy_backend_http_responses_total{backend="https",code="2xx",namespace="openshift-monitoring",route="grafana"} 0
haproxy_backend_http_responses_total{backend="https",code="2xx",namespace="openshift-monitoring",route="prometheus-k8s"} 0
haproxy_backend_http_responses_total{backend="https",code="2xx",namespace="openshift-monitoring",route="thanos-querier"} 0
haproxy_backend_http_responses_total{backend="https",code="3xx",namespace="openshift-console",route="console"} 81
haproxy_backend_http_responses_total{backend="https",code="3xx",namespace="openshift-monitoring",route="alertmanager-main"} 0
haproxy_backend_http_responses_total{backend="https",code="3xx",namespace="openshift-monitoring",route="grafana"} 0
haproxy_backend_http_responses_total{backend="https",code="3xx",namespace="openshift-monitoring",route="prometheus-k8s"} 0
haproxy_backend_http_responses_total{backend="https",code="3xx",namespace="openshift-monitoring",route="thanos-querier"} 0
haproxy_backend_http_responses_total{backend="https",code="4xx",namespace="openshift-console",route="console"} 36
haproxy_backend_http_responses_total{backend="https",code="4xx",namespace="openshift-monitoring",route="alertmanager-main"} 0
haproxy_backend_http_responses_total{backend="https",code="4xx",namespace="openshift-monitoring",route="grafana"} 0
haproxy_backend_http_responses_total{backend="https",code="4xx",namespace="openshift-monitoring",route="prometheus-k8s"} 0
haproxy_backend_http_responses_total{backend="https",code="4xx",namespace="openshift-monitoring",route="thanos-querier"} 0
haproxy_backend_http_responses_total{backend="https",code="5xx",namespace="openshift-console",route="console"} 0
haproxy_backend_http_responses_total{backend="https",code="5xx",namespace="openshift-monitoring",route="alertmanager-main"} 0
haproxy_backend_http_responses_total{backend="https",code="5xx",namespace="openshift-monitoring",route="grafana"} 0
haproxy_backend_http_responses_total{backend="https",code="5xx",namespace="openshift-monitoring",route="prometheus-k8s"} 0
haproxy_backend_http_responses_total{backend="https",code="5xx",namespace="openshift-monitoring",route="thanos-querier"} 0
haproxy_backend_http_responses_total{backend="https",code="other",namespace="openshift-console",route="console"} 0
haproxy_backend_http_responses_total{backend="https",code="other",namespace="openshift-monitoring",route="alertmanager-main"} 0
....

The route and code labels are there as usual, and, as mentioned, there is a namespace label to filter on instead of "exported_namespace".

Let me see if I can get this into the Developer / namespace-specific / tenancy Observe view as well via a ServiceMonitor. That way, it'd be available via Thanos Querier in openshift-monitoring on port 9092 (tenancy / namespace-specific), which would simplify things for the ODH dashboard folks.

Here is what I did, always assuming Monitoring for User Defined Projects is enabled. This could apply to any model serving namespace where route / haproxy metrics are needed:

Created a Secret in the opendatahub namespace with the corresponding values from the openshift-ingress secret router-stats-default, plus a ServiceMonitor referencing it:

apiVersion: v1
kind: Secret
metadata:
  name: router-stats-default
data:
  password: <statsPassword from secret router-stats-default in openshift-ingress, base64-encoded>
  user: <statsUsername from secret router-stats-default in openshift-ingress, base64-encoded>
type: Opaque
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: haproxy-router-stats
spec:
  endpoints:
  - basicAuth:
      password:
        name: router-stats-default
        key: password
      username:
        name: router-stats-default
        key: user
    port: metrics
  namespaceSelector:
    matchNames:
    - openshift-ingress
  selector:
    matchLabels:
      ingresscontroller.operator.openshift.io/owning-ingresscontroller: default

Result:

haproxy metrics are available via Thanos Querier port 9092 / Monitoring for User Defined Projects, checked in the Developer pane / Observe. Note the label is now "exported_namespace" for the route namespace. Big disadvantage: they only show up in the namespace openshift-ingress, NOT in user-level namespaces, so this is not feasible @andrewballantyne. But odh-dashboard code could programmatically do what I did with curl and the username:password above, if they absolutely want haproxy metrics in their dashboard even when not using the federation / prometheus-operator method.

[Screenshot: 2023-04-18 13:51]

@VannTen

Problem with Monitoring for User Defined Projects when trying to add the metrics via a ServiceMonitor per user namespace:

A ServiceMonitor resource in a user-defined namespace can only discover services in the same namespace. That is, the namespaceSelector field of the ServiceMonitor resource is always ignored.

[Screenshot: 2023-04-18 20:13]

So we have to scrap the idea of getting to the service router-internal-default.openshift-ingress.svc.cluster.local:1936/metrics in openshift-ingress by means of Monitoring for User Defined Projects alone, which would have been the most important way to get to those haproxy* metrics without cluster-level access / ClusterRoles / federation.

For us at the company, we always accessed those haproxy* metrics via central cluster monitoring in openshift-monitoring when needed, not in user namespaces, so it was never a problem. A tough nut to crack without federation, if one wants those haproxy metrics inside a user-level namespace :-)

shalberd commented 1 year ago

@VannTen currently, ODH Dashboard already relies on User Workload Monitoring for getting kubelet / PVC usage metrics from port 9092 of the Thanos Querier service in openshift-monitoring, which is namespace-specific. Though as long as Monitoring for User Defined Projects is not enabled, no application metrics land there, only kubelet and kube-state-metrics.

shalberd commented 1 year ago

@anishasthana @VannTen @VedantMahabaleshwarkar

I am currently, on and off, trying out some things with the built-in monitoring of OpenShift (Monitoring for User Defined Projects), something that odh-dashboard is already partly making use of by connecting to

https://thanos-querier.openshift-monitoring.svc.cluster.local:9092?namespace=odh for persistent volume claim / kubelet metrics.

I have not installed the Prometheus operator, only odh-operator, model-mesh, and modelmesh-monitoring:

  - kustomizeConfig:
      parameters:
      - name: monitoring-namespace
        value: opendatahub
      overlays:
        - odh-model-controller
      repoRef:
        name: manifests
        path: model-mesh
    name: model-mesh
  - kustomizeConfig:
      parameters:
      - name: deployment-namespace
        value: opendatahub
      repoRef:
        name: manifests
        path: modelmesh-monitoring
    name: modelmesh-monitoring

One thing I already noticed in the prometheus-operator container logs in namespace openshift-user-workload-monitoring with regard to the ServiceMonitors:

component=prometheusoperator msg="skipping servicemonitor" error="it accesses file system via bearer token file which Prometheus specification prohibits" servicemonitor=odhsven/odh-model-controller-metrics-monitor
component=prometheusoperator msg="skipping servicemonitor" error="it accesses file system via bearer token file which Prometheus specification prohibits" servicemonitor=odhsven/modelmesh-federated-metrics

Could you do without bearerTokenFile in the ServiceMonitors, especially with priority for

https://github.com/opendatahub-io/odh-manifests/blob/master/model-mesh/overlays/odh-model-controller/prometheus/monitor.yaml

and, less importantly, in the case of Monitoring for User Defined Projects:

https://github.com/opendatahub-io/odh-manifests/blob/master/modelmesh-monitoring/base/servicemonitors/modelmesh-federated-metrics.yaml

That federated ServiceMonitor should be made optional in an overlay together with the custom Prometheus, because it heavily depends on a very specific implementation.

With regard to the structure, I'd even go so far as to have one overlay for "monitoring for user defined projects" and one for "own custom Prometheus including federation of metrics from central Prometheus".
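To illustrate the bearerTokenFile point: a rough sketch of a UWM-compatible endpoint definition (port name, labels, and secret name are placeholders, not the real values from monitor.yaml) would reference the token from a Secret instead of a file:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: odh-model-controller-metrics-monitor
spec:
  endpoints:
  - port: metrics                  # placeholder port name
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
    # no bearerTokenFile; UWM's prometheus-operator rejects file access,
    # but a bearerTokenSecret reference is accepted:
    bearerTokenSecret:
      name: metrics-reader-token   # placeholder secret
      key: token
  selector:
    matchLabels:
      app: odh-model-controller    # placeholder label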

VedantMahabaleshwarkar commented 1 year ago

I'm all for using User Workload Monitoring wherever possible. It does indeed seem like the most elegant solution for a "unified monitoring stack". User Workload Monitoring will work for managed solutions, but for self-managed environments, if we are considering declaring an OLM dependency on the community Prometheus operator, that might cause some issues. The last time I looked at the community operator, it did not support multi-namespace installation. Do you have a workaround for that? @VannTen

shalberd commented 1 year ago

I'm all for using User Workload Monitoring wherever possible. It does indeed seem like the most elegant solution for a "unified monitoring stack"

Yes, it really is. The only thing I cannot get to that way at the namespace level is the haproxy / route metrics; those are only served at the global level (Thanos Querier port 9091), not at the namespace level (Thanos Querier port 9092 with the namespace URL argument).

anishasthana commented 1 year ago

UWM will not work for managed solutions. SRE previously rejected it.

shalberd commented 1 year ago

@anishasthana @andrewballantyne UWM not being possible for managed solutions makes sense; you speak of RHODS, I assume. @VedantMahabaleshwarkar mixed up managed and self-managed, but the thinking itself was right.

I think splitting up metrics gathering (engine plus servicemonitors) into two overlays would be best.

One overlay, rhods, that has the YAML-based Prometheus operator in it, including all needed rolebindings, the federation, plus the needed ServiceMonitors.

One overlay, user-workload-monitoring, that has ServiceMonitors only, without the /federate-based ServiceMonitor, and written so that they work with User Workload Monitoring (no bearerTokenFile in the YAML).

If we think this through further, the ODH dashboard metrics pages and logic would also need to be split between rhods and user-workload-monitoring, especially with regard to the haproxy* metrics and, in general, the URLs used to get to the metrics. I could assist there.
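As a hypothetical sketch of the two entry points (file names invented for illustration only):

# overlays/rhods/kustomization.yaml (hypothetical)
resources:
- ../../base/servicemonitors
- prometheus-operator.yaml
- prometheus.yaml
- federation-servicemonitor.yaml

# overlays/user-workload-monitoring/kustomization.yaml (hypothetical)
resources:
- ../../base/servicemonitors       # no federation, no bearerTokenFile, no own Prometheus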

shalberd commented 1 year ago

on ditching the prometheus operator completely

Apparently, that is not as easy as it sounds on managed clusters. Though to me, using the YAML-based (not OLM-based) prometheus-operator deployment is a huge hack on any OpenShift. Yes, it prevents potential cardinality explosion in the monitoring stack (openshift-monitoring and openshift-user-workload-monitoring), but it also adds a component that, like the two standard cluster prometheus-operators, listens to ServiceMonitors etc., creating redundancy, and it is not recommended by Red Hat.

VannTen commented 1 year ago

Though to me, using the yaml-based (not OLM-based) prometheus operator deployment is a huge hack on any Openshift

I agree, and in fact I'm currently fixing a problem caused by exactly that (two operators picking up the same ServiceMonitors). Unfortunately, I'm not sure how to do it differently. The OLM-based prometheus-operator available in the community marketplace only supports the OwnNamespace and SingleNamespace installModes (see https://github.com/redhat-openshift-ecosystem/community-operators-prod/blob/d1f6fe6f8246733759b3e806a32f669b8d5920f4/operators/prometheus/0.56.3/manifests/prometheusoperator.0.56.3.clusterserviceversion.yaml#L459-L467).
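(The installModes stanza in that CSV reads roughly like this:)

installModes:
- supported: true
  type: OwnNamespace
- supported: true
  type: SingleNamespace
- supported: false
  type: MultiNamespace
- supported: false
  type: AllNamespaces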

AFAIU, this means it would only work in one namespace, and also that it can't be an OLM dependency of a MultiNamespace or AllNamespaces installation of another OLM bundle (apparently that's not supported: https://github.com/operator-framework/operator-lifecycle-manager/issues/1777).

One possibility would be to contribute a PR to that prometheus-operator bundle to make it support AllNamespaces (I think that's what is used in our CSV).

shalberd commented 1 year ago

@VannTen @andrewballantyne @anishasthana Don't go with an additional Prometheus operator approach. Instead, use User Workload Monitoring / Monitoring for User Defined Projects. You can get application metrics, PVC metrics, and pod metrics with that approach, i.e. Monitoring for User Defined Projects data is served via Thanos Querier port 9092.

https://github.com/opendatahub-io/opendatahub-community/issues/115

For route access metrics (haproxy*), give admins the opportunity to put a secret with the info from openshift-ingress into the central opendatahub namespace where the dashboard runs, so that the dashboard can access router metrics filtered by namespace. You could get the content of the needed secret in via KfDef vars.
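For example, a hypothetical KfDef snippet (the router-stats-* parameter names do not exist in the manifests today; they are only to illustrate the idea):

  - kustomizeConfig:
      parameters:
      - name: monitoring-namespace
        value: opendatahub
      # hypothetical parameters for the router stats credentials:
      - name: router-stats-username
        value: <statsUsername from secret router-stats-default in openshift-ingress>
      - name: router-stats-password
        value: <statsPassword from secret router-stats-default in openshift-ingress>
      overlays:
        - odh-model-controller
      repoRef:
        name: manifests
        path: model-mesh
    name: model-mesh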

https://github.com/opendatahub-io/opendatahub-community/issues/115