opendatahub-io / architecture-decision-records

Collection of Architectural Decision Records
Apache License 2.0
16 stars 36 forks source link

ADR for mlserving UWM #13

Closed VedantMahabaleshwarkar closed 4 months ago

VedantMahabaleshwarkar commented 1 year ago

@etirelli @danielezonca @jgarciao @andrewballantyne @cfchase Can I get a review on this ADR?

strangiato commented 1 year ago

So a few thoughts...

ODH/RHODS should not be responsible for enabling end user workload monitoring on the cluster. The operator should not make changes to cluster level configurations, and it is the responsibility of the cluster admin to enable this feature. Enabling the end user workload monitoring feature without the cluster admin being made aware could lead to degraded performance in the cluster prometheus instance or worse.

The requirement for end user workload monitoring should be well documented, and we should leverage official documentation on how to enable the feature, similar to how we document the requirements around the GPU/NFD operators.

The dashboard should make it clear when this feature is not enabled and display a message recommending the cluster admin enable it to take advantage of the monitoring capabilities or setup a separate prometheus instance, which leads to....

ODH should still enable the use of any prometheus instance. End user workload monitoring is a great feature, but it is openshift specific. In order to keep the project at least possible to run on other k8s clusters or a number of architectural designs that may require us to not use end user workload monitoring we should still enable using another prometheus instance.

From a dashboard perspective I would expect this to mean that an admin should be able to configure a connection to a prometheus instance if they do not want to utilize end user workload monitoring. In this use case I would expect the end user to be 100% responsible for installing and managing the prometheus instance and they would be required to setup the connection data in ODH to allow the dashboard to read the data. In the case we are using end user workload monitoring this connection could be setup for users automatically.

One final thought is that as we are working on setting up Pod/Service Monitoring objects, these should not be created by the dashboard when we create a ModelMesh object. These objects should be managed by the ModelMesh controller instead and a child of the ModelMesh instance.

andrewballantyne commented 1 year ago

This was a great informative comment @strangiato!

Today, our ODH/RHODS stack expects it to be on an OpenShift instance -- if that was not the desired case, I think there are several architectural gaps in the way Dashboard reaches out to look for k8s resources that are CRDs from OpenShift.

Worth noting, the Dashboard creates nothing in the monitoring stack and I do not expect that to change. But you made a good point about the Dashboard should have configurations for setting up an endpoint for where to get metrics. Part of this effort, we should look to move to a system of configuration for Model Serving metrics. Having a backing OpenShift Route (name & namespace) -- or if there is a better mechanism -- to where to send requests for monitoring aspects.

If UWM is optional, we need to also consider what we are doing with the PVC "filled state" value calls we do on DS Projects.

VedantMahabaleshwarkar commented 1 year ago

ODH/RHODS should not be responsible for enabling end user workload monitoring on the cluster. The operator should not make changes to cluster level configurations, and it is the responsibility of the cluster admin to enable this feature. Enabling the end user workload monitoring feature without the cluster admin being made aware could lead to degraded performance in the cluster prometheus instance or worse. The requirement for end user workload monitoring should be well documented, and we should leverage official documentation on how to enable the feature, similar to how we document the requirements around the GPU/NFD operators. The dashboard should make it clear when this feature is not enabled and display a message recommending the cluster admin enable it to take advantage of the monitoring capabilities or setup a separate prometheus instance

@strangiato fully agree with your points. I had originally thought of the same concerns, hence the open questions section of the ADR. We need to finalize a plan to document the requirement for UWM in MLServing Metrics, have some sort of awareness for all types of users (possibly in the dashboard? similar to the warning we show for DSPO if the pipelines operator isn't installed). I also agree that it shouldn't be the rhods-admin solely looking at UWM but it should be in collaboration with the cluster admin.

ODH should still enable the use of any prometheus instance.

This is increasingly becoming a difficult use-case to support as managed openshift continues to move towards an approach of "No custom prometheuses please". Continuing to support this will require us to maintain multiple implementations for ODH and the downstream product, which needs further discussion if that's the route we want to take.

@andrewballantyne I think having a configurable endpoint setup for metrics would be the best case scenario, but I'm not sure how practically achievable that is in terms of efforts. For now, what do you think about having some sort of detection mechanism to see if UWM is enabled? If we detect that it isn't enabled we could show a simple error on the metrics page that can inform the user to talk to rhods-admin + cluster-admin to enable UWM.

ldimaggi commented 1 year ago

Will this support disconnected configurations?

jgarciao commented 1 year ago

Will this support disconnected configurations?

@VedantMahabaleshwarkar I think this should be a requirement for the proposed solution. I've proposed it as a requirement in the design document

I've also proposed a requirement about compatibility with RHODS component dependencies (OpenShift Service Mesh, OpenShift Pipelines, …)", because the UWM documentation says

and I don't know if RHODS dependencies deploy custom prometheus instances

github-actions[bot] commented 4 months ago

This PR is stale because it has been open 21 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 4 months ago

This PR was closed because it has been stale for 21+7 days with no activity.