slopezz opened this issue 4 years ago
Tagging @david-martin to get an initial opinion
- Either publish the operator on a current `OperatorSource` like certified-operators, redhat-operators or community-operators (so the operator can be used by anybody)
- Or maybe just provide in the repository an alternative installation method with a working `OperatorSource` resource that can be easily deployed on an OpenShift cluster, and then just create a `Subscription` object to deploy the operator on a given namespace, channel, version...
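A `Subscription` like the following is roughly what we have in mind (a sketch only; the package name, channel and catalog source are assumptions, since nothing is published yet):

```yaml
# Hypothetical Subscription for application-monitoring-operator.
# Package/channel/source names are assumptions for illustration only.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: application-monitoring-operator
  namespace: application-monitoring
spec:
  name: application-monitoring-operator   # package name in the catalog
  channel: alpha                          # assumed channel
  source: integreatly-operators           # assumed catalog source name
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic
```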
There are a couple of things happening around Integreatly and monitoring in OSD4 that mean I can't give a clear indication of what I think is the best way forward yet. Namely:
> But we have seen that only the `integreatly` operator is available, so we guess `application-monitoring-operator` might be private.
I'm not sure what's happening here, or why some element of it would be private. Perhaps @pb82 or @matskiv you may have more insight into the mechanics of OLM and how integreatly pulls in the various product operators from quay?
@david-martin integreatly-operator doesn't pull operators from Quay. We have a manifests folder which contains operator packages. These packages are then baked into integreatly-operator image and during installation they are put into a ConfigMap. This ConfigMap is referenced by CatalogSource CR, which makes OLM aware of this package and enables us to install it via Subscription.
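To illustrate the mechanism described above (a sketch only; the actual names used by integreatly-operator may differ), a ConfigMap-backed `CatalogSource` looks roughly like this:

```yaml
# Sketch of a ConfigMap-backed CatalogSource, as consumed by OLM.
# Names are illustrative, not the actual integreatly-operator values.
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: integreatly-manifests
  namespace: openshift-operators
spec:
  sourceType: configmap
  configMap: integreatly-manifests   # ConfigMap holding the CRDs/CSVs/packages
  displayName: Integreatly Operators
  publisher: integreatly
```

OLM then resolves a `Subscription` against this catalog and installs the referenced package.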
@matskiv Do you think you could publish the operator on one of the 3 default OperatorSources, or maybe provide the setup needed to install it using OLM (so it is not installed through the current Makefile)?
@slopezz AFAIK updating one of the 3 default registries would require some manual work for each release. But I think we can automate publishing to our own application repo, e.g. a TravisCI job could be triggered on a new GH release/tag. @david-martin wdyt?
@david-martin what do you think about publishing AMO (at least on integreatly OperatorSource)?
@slopezz At the moment, it's looking likely we'll drop AMO from the integreatly-operator (in OpenShift 4) and put AMO into maintenance mode on the v0 branch for existing Integreatly/RHMI 1.x clusters on OpenShift 3. I'll explain the thinking behind this further below.
As such, it's unlikely we'd take on the publishing of AMO to an OperatorSource. However, you are more than welcome to do this yourself if you like (can add you as a contributor no problem)
The rationale for this is a number of things:
So, right now, our intent in the shorter term is to look into solving some of the above problems in a more efficient and Operator/OLM-friendly way that meets the needs of Integreatly on OpenShift 4; this is likely to take the form of changes in the integreatly-operator rather than in AMO.
@david-martin It makes sense; we started using AMO because we thought it was the way to go for application monitoring, following the RHMI strategy.
Right now we are using it mainly for dev purposes, and have found it very useful: the engineering team can easily start playing with Grafana and Prometheus in order to build 3scale dashboards and alerts. But we understand your current concerns about it, and we definitely think that all Red Hat products (not only Integration ones) should use the same monitoring stack, which can lead to the standardization of application monitoring.
We have done a quick test of the current user workload monitoring state (Tech Preview). We don't think it is ready to be used in production at the moment, but we think that with a few improvements it can be the winning monitoring stack, at least for the Prometheus part (not Grafana, which is out of scope).
Below I add our initial user workload monitoring test, so you can maybe benefit from it in case you haven't checked it yet.
At the end of the test we have added a few takeaways that we can discuss afterwards; maybe we can work together with people from the OpenShift monitoring team in order to provide real feedback to improve the product.
We have done a quick test on how user workload monitoring works, in order to check the current features and its viability for our usage with the 3scale product (both on-prem and SaaS). We have used the latest OCP 4.4.0-rc.4.
The architecture can be checked at https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/user-workload-monitoring.md:

1. There are two Prometheus instances (each one scraping its own ServiceMonitors...):
   - Cluster Monitoring: scraping kube-state-metrics, kubelet, etcd...
   - User Monitoring: scraping application metrics
2. There is a single `AlertManager`, receiving alerts from both Prometheus instances.
3. Then there is a `Thanos` instance, which takes data from the two previous Prometheus time series databases, so it has both cluster data (resources, memory/CPU usage...) and application data.
4. Finally, any Prometheus data consumer (like the openshift-console metrics tab, grafana, kiali...) should take data from Thanos (the one having all the data).
Following the official docs, basically you need to create a new ConfigMap in the namespace `openshift-monitoring`:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    techPreviewUserWorkload:
      enabled: true
```
And immediately the following is created in the namespace `openshift-user-workload-monitoring`: a prometheus-operator deployment and a Prometheus object `user-workload`, finally creating a statefulset `prometheus-user-workload` with 2 pods.
```
$ oc get pods -n openshift-user-workload-monitoring
NAME                                 READY   STATUS    RESTARTS   AGE
prometheus-operator-55569f49-6sfnh   1/1     Running   0          5h34m
prometheus-user-workload-0           5/5     Running   1          5h34m
prometheus-user-workload-1           5/5     Running   1          5h34m
```
Then you just need to deploy any application with metrics in any namespace, and:

- create a `ServiceMonitor` object to scrape the app metrics (no need to add any label)
- create `PrometheusRule` objects with alerts (no need to add any label)

In our case, we just:
- deployed a sample Redis database (with its Prometheus exporter) in the `prometheus-exporter` namespace
- a `ServiceMonitor` is created monitoring that sample Redis database

Then, if you go to the OpenShift Console, inside the application namespace `prometheus-exporter`, and go to the Monitoring → Metrics tab, you can execute PromQL queries:
We have executed two different queries, among them `redis_up` (whose data is located in the user workload Prometheus). Both queries show data, because we are really executing the queries against the `Thanos` instance (the one with data from both Prometheus stacks).
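For reference, the `ServiceMonitor` used in this test could look roughly like the following (a sketch; the label and port names are hypothetical):

```yaml
# Sketch of a ServiceMonitor scraping the sample Redis exporter.
# Label/port names are hypothetical, not taken from the actual test.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: redis-exporter
  namespace: prometheus-exporter
spec:
  selector:
    matchLabels:
      app: redis-exporter        # must match the exporter Service labels
  endpoints:
    - port: metrics              # Service port exposing /metrics
      interval: 30s
```

Note that, as stated above, no extra label is needed on the object for the user workload Prometheus to pick it up.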
The documentation says that, as an administrator, you can go to the Prometheus UI (not the embedded Prometheus querier inside the OpenShift console) by clicking on:
And execute application PromQL queries there. But it does not work, because the cluster Prometheus doesn't have application data; application data is in the user-workload-monitoring Prometheus (or also in `Thanos`, which has everything):
In addition, the user workload monitoring Prometheus does not include a public Route to check current alerts (which I personally find useful), unlike the cluster Prometheus:
We have created a sample `RedisDown` PrometheusRule with fake content (`redis_up == 1`), so we can fire a fake alert precisely because Redis is actually up and running, with `redis_up` showing the value `1`.
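The deliberately inverted rule can be sketched like this (the `for`, `labels` and `annotations` fields are our own additions for illustration):

```yaml
# Sketch of the inverted RedisDown rule: it fires while Redis is UP
# (redis_up == 1), which lets us test alert delivery end to end.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: redis-down
  namespace: prometheus-exporter
spec:
  groups:
    - name: redis.rules
      rules:
        - alert: RedisDown
          expr: redis_up == 1     # inverted on purpose for testing
          for: 1m
          labels:
            severity: warning
          annotations:
            message: "Fake alert: fires while Redis is running"
```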
Then the documentation says that the alert should appear in the same OpenShift Console, inside the application namespace `prometheus-exporter`, under the Monitoring → Alerting tab.
But there, there is no active application alert (only 2 active alerts from the cluster Prometheus), and if we look for our specific sample alert `RedisDown` in the search box (including firing and not-firing alerts), no alerts are found:
So we can see that, unlike the Metrics tab, where queries go to `Thanos` (which has data from both Prometheus instances), in this case the OpenShift Console Monitoring → Alerting tab shows information only about cluster Prometheus alerts.
But if we go to the `AlertManager UI` by clicking on:
Here we can see both the cluster Prometheus alerts (2 alerts firing) and the application alert (1 fake alert firing):
So it seems that, for some reason, `AlertManager` alerts from user-workload-monitoring, although present, are hidden from the embedded Alerting tab in the OpenShift Console.
User workload monitoring does not include a Grafana instance (it is out of scope), and the current cluster Grafana is not managed by an operator; it is a static deployment with specific volumes mounting specific Kubernetes Grafana dashboards from ConfigMaps.
So if you want application dashboards you need your own Grafana instance (like the Integreatly grafana-operator, for example, with autodiscovery of dashboards using labels).
- `user workload monitoring`, although in Tech Preview (and with what seem to be a few non-working features at the moment), can be the answer to application monitoring standardization across all of Red Hat (no need to manage any extra Prometheus instance).
- The Console Metrics tab works OK (querying `Thanos`, which includes cluster and application data), but there is no public user-workload-monitoring Prometheus Route to check the user workload monitoring Prometheus console (with configuration, alerts, targets...), which, as an SRE, I find very useful.
- The Console Alerting tab only shows cluster Prometheus alerts (although, in fact, `AlertManager` has alerts from both Prometheus instances, as we could see directly in the `AlertManager UI`).
- A custom Grafana instance can use `Thanos` as datasource (the one with cluster and application data), so there is no need to use a secondary app-prometheus federating with cluster-prometheus.
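If the Integreatly grafana-operator were used for that last point, pointing it at Thanos could look roughly like this (a sketch; the `apiVersion`, resource names and the Thanos querier URL/port are assumptions on our side):

```yaml
# Sketch of a grafana-operator datasource targeting the Thanos querier.
# The apiVersion and the Thanos service URL/port are assumptions.
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDataSource
metadata:
  name: thanos
  namespace: application-monitoring
spec:
  name: thanos.yaml
  datasources:
    - name: Thanos
      type: prometheus
      access: proxy
      url: https://thanos-querier.openshift-monitoring.svc:9091
      isDefault: true
```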
Context
At the 3scale engineering team, we want to use application-monitoring-operator, so both RHMI and 3scale will use the same monitoring stack, which will help both teams follow the same direction, taking into account that 3scale is working on adding metrics, PrometheusRules and GrafanaDashboards for the next release.
- At the 3scale SRE/Ops team we are using OpenShift Hive to provision our on-demand dev OCP clusters (so engineers can easily do testing with metrics, dashboards...), and we are using the Hive `SyncSet` object to apply the same configuration to different OCP clusters (we define all resources once in a single YAML, and then we can apply the same config to any dev cluster by just adding new cluster names to the list in the `SyncSet` object).
- We have seen that the currently documented operator installation involves executing a Makefile target (with the Grafana/Prometheus versions), which executes a bash script that runs `oc apply` on different files, directories or URLs.
- We need an easy way to install the monitoring stack using a declarative language (no Makefile target executions), so it will be easy to maintain and to keep track of every change for every release on GitHub (GitOps philosophy).
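The kind of declarative installation we mean can be sketched with a Hive `SyncSet` (the cluster name and the embedded resource are illustrative only):

```yaml
# Sketch of a Hive SyncSet applying the same resources to several clusters.
# The cluster name and the embedded Namespace are illustrative.
apiVersion: hive.openshift.io/v1
kind: SyncSet
metadata:
  name: application-monitoring
  namespace: hive
spec:
  clusterDeploymentRefs:
    - name: dev-cluster-01   # add more cluster names to roll out everywhere
  resourceApplyMode: Sync
  resources:
    - apiVersion: v1
      kind: Namespace
      metadata:
        name: application-monitoring
```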
Current workaround
As a workaround, what we are doing now is to parse/extract all the resources deployed by scripts/install.sh and add them to a single `SyncSet` object (which has a specific spec format). But before creating the `SyncSet` object, because OpenShift Hive uses the k8s native APIs, some OpenShift apiVersions like `authorization.openshift.io/v1` are not accepted and need to be replaced by the k8s native alternative `rbac.authorization.k8s.io/v1` (see issues https://github.com/openshift/hive/issues/864 and https://issues.redhat.com/browse/CO-532), so we need to fix some resources in order to be fully compatible with Hive:

- We update, on some ClusterRole/ClusterRoleBinding resources, the apiVersion from the OpenShift `authorization.openshift.io/v1` to the k8s `rbac.authorization.k8s.io/v1` (plus some additions like adding `roleRef.kind` and `roleRef.apiGroup`); actually you are already using that k8s native apiVersion on other ClusterRole/ClusterRoleBinding objects (but not on all of them). Example:

```yaml
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: alertmanager-application-monitoring
subjects:
```
- We add the namespace name on each non-cluster-scoped object (like the operator deployment, service_account, role, role_binding, and the applicationmonitoring example); actually you are already hardcoding the namespace on some other objects like ClusterRoleBinding. Example:

```yaml
  namespace: application-monitoring
spec:
  labelSelector: "middleware"
  additionalScrapeConfigSecretName: "integreatly-additional-scrape-configs"
```
- Finally, we fix an error in the namespace name on the ClusterRoleBinding `grafana-proxy` resource, replacing `namespace: monitoring2` with `namespace: application-monitoring`.
We have checked in deploy/cluster-roles/README.md that you use the `Integr8ly installer` to install application-monitoring-operator (not the Makefile target), and that you don't use the YAMLs in deploy/cluster-roles/. In order to have full compatibility with k8s (and hence with OpenShift Hive), and to avoid requiring us to transform almost all the objects, we wonder if we can open a PR to fix those small issues while still being fully compatible with OpenShift:

- `grafana-proxy`
Possible improvement
To make the installation of `application-monitoring-operator` easier, using a fully declarative language without having to manage all those 25 YAMLs, we have seen that you are already using olm-catalog, so we wonder if you plan to:

- Either publish the operator on a current `OperatorSource` like certified-operators, redhat-operators or community-operators (so the operator can be used by anybody)
- Or provide a working `OperatorSource` resource that can be easily deployed on an OpenShift cluster, and then just create a `Subscription` object to deploy the operator on a given namespace, channel, version...

We have tried to deploy an `OperatorSource` using data from the Makefile (like `registryNamespace: integreatly`):
But we have seen that only the `integreatly` operator is available, so we guess `application-monitoring-operator` might be private.