integr8ly / application-monitoring-operator

Operator for installing the Application Monitoring Stack on OpenShift (Prometheus, AlertManager, Grafana)

Installation of application-monitoring-operator using full declarative language #128

Open slopezz opened 4 years ago

slopezz commented 4 years ago

Context

On the 3scale engineering team, we want to use application-monitoring-operator so that both RHMI and 3scale use the same monitoring stack, which will help both teams move in the same direction, especially since 3scale is working on adding metrics, PrometheusRules, and GrafanaDashboards for the next release.

On the 3scale SRE/Ops team we use OpenShift Hive to provision our on-demand dev OCP clusters (so engineers can easily test metrics, dashboards...), and we use the Hive SyncSet object to apply the same configuration to different OCP clusters: we define all resources once in a single YAML, and then we can apply the same config to any dev cluster by just adding the new cluster's name to the list in the SyncSet object.
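For context, the shape of such a SyncSet is roughly this (a minimal sketch; the cluster names, SyncSet namespace, and embedded resource are hypothetical):

apiVersion: hive.openshift.io/v1
kind: SyncSet
metadata:
  name: dev-monitoring-config
  namespace: hive
spec:
  # Every cluster listed here gets the same set of resources applied
  clusterDeploymentRefs:
    - name: dev-cluster-01
    - name: dev-cluster-02
  resourceApplyMode: Sync
  resources:
    - apiVersion: v1
      kind: Namespace
      metadata:
        name: application-monitoring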

We have seen that the currently documented operator installation involves executing a Makefile target (with Grafana/Prometheus versions), which runs a bash script that executes oc apply on different files, directories, and URLs.

We need an easy way to install the monitoring stack using declarative language (no Makefile target executions), so it is easy to maintain and to keep track of every change for every release on GitHub (GitOps philosophy).

Current workaround

As a workaround, what we are doing now is parsing/extracting all the resources deployed by scripts/install.sh and adding them to a single SyncSet object (which has a specific spec format). However, because OpenShift Hive uses the native Kubernetes APIs, it does not accept some OpenShift apiVersions; for example authorization.openshift.io/v1 needs to be replaced by the native Kubernetes alternative rbac.authorization.k8s.io/v1 (see https://github.com/openshift/hive/issues/864 and https://issues.redhat.com/browse/CO-532). So before creating the SyncSet object we need to fix some resources in order to be fully compatible with Hive:
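For example, the transformation looks like this (the ClusterRole below is a simplified hypothetical, not one of the operator's actual manifests):

# Before: OpenShift-specific API group, rejected by Hive's native Kubernetes client
apiVersion: authorization.openshift.io/v1
kind: ClusterRole
metadata:
  name: example-monitoring-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]

# After: Kubernetes-native equivalent, accepted by Hive
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-monitoring-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]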

We have checked in deploy/cluster-roles/README.md that you use the Integr8ly installer to install application-monitoring-operator (not the Makefile target), and that you don't use the YAMLs in deploy/cluster-roles/.

In order to be fully compatible with Kubernetes (and hence with OpenShift Hive), and to avoid having to transform almost every object ourselves, we wonder if we can open a PR to fix those small issues while remaining fully compatible with OpenShift.

Possible improvement

To make the installation of application-monitoring-operator with a fully declarative language easier, without having to manage all those 25 YAMLs, we have seen that you are already using olm-catalog, so we wonder if you plan to:
  • Either publish the operator on a current OperatorSource like certified-operators, redhat-operators or community-operators (so the operator can be used by anybody)
  • Or maybe just provide in the repository an alternative installation method with a working OperatorSource resource that can be easily deployed on an OpenShift cluster, and then just create a Subscription object to deploy the operator on a given namespace, channel, version... (see the Subscription sketch at the end of this comment)

We have tried to deploy an OperatorSource using data from the Makefile (like registryNamespace: integreatly):

apiVersion: operators.coreos.com/v1
kind: OperatorSource
metadata:
  name: integreatly-operators
  namespace: openshift-marketplace
spec:
  displayName: Integreatly operators
  endpoint: https://quay.io/cnr
  publisher: integreatly
  registryNamespace: integreatly
  type: appregistry

But we have seen that only the integreatly operator is available, so we guess application-monitoring-operator might be private.
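For reference, the second option would let us install the operator with nothing more than something like this (a sketch only; the channel, source, and namespace names are assumptions, since the package is not published yet):

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: application-monitoring-operator
  namespace: application-monitoring
spec:
  # Hypothetical channel and catalog source names
  channel: alpha
  name: application-monitoring-operator
  source: integreatly-operators
  sourceNamespace: openshift-marketplace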

ciaran-byrne commented 4 years ago

Tagging @david-martin to get an initial opinion

david-martin commented 4 years ago
> • Either publish the operator on a current OperatorSource like certified-operators, redhat-operators or community-operators (so the operator can be used by anybody)
> • Or maybe just provide in the repository an alternative installation method with a working OperatorSource resource that can be easily deployed on an OpenShift cluster, and then just create a Subscription object to deploy the operator on a given namespace, channel, version...

There are a couple of things happening around integreatly and monitoring in OSD4 that mean I can't give a clear indication of what I think is the best way forward yet. Namely:

> But we have seen that only the integreatly operator is available, so we guess application-monitoring-operator might be private.

I'm not sure what's happening here, or why some element of it would be private. Perhaps @pb82 or @matskiv you may have more insight into the mechanics of OLM and how integreatly pulls in various product operators from quay?

matskiv commented 4 years ago

@david-martin integreatly-operator doesn't pull operators from Quay. We have a manifests folder which contains operator packages. These packages are then baked into the integreatly-operator image, and during installation they are put into a ConfigMap. This ConfigMap is referenced by a CatalogSource CR, which makes OLM aware of the package and enables us to install it via a Subscription.
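In other words, the wiring looks roughly like this (a minimal sketch; the namespace and ConfigMap names are hypothetical):

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: integreatly-manifests
  namespace: integreatly-operator      # hypothetical; must match the ConfigMap's namespace
spec:
  sourceType: configmap
  # Hypothetical ConfigMap holding the packaged CRDs/CSVs/package manifests
  configMap: integreatly-operator-manifests
  displayName: Integreatly Manifests
  publisher: integreatly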

slopezz commented 4 years ago

@matskiv Do you think you could publish the operator on one of the 3 default OperatorSources, or maybe provide the setup needed to install it using OLM (so it is not installed through the current Makefile)?

matskiv commented 4 years ago

@slopezz AFAIK updating one of the 3 default registries would require some manual work for each release. But I think we can automate publishing to our own application repo. E.g. a TravisCI job could be triggered on a new GH release/tag. @david-martin wdyt?
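Something along these lines, perhaps (a sketch only; the manifest directory path and Quay token variable are assumptions):

# .travis.yml (sketch)
language: python
install:
  - pip install operator-courier
deploy:
  provider: script
  # operator-courier push <manifest dir> <quay namespace> <package> <release> <token>
  script: operator-courier push deploy/olm-catalog/application-monitoring-operator integreatly application-monitoring-operator "${TRAVIS_TAG#v}" "$QUAY_TOKEN"
  on:
    tags: true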

slopezz commented 4 years ago

@david-martin what do you think about publishing AMO (at least on integreatly OperatorSource)?

david-martin commented 4 years ago

@slopezz At the moment, it's looking likely we'll drop AMO from the integreatly-operator (in OpenShift 4) and put AMO into maintenance mode on the v0 branch for existing Integreatly/RHMI 1.x clusters on OpenShift 3. I'll explain the thinking behind this further below.

As such, it's unlikely we'd take on the publishing of AMO to an OperatorSource. However, you are more than welcome to do this yourself if you like (we can add you as a contributor, no problem).

The rationale for this is a number of things:

So, right now, our intent in the shorter term is to look into solving some of the above problems in a more efficient and Operator/OLM-friendly way that meets the needs of Integreatly on OpenShift 4, and this is likely to take the form of changes in the integreatly-operator rather than in AMO.

slopezz commented 4 years ago

@david-martin It makes sense; we started using AMO because we thought it was the way to go for application monitoring, following the RHMI strategy.

Right now we are using it mainly for dev purposes, and have found it very useful: the engineering team can easily start playing with Grafana and Prometheus in order to build 3scale dashboards and alerts. But we understand your current concerns about it, and we definitely think that all Red Hat products (not only Integration ones) should use the same monitoring stack, which can lead to the standardization of application monitoring.

We have done a quick test of the current user workload monitoring state (Tech Preview). We don't think it is ready to be used in production at the moment, but we think that with a few improvements it can be the winning monitoring stack, at least for the Prometheus part (not Grafana, which is out of scope).

Below I add our initial user workload monitoring test, so you can maybe benefit from it in case you haven't looked at it yet.

At the end of the test we have added a few takeaways that we can discuss afterwards; maybe we can work together with people from the OpenShift monitoring team to provide real feedback to help improve the product.

User Workload Monitoring Test

We have done a quick test of how user workload monitoring works, in order to check its current features and its viability for the 3scale product (both on-prem and SaaS). We have used the latest OCP 4.4.0-rc.4.

Architecture

The architecture is described at https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/user-workload-monitoring.md (see the architecture diagram there):

1. There are two Prometheus instances (each one scraping its own ServiceMonitors): the existing cluster Prometheus in openshift-monitoring, and the new user workload Prometheus in openshift-user-workload-monitoring.

2. Then there is a Thanos instance, which takes data from the two previous Prometheus time series databases, so it has both cluster data (resources, memory/cpu usage...) and application data.

3. Finally, any Prometheus data consumer (like the openshift-console Metrics tab, grafana, kiali...) should take data from Thanos (the one having all the data).

How to set it up

Following the official docs, you basically need to create a new ConfigMap in the namespace openshift-monitoring:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    techPreviewUserWorkload:
      enabled: true

And immediately the user workload monitoring stack is created in the namespace openshift-user-workload-monitoring.

Then you just need to deploy any application with metrics in any namespace, and the user workload Prometheus will pick it up through its ServiceMonitor.

In our case, we just deployed a redis prometheus-exporter in the namespace prometheus-exporter, as shown below.
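For illustration, a ServiceMonitor along these lines is enough (a sketch; the labels and port name are assumptions based on our test setup):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: redis-exporter
  namespace: prometheus-exporter
spec:
  selector:
    matchLabels:
      app: redis-exporter   # must match the exporter Service's labels
  endpoints:
    - port: metrics         # named Service port exposing /metrics
      interval: 30s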

Metrics

Then, if you go to the OpenShift Console, inside the application namespace prometheus-exporter, and open the Monitoring > Metrics tab, you can execute PromQL queries. We have executed two different queries:

Both queries show data, because we are really executing the queries against the Thanos instance (the one with data from both Prometheus stacks).

The documentation says that, as an administrator, you can go to the Prometheus UI (not the embedded Prometheus querier inside the OpenShift console) by clicking on the corresponding link in the console.

And execute application PromQL queries there. But it does not work, because the cluster Prometheus doesn't have application data; application data is in the user-workload-monitoring Prometheus (and also in Thanos, which has everything).

In addition, the user workload monitoring Prometheus does not include a public Route to check current alerts (which personally I find useful), unlike the cluster Prometheus.

Alerts

We have created a sample RedisDown PrometheusRule with fake content (redis_up == 1), so we can fire a fake alert precisely because redis is actually up and running, reporting value 1.
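The rule looks roughly like this (a sketch reconstructed from the description above; the metadata, labels, and annotations are assumptions):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: redis-down
  namespace: prometheus-exporter
spec:
  groups:
    - name: redis.rules
      rules:
        - alert: RedisDown
          # Intentionally inverted: fires while redis is up (redis_up == 1)
          expr: redis_up == 1
          for: 1m
          labels:
            severity: warning
          annotations:
            message: Fake alert to test user workload alerting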

Then the documentation says that the alert should appear in the same OpenShift Console, inside the application namespace prometheus-exporter, under the Monitoring > Alerting tab.

But there, there is no active application alert (only 2 active alerts from the cluster Prometheus), and if we look for our specific sample alert RedisDown in the search box (including firing and not-firing alerts), no alerts are found.

So we can see that, unlike the Metrics tab where queries go to Thanos (which has data from both Prometheus instances), in this case the OpenShift Console Monitoring > Alerting tab seems to show information only about cluster Prometheus alerts.

But if we go to the AlertManager UI by clicking on the corresponding link in the console:

Here we can see both the cluster Prometheus alerts (2 alerts firing) and the application alerts (1 fake alert firing).

So it seems that, for some reason, AlertManager alerts from user-workload-monitoring, although present, are hidden from the embedded Alerting tab in the OpenShift Console.

Grafana

User workload monitoring does not include a Grafana instance (it is out of scope), and the current cluster Grafana is not operator-managed; it is a static deployment with specific volumes mounting specific Kubernetes Grafana dashboards from ConfigMaps.

So if you want application dashboards, you need your own Grafana instance (like the integreatly grafana-operator, for example, with autodiscovery of dashboards using labels).
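With the grafana-operator, dashboard autodiscovery boils down to labeled CRs like this one (a minimal sketch; the label key/value must match whatever dashboardLabelSelector the Grafana CR is configured with):

apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: example-dashboard
  namespace: prometheus-exporter
  labels:
    monitoring-key: middleware   # assumed selector label; must match the Grafana CR
spec:
  name: example-dashboard.json
  json: |
    {
      "title": "Example dashboard",
      "panels": [],
      "schemaVersion": 16
    }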

Takeaways