openshift / cluster-monitoring-operator

Manage the OpenShift monitoring stack
Apache License 2.0

MON-3802: implement cross-namespace rules for UWM #2307

Closed: simonpasquier closed this 2 weeks ago

simonpasquier commented 6 months ago

This change introduces a way to deploy user-defined rules which are not scoped to their namespace of origin.

To enable the feature, a user-defined monitoring admin needs to configure at least one namespace in the UWM ConfigMap:

    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: user-workload-monitoring-config
      namespace: openshift-user-workload-monitoring
    data:
      config.yaml: |-
        namespacesWithoutLabelEnforcement: [ user-monitoring-shared ]

For all PrometheusRule objects defined in the user-monitoring-shared namespace, Prometheus and Thanos Ruler evaluate the PromQL expressions without enforcing the namespace label of origin. This makes it possible to have generic rules that apply to all (or a subset of) the user projects instead of having individual rule objects in each user project.
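The effect of the exemption can be sketched with a toy model. This is only an illustration in Python, not the code used by the operator: `Matcher`, `enforce_namespace` and `render` are hypothetical names, selectors are modelled as plain lists of matchers, and the sketch assumes that enforcement replaces any existing `namespace` matcher with the rule's namespace of origin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Matcher:
    label: str
    op: str  # one of "=", "!=", "=~", "!~"
    value: str


def enforce_namespace(matchers, rule_namespace, exempt_namespaces):
    """Return the matchers evaluated for a rule defined in rule_namespace."""
    if rule_namespace in exempt_namespaces:
        # Rules from an exempt namespace are evaluated as written.
        return list(matchers)
    # Assumption: any namespace matcher is replaced by the namespace of origin.
    kept = [m for m in matchers if m.label != "namespace"]
    return kept + [Matcher("namespace", "=", rule_namespace)]


def render(metric, matchers):
    """Render a metric name and its matchers as a PromQL selector string."""
    inner = ",".join(f'{m.label}{m.op}"{m.value}"' for m in matchers)
    return f"{metric}{{{inner}}}"


if __name__ == "__main__":
    expr = [Matcher("label_pod_security_kubernetes_io_enforce", "!=", "restricted")]
    exempt = {"user-monitoring-shared"}
    # Rule defined in a regular user project: scoped to that project.
    print(render("kube_namespace_labels", enforce_namespace(expr, "ns1", exempt)))
    # Rule defined in the shared namespace: no namespace scoping added.
    print(render("kube_namespace_labels",
                 enforce_namespace(expr, "user-monitoring-shared", exempt)))
```

In this model, the same expression only sees series from `ns1` when the rule lives in `ns1`, while the copy in `user-monitoring-shared` can match series from any namespace.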

The capability is enabled by default but a cluster admin can decide to disable it in the CMO ConfigMap:

    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |-
        userWorkloadEnabled: true
        userWorkload:
          rulesWithoutLabelEnforcementAllowed: false
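How the two ConfigMaps combine can be sketched as follows. This is an illustration in Python rather than the operator's actual logic: `is_exempt` is a hypothetical helper, both `config.yaml` payloads are assumed to be already parsed into dictionaries, and the default of `True` mirrors the "enabled by default" statement in this description.

```python
def is_exempt(namespace, cmo_config, uwm_config):
    """Tell whether rules from `namespace` skip namespace label enforcement.

    cmo_config: parsed config.yaml of cluster-monitoring-config (cluster admin).
    uwm_config: parsed config.yaml of user-workload-monitoring-config (UWM admin).
    """
    # The cluster admin can veto the capability; assumed to default to allowed.
    allowed = cmo_config.get("userWorkload", {}).get(
        "rulesWithoutLabelEnforcementAllowed", True
    )
    # The UWM admin lists the namespaces that are exempt from enforcement.
    exempt = uwm_config.get("namespacesWithoutLabelEnforcement", [])
    return bool(allowed) and namespace in exempt


if __name__ == "__main__":
    uwm = {"namespacesWithoutLabelEnforcement": ["user-monitoring-shared"]}
    cmo_off = {"userWorkload": {"rulesWithoutLabelEnforcementAllowed": False}}
    print(is_exempt("user-monitoring-shared", {}, uwm))
    print(is_exempt("user-monitoring-shared", cmo_off, uwm))
```

With these assumptions, a namespace is exempt only when the UWM admin lists it and the cluster admin has not disabled the capability.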

For example, a user-defined monitoring admin can create a single rule that fires when a user namespace doesn't enforce the Restricted pod security policy:

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: security
      namespace: user-monitoring-shared
    spec:
      groups:
        - name: pod-security-policy
          rules:
            - alert: "NamespaceNotEnforcingRestrictedPolicy"
              expr: kube_namespace_labels{namespace!~"(openshift|kube).*|default",label_pod_security_kubernetes_io_enforce!="restricted"}
              for: 5m
              annotations:
                summary: "Restricted policy not enforced"
                description: "Namespace {{ $labels.namespace }} doesn't enforce the Restricted pod security policy."
              labels:
                severity: warning
openshift-ci-robot commented 6 months ago

@simonpasquier: This pull request references MON-3802 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-monitoring-operator/pull/2307):

> * [ ] I added CHANGELOG entry for this change.
> * [ ] No user facing changes, so no entry in CHANGELOG was needed.

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-monitoring-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
simonpasquier commented 6 months ago

/skip

simonpasquier commented 6 months ago

/cc @bburt-rh /cc @jan--f

simonpasquier commented 6 months ago

current status:

jan--f commented 6 months ago

As mentioned in the sync earlier, I would like to make sure we make it very clear to the user what we mean by cross-namespace. IIUC this means user namespaces as well as system namespaces. Since this currently talks about cross-namespace in the context of UWM, I think we should make it explicit that system namespaces can be queried as well for such rules (if that is actually the case). I'm sure @bburt-rh would know best how to phrase that.

simonpasquier commented 6 months ago

Summarizing offline discussions about the console issue:

The problem is that cross-namespace rules won't be displayed in the developer console.

As an example, I've deployed a cross-namespace rule (NamespaceNotEnforcingRestrictedPolicy) into the user-monitoring-shared namespace. The rule fires an alert for the ns1 namespace but the alert isn't visible in the dev console:

Screenshot from 2024-04-16 12-22-13

It is visible in the admin console though:

Screenshot from 2024-04-16 12-23-09

In terms of user experience, this is less than ideal: a user with access only to the ns1 project can't see that the alert is active, and they can't silence it if they receive an Alertmanager notification for it.

The reason behind the issue is that the console uses the /api/v1/rules endpoint exposed by prom-label-proxy which will only return alerting rules with a static namespace="<selected namespace>" label.

Possible options being discussed:

simonpasquier commented 6 months ago

/hold

simonpasquier commented 6 months ago

Modify prom-label-proxy to return any rule that matches the given namespace or that has an alert matching the given namespace. It looks like the most appropriate solution and something that also makes sense outside of OCP.
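The difference between the current filtering and this option can be sketched as follows. This is a simplified Python illustration, not the prom-label-proxy implementation (which is written in Go): the function names are hypothetical, and a rule is modelled as a dictionary with `labels` and `alerts` keys, loosely following the shape of alerting rules in the `/api/v1/rules` response.

```python
def rule_visible_current(rule, namespace):
    """Current behaviour: only the rule's static namespace label counts."""
    return rule.get("labels", {}).get("namespace") == namespace


def rule_visible_proposed(rule, namespace):
    """Proposed behaviour: also keep rules with an alert for the namespace."""
    if rule_visible_current(rule, namespace):
        return True
    return any(
        alert.get("labels", {}).get("namespace") == namespace
        for alert in rule.get("alerts", [])
    )


if __name__ == "__main__":
    # Cross-namespace rule from this PR, currently firing for ns1.
    rule = {
        "name": "NamespaceNotEnforcingRestrictedPolicy",
        "labels": {"severity": "warning"},
        "alerts": [{"labels": {"namespace": "ns1", "severity": "warning"}}],
    }
    print(rule_visible_current(rule, "ns1"))
    print(rule_visible_proposed(rule, "ns1"))
```

Under these assumptions, a user who selects `ns1` would see the rule only with the proposed filter, and only while it has an alert for `ns1`.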

I tested this with https://github.com/openshift/prom-label-proxy/pull/369 and it's almost working. When clicking on the alert link to open the PromQL expression in the metrics dashboard, prom-label-proxy replies with a 400 status code and the message: label matcher value (namespace="user-monitoring-shared") conflicts with injected value (namespace!~"(openshift|kube).*|default"). This is because prom-label-proxy runs with -error-on-replace.

Screenshot from 2024-04-17 14-46-54

Screenshot from 2024-04-17 14-46-49

openshift-ci-robot commented 4 months ago

@simonpasquier: This pull request references MON-3802 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.17.0" version, but no target version was set.

simonpasquier commented 2 months ago

/hold cancel

simonpasquier commented 2 months ago

/retest-required

simonpasquier commented 2 months ago

/retest-required

simonpasquier commented 2 months ago

/skip

simonpasquier commented 2 months ago

/assign @machine424

Tai-RedHat commented 2 months ago

PR tested with cluster-bot: other user-defined namespaces can trigger the alert defined in the user-monitoring-shared namespace. Test case: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-75384

/label qe-approved

openshift-ci-robot commented 2 months ago

@simonpasquier: This pull request references MON-3802 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.18.0" version, but no target version was set.

Tai-RedHat commented 2 months ago

If I configure two or more shared namespaces and then create the same PrometheusRule objects in each of them, this results in repeated alerts. Should we warn or restrict users in this situation? config.yaml: 'namespacesWithoutLabelEnforcement: [ns1, ns2]'

Screenshot 2024-08-13 at 12 55 49
simonpasquier commented 2 months ago

> If I configure two or more shared namespaces and then create the same PrometheusRule objects in each of them, this results in repeated alerts. Should we warn or restrict users in this situation?

Thanks for the testing @Tai-RedHat! I don't see a strong reason to prevent this situation (and it would be quite hard to detect). But it would be good to mention it in the docs.

machine424 commented 1 month ago

I've just realized this is assigned to me, I'll take a look.

simonpasquier commented 1 month ago

From https://github.com/openshift/cluster-monitoring-operator/pull/2307#discussion_r1734441777

(maybe it's "safer" to have RulesWithoutLabelEnforcementAllowed disabled by default.)

My initial intention was to avoid friction in adopting this feature but I'm also ok making it opt-in as it's less surprising for platform admins. We can also keep it opt-in for a few releases and then turn it on by default.

@jan--f WDYT?

machine424 commented 2 weeks ago

/lgtm

You'll make many users happy with this.

openshift-ci-robot commented 2 weeks ago

/retest-required

Remaining retests: 0 against base HEAD e6e76f7d844cc430b8be9ce1a9314d2013faa7b6 and 2 for PR HEAD eaf43fea22ae56557b00bef318de3e52ef4bea8f in total

machine424 commented 2 weeks ago

/lgtm

slashpai commented 2 weeks ago

/lgtm

openshift-ci[bot] commented 2 weeks ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: machine424, simonpasquier, slashpai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/openshift/cluster-monitoring-operator/blob/master/OWNERS)~~ [machine424,simonpasquier,slashpai]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment
openshift-ci-robot commented 2 weeks ago

/retest-required

Remaining retests: 0 against base HEAD 04dbe83e4b4d6b576aa2e14fdbadbd1de3ea2016 and 2 for PR HEAD 6adb5214ee2a8b4a8e143db121471ac2878b74e9 in total