openshift / cluster-monitoring-operator

Manage the OpenShift monitoring stack
Apache License 2.0

MON-3707: Add ipsec state metric into telemetry #2326

Closed JoshSalomon closed 6 months ago

JoshSalomon commented 6 months ago

Add the metric openshift:openshift_network_operator_ipsec_state:sum to telemetry. This metric captures the IPsec state of the cluster (Disabled, External, or Full) and whether the state was set by the legacy API (OCP 4.14 or earlier) or the new API (OCP 4.15+).

openshift-ci-robot commented 6 months ago

@JoshSalomon: This pull request references MON-3707 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-monitoring-operator/pull/2326):

> Add the metric openshift:openshift_network_operator_ipsec_state:sum to telemetry
> This metric captures ipsec state of the cluster (Disabled, External or Full) and whether the state was set by the legacy API (OCP 4.14 or before) or the new API (OCP 4.15+)
>
> * [ ] I added CHANGELOG entry for this change.
> * [ ] No user facing changes, so no entry in CHANGELOG was needed.

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-monitoring-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

JoshSalomon commented 6 months ago

/retest-required

JoshSalomon commented 6 months ago

/retest-required

JoshSalomon commented 6 months ago

IIUC the goal of the metric is to report 1 for any active combination of mode/legacy API. The following expression would be more suited:

`group by(mode,is_legacy_api) (openshift_network_operator_ipsec_state{namespace=~"openshift-network-operator"})`

At any point in time, there is only one active combination (the code resets the gauge and then calls ipsecStateGauge.WithLabelValues once), so I don't think there is a big difference here.

Finally the metric name could be improved IMHO: openshift:openshift_network_operator_ipsec_state:sum stutters, I'd suggest cluster:openshift_network_operator_ipsec_state:info.

I agree with both comments, but are they critical? The CNO PR took a long time to approve, and we need this in 4.16. Are these comments really important enough to put this feature at risk for 4.16?
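
For readers unfamiliar with the reset-then-set pattern described above, here is a minimal stdlib-only sketch of it. It is not the CNO code: `gaugeVec`, `Reset`, `Set`, and `updateIPsecState` are hypothetical stand-ins for a labeled Prometheus gauge, and the label values are illustrative only.

```go
package main

import "fmt"

// labels is a hypothetical stand-in for one (mode, is_legacy_api) combination.
type labels struct {
	Mode        string
	IsLegacyAPI string
}

// gaugeVec is a toy model of a labeled gauge; it is NOT the Prometheus client API.
type gaugeVec struct {
	series map[labels]float64
}

func newGaugeVec() *gaugeVec { return &gaugeVec{series: map[labels]float64{}} }

// Reset drops every series, mirroring the "resets the gauge" step.
func (g *gaugeVec) Reset() { g.series = map[labels]float64{} }

// Set records a value for exactly one label combination.
func (g *gaugeVec) Set(l labels, v float64) { g.series[l] = v }

// updateIPsecState models the reset-then-set update: once it returns,
// only the current combination is exported, with value 1.
func updateIPsecState(g *gaugeVec, mode string, legacy bool) {
	g.Reset()
	g.Set(labels{Mode: mode, IsLegacyAPI: fmt.Sprint(legacy)}, 1)
}

func main() {
	g := newGaugeVec()
	updateIPsecState(g, "Disabled", true)
	updateIPsecState(g, "Full", false)
	// Number of exported series and the value of the active one.
	fmt.Println(len(g.series), g.series[labels{"Full", "false"}])
}
```

Under this model a single reporter never exposes more than one combination at a time, which is the basis for the claim that summing and grouping give the same result today.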

JoshSalomon commented 6 months ago

/retest-required

JoshSalomon commented 6 months ago

/retest-required

JoshSalomon commented 6 months ago

/retest-required

JoshSalomon commented 6 months ago

/retest-required

JoshSalomon commented 6 months ago

/retest-required

JoshSalomon commented 6 months ago

/hold

JoshSalomon commented 6 months ago

/retest-required

JoshSalomon commented 6 months ago

/retest-required

JoshSalomon commented 6 months ago

/retest-required

rexagod commented 6 months ago

> At any point in time, there is only one active combination (the code resets the gauge and then calls ipsecStateGauge.WithLabelValues once), so I don't think there is a big difference here.

If the underlying code changes, a group by will ensure the value safely remains a boolean, and won't be prone to going out of bounds.

This is obvious but just to be explicit, you'll also need to s/sum/info in the source, otherwise this won't work. I'm not sure if there's a PR up for that in CNO.
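
To illustrate the point about staying within bounds, the sketch below models the two aggregations in plain Go. It is only an approximation of PromQL semantics: `sample`, `sumBy`, and `groupBy` are hypothetical names, and the duplicated input (the same combination reported twice) is an assumed scenario, not something observed in the operator.

```go
package main

import "fmt"

// sample is a hypothetical scraped series reduced to the two labels the rule keeps.
type sample struct {
	Mode, IsLegacyAPI string
	Value             float64
}

// key identifies one aggregation group.
type key struct{ Mode, IsLegacyAPI string }

// sumBy models `sum by(mode,is_legacy_api)`: values within a group are added.
func sumBy(in []sample) map[key]float64 {
	out := map[key]float64{}
	for _, s := range in {
		out[key{s.Mode, s.IsLegacyAPI}] += s.Value
	}
	return out
}

// groupBy models `group by(mode,is_legacy_api)`: every non-empty group yields 1.
func groupBy(in []sample) map[key]float64 {
	out := map[key]float64{}
	for _, s := range in {
		out[key{s.Mode, s.IsLegacyAPI}] = 1
	}
	return out
}

func main() {
	// Assumed scenario: the same combination is reported by two sources.
	in := []sample{{"Full", "false", 1}, {"Full", "false", 1}}
	k := key{"Full", "false"}
	fmt.Println(sumBy(in)[k], groupBy(in)[k])
}
```

With a single reporter both functions return 1, so the difference only shows up if the underlying code or deployment changes.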

rexagod commented 6 months ago

PS. The tests are failing due to internal changes, and should be fixed soon.

zshi-redhat commented 6 months ago

> > At any point in time, there is only one active combination (the code resets the gauge and then calls ipsecStateGauge.WithLabelValues once), so I don't think there is a big difference here.
>
> If the underlying code changes, a group by will ensure the value safely remains a boolean, and won't be prone to going out of bounds.
>
> This is obvious but just to be explicit, you'll also need to s/sum/info in the source, otherwise this won't work. I'm not sure if there's a PR up for that in CNO.

The PR is up in CNO: https://github.com/openshift/cluster-network-operator/pull/2346. It got an LGTM from the networking team. @rexagod, would you mind double-checking?

JoshSalomon commented 6 months ago

/retest-required

JoshSalomon commented 6 months ago

/unhold

JoshSalomon commented 6 months ago

The SDN PR https://github.com/openshift/cluster-network-operator/pull/2346 was merged.

rexagod commented 6 months ago

/lgtm

openshift-ci[bot] commented 6 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoshSalomon, rexagod

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openshift/cluster-monitoring-operator/blob/master/OWNERS)~~ [rexagod]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

openshift-ci-robot commented 6 months ago

/retest-required

Remaining retests: 0 against base HEAD 5af508b31380d4b7b1a562e942559aea49b0121b and 2 for PR HEAD 499bcca13800687dcecbddecc8c717f6a75a1c00 in total

openshift-ci[bot] commented 6 months ago

@JoshSalomon: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-single-node | 499bcca13800687dcecbddecc8c717f6a75a1c00 | link | false | `/test e2e-aws-ovn-single-node` |

openshift-ci-robot commented 6 months ago

/retest-required

Remaining retests: 0 against base HEAD 98a17212947fd12f78c6d8e6d1d45775a692eae1 and 1 for PR HEAD 499bcca13800687dcecbddecc8c717f6a75a1c00 in total

rexagod commented 6 months ago

/retest-required

openshift-bot commented 6 months ago

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-monitoring-operator-container-v4.16.0-202404222343.p0.gbbde8c3.assembly.stream.el9 for distgit cluster-monitoring-operator. All builds following this will include this PR.

JoshSalomon commented 4 months ago

/cherrypick release-4.15

openshift-cherrypick-robot commented 4 months ago

@JoshSalomon: #2326 failed to apply on top of branch "release-4.15":

Applying: Add ipsec state metric into telemetry
Using index info to reconstruct a base tree...
M   Documentation/data-collection.md
M   Documentation/sample-metrics.md
M   Documentation/telemetry/telemeter_query
M   manifests/0000_50_cluster-monitoring-operator_04-config.yaml
Falling back to patching base and 3-way merge...
Auto-merging manifests/0000_50_cluster-monitoring-operator_04-config.yaml
Auto-merging Documentation/telemetry/telemeter_query
CONFLICT (content): Merge conflict in Documentation/telemetry/telemeter_query
Auto-merging Documentation/sample-metrics.md
CONFLICT (content): Merge conflict in Documentation/sample-metrics.md
Auto-merging Documentation/data-collection.md
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Add ipsec state metric into telemetry
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".