openshift / cluster-monitoring-operator

Manage the OpenShift monitoring stack
Apache License 2.0

MON-3962: set proxy_from_environment to true #2431

Closed. simonpasquier closed this 2 months ago.

simonpasquier commented 2 months ago
openshift-ci-robot commented 2 months ago

@simonpasquier: This pull request references MON-3962 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.17.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-monitoring-operator/pull/2431):

> * [ ] I added CHANGELOG entry for this change.
> * [ ] No user facing changes, so no entry in CHANGELOG was needed.

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-monitoring-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
simonpasquier commented 2 months ago

/hold

It needs https://github.com/openshift/cluster-monitoring-operator/pull/2424 first.

juzhao commented 2 months ago

/retest-required

simonpasquier commented 2 months ago

/retest-required

simonpasquier commented 2 months ago

/assign @machine424

machine424 commented 2 months ago

I can see some tests failing with:

alertmanager_test.go:86: waiting for Alertmanager openshift-monitoring/main: context deadline exceeded
...
alertmanager_test.go:847: waiting for Alertmanager openshift-monitoring/main: context deadline exceeded

e.g. in https://storage.googleapis.com/test-platform-results/pr-logs/pull/openshift_cluster-monitoring-operator/2431/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-operator/1820652424285655040/build-log.txt

I'm not sure if it's related to the "proxy prevents the AM cluster from forming" issue.

AM is logging:

ts=2024-08-06T04:10:18.882Z caller=main.go:181 level=info msg="Starting Alertmanager" version="(version=0.27.0, branch=master, revision=d7c1a7c6ac4b5482174797649834a47fc39d2575)"
ts=2024-08-06T04:10:18.882Z caller=main.go:182 level=info build_context="(go=go1.22.1 (Red Hat 1.22.1-1.el9) X:strictfipsruntime, platform=linux/amd64, user=root@6da8fc68d22b, date=20240604-12:44:30, tags=netgo,strictfipsruntime)"
ts=2024-08-06T04:10:18.915Z caller=cluster.go:263 level=warn component=cluster msg="failed to join cluster" err="2 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
ts=2024-08-06T04:10:18.915Z caller=cluster.go:265 level=info component=cluster msg="will retry joining cluster every 10s"
ts=2024-08-06T04:10:18.915Z caller=main.go:291 level=warn msg="unable to join gossip mesh" err="2 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
ts=2024-08-06T04:10:18.916Z caller=cluster.go:683 level=info component=cluster msg="Waiting for gossip to settle..." interval=2s
ts=2024-08-06T04:10:18.941Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2024-08-06T04:10:18.942Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2024-08-06T04:10:18.944Z caller=tls_config.go:313 level=info msg="Listening on" address=127.0.0.1:9093
ts=2024-08-06T04:10:18.944Z caller=tls_config.go:352 level=info msg="TLS is disabled." http2=false address=127.0.0.1:9093
ts=2024-08-06T04:10:18.980Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2024-08-06T04:10:18.980Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2024-08-06T04:10:20.916Z caller=cluster.go:708 level=info component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000808506s
ts=2024-08-06T04:10:28.918Z caller=cluster.go:700 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.002873666s
ts=2024-08-06T04:10:33.926Z caller=cluster.go:473 level=warn component=cluster msg=refresh result=failure addr=alertmanager-main-0.alertmanager-operated:9094 err="1 error occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
ts=2024-08-06T04:10:33.931Z caller=cluster.go:473 level=warn component=cluster msg=refresh result=failure addr=alertmanager-main-1.alertmanager-operated:9094 err="1 error occurred:\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
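
For context on the proxy angle: if cluster-wide proxy variables were injected into the Alertmanager pods, cluster-internal suffixes would normally have to be excluded via `NO_PROXY` so peer gossip traffic never goes through the proxy. A hypothetical sketch of such a container environment (proxy host and values are illustrative, not taken from this cluster):

```yaml
# Hypothetical container environment, for illustration only.
env:
  - name: HTTPS_PROXY
    value: http://proxy.example.com:3128
  - name: NO_PROXY
    # cluster-internal names (incl. the alertmanager-operated peers) bypass the proxy
    value: .cluster.local,.svc,localhost,127.0.0.1
```

Note that the errors above are DNS lookups failing ("no such host"), which happens before any proxy is consulted, so a misconfigured proxy is not an obvious fit for this symptom.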

Let's give this another try just in case.

/retest

simonpasquier commented 2 months ago

hmm, I'll investigate since the failure is persistent. Not sure how to explain it though...

/hold

openshift-ci-robot commented 2 months ago

@simonpasquier: This pull request references MON-3962 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.17.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-monitoring-operator/pull/2431):

> * [ ] I added CHANGELOG entry for this change.
> * [x] No user facing changes, so no entry in CHANGELOG was needed.

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-monitoring-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
simonpasquier commented 2 months ago

no need to retest because we need https://github.com/prometheus-operator/prometheus-operator/pull/6818

machine424 commented 2 months ago

> no need to retest because we need prometheus-operator/prometheus-operator#6818

Thanks for that. Should I understand that if the env vars are set and "proxy_from_environment": true isn't set (which https://github.com/prometheus-operator/prometheus-operator/pull/6818 is fixing now), AM will break? Or is https://github.com/prometheus-operator/prometheus-operator/pull/6818 just one of many fixes we need?

simonpasquier commented 2 months ago

> Should I understand that if the env vars are set and "proxy_from_environment": true isn't set (which https://github.com/prometheus-operator/prometheus-operator/pull/6818 is fixing now), AM will break?

No. But Alertmanager will fail if a user wants to configure no_proxy or any other proxy setting other than proxy_url. Given how the operator works, it would only break when AlertmanagerConfig is enabled in CMO though.
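
To make that failure mode concrete, here is a sketch of the kind of Alertmanager receiver settings involved. The field names follow upstream Alertmanager's `http_config` (backed by prometheus/common's HTTP client config); the receiver, URL, and proxy host are illustrative, not generated by this PR:

```yaml
# Illustrative Alertmanager config fragment, not taken from this PR.
receivers:
  - name: webhook
    webhook_configs:
      - url: https://example.com/hook
        http_config:
          # proxy_url alone is accepted by older operator versions
          proxy_url: http://proxy.example.com:3128
          # no_proxy (and other proxy settings beyond proxy_url) are what
          # tripped the operator's strict config validation before the fix
          no_proxy: .cluster.local,.svc
```

The point of prometheus-operator/prometheus-operator#6818 is to teach the operator's validation about these additional proxy fields so such a config is not rejected.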

> Or is https://github.com/prometheus-operator/prometheus-operator/pull/6818 just one of many fixes we need?

This is the only change we should need.

simonpasquier commented 2 months ago

the downstream fix in the prometheus operator is https://github.com/openshift/prometheus-operator/pull/295

machine424 commented 2 months ago

For future readers:

The prometheus-operator logs from Loki were showing:

From container prometheus-operator in pod prometheus-operator-7b7b4c5d79-27pkm on host ip-10-0-18-233.us-west-2.compute.internal:

level=error ts=2024-08-06T14:11:45.720565037Z caller=klog.go:126 component=k8s_client_runtime func=ErrorDepth msg="sync \"openshift-monitoring/main\" failed: provision alertmanager configuration: failed to initialize from secret: yaml: unmarshal errors:
line 3: field proxy_from_environment not found in type alertmanager.httpClientConfig"
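
For readers unfamiliar with strict decoding: this is the generic failure mode where a strict YAML-to-struct decoder rejects any field its target type does not declare. A minimal Python stand-in (NOT the operator's actual Go code; the field set is a hypothetical subset) showing the same class of error:

```python
# Hypothetical subset of fields the "struct" declares; anything else is rejected,
# just as alertmanager.httpClientConfig rejected proxy_from_environment above.
KNOWN_FIELDS = {"proxy_url", "no_proxy", "tls_config"}

def strict_decode(cfg: dict) -> dict:
    """Reject any key that the target type does not declare."""
    unknown = sorted(set(cfg) - KNOWN_FIELDS)
    if unknown:
        raise ValueError(
            f"field {unknown[0]} not found in type httpClientConfig"
        )
    return cfg

strict_decode({"proxy_url": "http://proxy.example.com:3128"})  # accepted
try:
    strict_decode({"proxy_from_environment": True})
except ValueError as e:
    print(e)  # field proxy_from_environment not found in type httpClientConfig
```

That is why the fix has to land in the component doing the decoding (the operator vendoring a newer Alertmanager config type), not in the config being decoded.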

simonpasquier commented 2 months ago

/retest-required

simonpasquier commented 2 months ago

/retest-required

simonpasquier commented 2 months ago

/hold cancel

simonpasquier commented 2 months ago

/assign @jan--f
/assign @machine424

machine424 commented 2 months ago

/lgtm

openshift-ci[bot] commented 2 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: machine424, simonpasquier

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openshift/cluster-monitoring-operator/blob/master/OWNERS)~~ [machine424,simonpasquier]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
openshift-ci[bot] commented 2 months ago

@simonpasquier: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
openshift-bot commented 2 months ago

[ART PR BUILD NOTIFIER]

Distgit: cluster-monitoring-operator

This PR has been included in build `cluster-monitoring-operator-container-v4.18.0-202408132015.p0.gb8a8b2e.assembly.stream.el9`. All builds following this will include this PR.