Closed: simonpasquier closed this pull request 2 months ago
@simonpasquier: This pull request references MON-3962 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.17.0" version, but no target version was set.
/hold
It needs https://github.com/openshift/cluster-monitoring-operator/pull/2424 first.
/retest-required
/retest-required
/assign @machine424
I can see some tests failing with:
alertmanager_test.go:86: waiting for Alertmanager openshift-monitoring/main: context deadline exceeded
...
alertmanager_test.go:847: waiting for Alertmanager openshift-monitoring/main: context deadline exceeded
I'm not sure whether it's related to the "proxy prevents the AM cluster from forming" issue.
AM is logging:
ts=2024-08-06T04:10:18.882Z caller=main.go:181 level=info msg="Starting Alertmanager" version="(version=0.27.0, branch=master, revision=d7c1a7c6ac4b5482174797649834a47fc39d2575)"
ts=2024-08-06T04:10:18.882Z caller=main.go:182 level=info build_context="(go=go1.22.1 (Red Hat 1.22.1-1.el9) X:strictfipsruntime, platform=linux/amd64, user=root@6da8fc68d22b, date=20240604-12:44:30, tags=netgo,strictfipsruntime)"
ts=2024-08-06T04:10:18.915Z caller=cluster.go:263 level=warn component=cluster msg="failed to join cluster" err="2 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
ts=2024-08-06T04:10:18.915Z caller=cluster.go:265 level=info component=cluster msg="will retry joining cluster every 10s"
ts=2024-08-06T04:10:18.915Z caller=main.go:291 level=warn msg="unable to join gossip mesh" err="2 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
ts=2024-08-06T04:10:18.916Z caller=cluster.go:683 level=info component=cluster msg="Waiting for gossip to settle..." interval=2s
ts=2024-08-06T04:10:18.941Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2024-08-06T04:10:18.942Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2024-08-06T04:10:18.944Z caller=tls_config.go:313 level=info msg="Listening on" address=127.0.0.1:9093
ts=2024-08-06T04:10:18.944Z caller=tls_config.go:352 level=info msg="TLS is disabled." http2=false address=127.0.0.1:9093
ts=2024-08-06T04:10:18.980Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2024-08-06T04:10:18.980Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2024-08-06T04:10:20.916Z caller=cluster.go:708 level=info component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000808506s
ts=2024-08-06T04:10:28.918Z caller=cluster.go:700 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.002873666s
ts=2024-08-06T04:10:33.926Z caller=cluster.go:473 level=warn component=cluster msg=refresh result=failure addr=alertmanager-main-0.alertmanager-operated:9094 err="1 error occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
ts=2024-08-06T04:10:33.931Z caller=cluster.go:473 level=warn component=cluster msg=refresh result=failure addr=alertmanager-main-1.alertmanager-operated:9094 err="1 error occurred:\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
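For future readers: the "no such host" errors above mean the cluster DNS (172.30.0.10 in the log) has no records yet for the peer pods behind the alertmanager-operated headless service. The same resolver failure can be reproduced with a plain lookup; this is just a diagnostic sketch using the hostnames from the log, not part of any fix:

```python
import socket

# Peer hostnames taken from the Alertmanager log above. Outside the
# cluster (or before the headless-service records exist), each lookup
# fails with the same kind of resolver error AM reports.
for peer in ("alertmanager-main-0.alertmanager-operated",
             "alertmanager-main-1.alertmanager-operated"):
    try:
        print(peer, "->", socket.gethostbyname(peer))
    except socket.gaierror as err:
        print(peer, "->", err)
```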
Let's give this another try just in case.
/retest
Hmm, I'll investigate since the failure is persistent. Not sure how to explain it though...
/hold
no need to retest because we need https://github.com/prometheus-operator/prometheus-operator/pull/6818
Thanks for that.
Should I understand that if the env vars are set and "proxy_from_environment": true isn't set (which https://github.com/prometheus-operator/prometheus-operator/pull/6818 is fixing now), AM will break?
Or is https://github.com/prometheus-operator/prometheus-operator/pull/6818 just one of many fixes we need?
> Should I understand that if the env vars are set and "proxy_from_environment": true isn't set (which https://github.com/prometheus-operator/prometheus-operator/pull/6818 is fixing now), AM will break?
No. But Alertmanager will fail if a user wants to configure no_proxy or any other proxy setting other than proxy_url. Given how the operator works, it would only break when AlertmanagerConfig is enabled in CMO though.
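As a concrete illustration (receiver name and URLs are made up), this is the shape of an Alertmanager http_config that uses a proxy setting other than proxy_url:

```yaml
receivers:
  - name: example-webhook            # hypothetical receiver
    webhook_configs:
      - url: https://hooks.example.com/notify
        http_config:
          proxy_url: http://proxy.example.com:3128
          no_proxy: .cluster.local,10.0.0.0/8   # rejected before the fix
```

Before prometheus-operator/prometheus-operator#6818, the operator's strict config parsing only knew proxy_url, so a configuration carrying fields like no_proxy or proxy_from_environment failed to provision.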
> Or is https://github.com/prometheus-operator/prometheus-operator/pull/6818 just one of many fixes we need?
This is the only change we should need.
the downstream fix in the prometheus operator is https://github.com/openshift/prometheus-operator/pull/295
For future readers:
Prometheus-operator's logs from Loki (pod prometheus-operator-7b7b4c5d79-27pkm) were showing:
level=error ts=2024-08-06T14:11:45.720565037Z caller=klog.go:126 component=k8s_client_runtime func=ErrorDepth msg="sync \"openshift-monitoring/main\" failed: provision alertmanager configuration: failed to initialize from secret: yaml: unmarshal errors:
  line 3: field proxy_from_environment not found in type alertmanager.httpClientConfig"
/retest-required
/retest-required
/hold cancel
/assign @jan--f
/assign @machine424
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: machine424, simonpasquier
The full list of commands accepted by this bot can be found here.
The pull request process is described here.
@simonpasquier: all tests passed!
Full PR test history. Your PR dashboard.
[ART PR BUILD NOTIFIER]
Distgit: cluster-monitoring-operator This PR has been included in build cluster-monitoring-operator-container-v4.18.0-202408132015.p0.gb8a8b2e.assembly.stream.el9. All builds following this will include this PR.