openshift / origin

`oc adm router --expose-metrics` generates DC which is incompatible with the latest prom/haproxy-exporter image #15982

Closed jperville closed 6 years ago

jperville commented 7 years ago

The oc adm router --expose-metrics command generates a router DC that runs the prom/haproxy-exporter:latest image as a sidecar to the main router container, as documented in https://docs.openshift.com/container-platform/3.5/install_config/router/default_haproxy_router.html#exposing-the-router-metrics.

Recently, a new version of prom/haproxy-exporter was released that declares its options in an incompatible way (prefixed with a double dash instead of a single one). As a consequence, oc adm router --expose-metrics can no longer safely use prom/haproxy-exporter:latest, because the argument syntax now depends on which version that tag resolves to. We must specify the image version explicitly and adjust the entrypoint args manually.

With prom/haproxy-exporter:v0.7.1, the option to declare the scrape URI is called -haproxy.scrape-uri (with a single dash). As of prom/haproxy-exporter:v0.8.0, the same option is prefixed with a double dash (--haproxy.scrape-uri).
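
The version-dependent spelling described above can be sketched as a small shell helper. This is only an illustration: the function name scrape_uri_flag is hypothetical, and only v0.7.1 (single dash) and v0.8.0 (double dash) are confirmed in this report. Treating 0.8 as the cutoff for all other versions is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print the scrape-URI flag spelling for a given
# prom/haproxy-exporter tag. Only v0.7.1 and v0.8.0 are confirmed by the
# report; using 0.8 as the cutoff for other versions is an assumption.
scrape_uri_flag() {
    version="${1#v}"        # strip a leading "v", e.g. v0.8.0 -> 0.8.0
    major="${version%%.*}"  # text before the first dot
    rest="${version#*.}"    # text after the first dot
    minor="${rest%%.*}"     # text before the next dot

    # Reject tags that are not numeric versions (for example "latest").
    case "$major$minor" in
        ''|*[!0-9]*)
            echo "unrecognized version: $1" >&2
            return 1
            ;;
    esac

    if [ "$major" -gt 0 ] || [ "$minor" -ge 8 ]; then
        printf '%s\n' "--haproxy.scrape-uri"
    else
        printf '%s\n' "-haproxy.scrape-uri"
    fi
}

scrape_uri_flag v0.7.1
scrape_uri_flag v0.8.0
```

The helper deliberately fails on a floating tag such as latest, since the flag spelling cannot be known without pinning a version.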

As a consequence, starting the metrics-exporter container with -haproxy.scrape-uri=http://$(STATS_USERNAME):$(STATS_PASSWORD)@localhost:$(STATS_PORT)/haproxy?stats;csv as the first entrypoint argument no longer works, and the pod crashes with the following message in the metrics-exporter container log:

[root@origin-centos-72 ~]# oc logs -f router-1-kkd6m -c metrics-exporter
haproxy_exporter: error: unknown short flag '-a', try --help

As a workaround, I tried generating the DC with --metrics-image passed explicitly, but in that case the generated DC does not include any entrypoint arguments. The haproxy exporter container starts, but it cannot contact its scrape URI because none was passed to it.

My final workaround was to:

Version
[root@master ~]# oc version
oc v1.5.1
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://192.168.33.220.xip.io:8443
openshift v1.5.1
kubernetes v1.5.2+43a9be4

Steps To Reproduce
oadm router --expose-metrics
# wait until the router pod starts
wget --no-proxy -qO- http://$(oc get pod -n default -l router=router -o jsonpath="{ .items[*].status.podIP }"):9101/metrics

Current Result

The router pod crashes in loop, then the deployment fails.

Expected Result

The router pod should start and serve metrics on port 9101.

knobunc commented 7 years ago

I suspect that now the router supports metrics, we should deprecate this option. This needs some investigation.

ingcsmoreno commented 7 years ago

I'm running an OO v1.4 cluster and still have this issue. The "integrated" router metrics mentioned above don't work for me because they are constantly reset.

As a workaround, I deployed the router as @jperville did, then edited the router deploymentconfig and added the missing "-" in the args section, changing it from:

args:
            - >-
              -haproxy.scrape-uri=http://$(STATS_USERNAME):$(STATS_PASSWORD)@localhost:$(STATS_PORT)/haproxy?stats;csv

to:

args:
            - >-
              --haproxy.scrape-uri=http://$(STATS_USERNAME):$(STATS_PASSWORD)@localhost:$(STATS_PORT)/haproxy?stats;csv
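
The manual edit above could also be scripted. The following is a minimal sketch, not a supported procedure: the function name fix_scrape_flag is hypothetical, and the commented-out cluster command assumes the DC is named router in the default namespace.

```shell
#!/bin/sh
# Sketch: rewrite the single-dash scrape-URI argument to the double-dash form
# in text read from stdin. Lines that already use the double dash are left
# untouched, so the filter can be applied more than once.
fix_scrape_flag() {
    sed -E 's/(^|[^-])-haproxy\.scrape-uri=/\1--haproxy.scrape-uri=/'
}

# Demonstration on the args line from this thread (no cluster needed):
printf '%s\n' '  -haproxy.scrape-uri=http://$(STATS_USERNAME):$(STATS_PASSWORD)@localhost:$(STATS_PORT)/haproxy?stats;csv' \
    | fix_scrape_flag

# Assumed usage against a live cluster (DC name "router" and namespace
# "default" are assumptions; review the filtered YAML before replacing):
#   oc get dc router -n default -o yaml | fix_scrape_flag | oc replace -f -
```

This only corrects the flag prefix. It does not pin the exporter image version, which still has to be set explicitly as described in the original report.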

smarterclayton commented 7 years ago

We no longer support this path (we announced deprecation with innate metrics). The recommended path going forward is to manage your router config (plus image) as a config object.

pecameron commented 6 years ago

Removing --expose-metrics and --metrics-image is tracked by: https://github.com/openshift/origin/issues/1499026

openshift-bot commented 6 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 6 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot commented 6 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close