Duologic opened this issue 4 years ago
I have looked at the code a bit and came to a few conclusions.
First of all, I think the MustRegister idea @Duologic has described above is a red herring. I don't think the registration ever fails (even if DescribeByCollect is used – if the collection that is triggered by the registration fails, it will just lead to incomplete registration, which will ultimately not cause any problems because the registry used is not a pedantic registry).
I think the actual problem is the following:
The gcp-quota-exporter exposes its own up metric (https://github.com/mintel/gcp-quota-exporter/blob/master/main.go#L28). This is super weird as it prevents Prometheus from adding the usual synthetic up metric. According to https://prometheus.io/docs/instrumenting/writing_exporters/#failed-scrapes, this should be a metric called gcp_quota_up rather than just up. This would already help to not trigger alerts about the target flapping. However, one might also argue that there are not many useful metrics to return if the GCP API times out, so arguably the exporter should return a 5xx and fail the entire scrape. This would again trigger the target-flapping alerts, but now in a "proper" way. In either case, the problem is that the GCP API seems quite unreliable in general, so flapping is inevitable and in any case creates weirdly interrupted time series about the quotas.
Which leads me to the solution I'd like to suggest here: Since quota information is changing only very very slowly, but retrieving it is so exceedingly unreliable, I think it is OK to cache the result and return a cached version if the retrieval is failing. Perhaps then add a metric that contains the timestamp of the last retrieval, so that you can detect overly stale quota information.
https://prometheus.io/docs/instrumenting/writing_exporters/#scheduling has some best practices about this topic, but it doesn't really cover the scenario "usually fast enough for synchronously pulling the source metrics but quite often taking too long".
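To make that concrete, here is a minimal sketch of what such a caching collector could look like (fetchQuotaMetrics, the metric name, and the port are made up for illustration, not the exporter's actual code):

```go
// A minimal sketch of a caching collector: serve the last good result when the
// GCP API call fails, and expose the timestamp of the last successful fetch.
package main

import (
	"log"
	"net/http"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// fetchQuotaMetrics stands in for the exporter's real GCP quota API calls.
func fetchQuotaMetrics() ([]prometheus.Metric, error) {
	return nil, nil // placeholder
}

type cachingCollector struct {
	mu          sync.Mutex
	cached      []prometheus.Metric
	lastSuccess prometheus.Gauge
}

func newCachingCollector() *cachingCollector {
	return &cachingCollector{
		lastSuccess: prometheus.NewGauge(prometheus.GaugeOpts{
			Name: "gcp_quota_last_successful_fetch_timestamp_seconds",
			Help: "Unix timestamp of the last successful quota retrieval.",
		}),
	}
}

func (c *cachingCollector) Describe(ch chan<- *prometheus.Desc) {
	prometheus.DescribeByCollect(c, ch)
}

func (c *cachingCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()

	fresh, err := fetchQuotaMetrics()
	if err == nil {
		c.cached = fresh
		c.lastSuccess.SetToCurrentTime()
	} else {
		// Keep serving the cached metrics; staleness stays visible via the timestamp.
		log.Printf("quota fetch failed, serving cached metrics: %v", err)
	}
	for _, m := range c.cached {
		ch <- m
	}
	ch <- c.lastSuccess
}

func main() {
	prometheus.MustRegister(newCachingCollector())
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```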
In our particular case, just setting a much higher timeout and higher retry count might already help, but in any case, exposing an up metric is very confusing and should be changed to gcp_quota_up at the very least.
Thanks @beorn7.
Which leads me to the solution I'd like to suggest here: Since quota information is changing only very very slowly, but retrieving it is so exceedingly unreliable, I think it is OK to cache the result and return a cached version if the retrieval is failing. Perhaps then add a metric that contains the timestamp of the last retrieval, so that you can detect overly stale quota information.
+1 on the caching, it should mitigate the target-flapping completely.
Beorn has configured these values, which prevent it from flapping now:
GCP_EXPORTER_MAX_RETRIES=10
GCP_EXPORTER_HTTP_TIMEOUT=2m
The gcp-quota-exporter exposes its own up metric (https://github.com/mintel/gcp-quota-exporter/blob/master/main.go#L28). This is super weird as it prevents Prometheus from adding the usual synthetic up metric.
It makes sense; since we scrape two different APIs (project and region), we could present two gcp_quota_XXX_up metrics and leave the main up one to the Prometheus client library.
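Something roughly like this sketch (the metric names gcp_quota_project_up / gcp_quota_regions_up and the recordResult helper are just illustrative):

```go
// Sketch: one "up"-style gauge per GCP API we scrape, leaving the synthetic up
// metric to Prometheus itself.
package main

import (
	"errors"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	projectUp = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "gcp_quota_project_up",
		Help: "1 if the last call to the project quota API succeeded, 0 otherwise.",
	})
	regionsUp = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "gcp_quota_regions_up",
		Help: "1 if the last call to the regions quota API succeeded, 0 otherwise.",
	})
)

// recordResult flips the per-API gauge after each API call.
func recordResult(g prometheus.Gauge, err error) {
	if err != nil {
		g.Set(0)
		return
	}
	g.Set(1)
}

func main() {
	prometheus.MustRegister(projectUp, regionsUp)

	// A failed regions call only flips gcp_quota_regions_up, while the synthetic
	// up metric stays with Prometheus because the scrape itself still succeeds.
	recordResult(projectUp, nil)
	recordResult(regionsUp, errors.New("googleapi: timeout"))
}
```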
Since quota information is changing only very very slowly, but retrieving it is so exceedingly unreliable, I think it is OK to cache the result and return a cached version if the retrieval is failing. Perhaps then add a metric that contains the timestamp of the last retrieval, so that you can detect overly stale quota information.
I am not so sure about this. An issue on the scraper side would look exactly like an issue on the Google API side, and it could potentially last for a long time.
Sure, you could use the last successful scrape timestamp metric to alert if it lasts too long... but is that not just complicating things?
We would be returning an up=1 when the scrape has actually failed... that does not seem right to me.
In our particular case, just setting a much higher timeout and higher retry count might already help, but in any case, exposing an up metric is very confusing and should be changed to gcp_quota_up at the very least. Beorn has configured these values, which prevent it from flapping now:
This seems like a much better solution to me, and we could definitely update the default values to something more reasonable.
We never really did too much investigation into the right values, to be honest, but I think we should probably aim for a higher retry count.
I also wonder if our use of rehttp is actually correct to handle retries on timeout (note this was mostly a copy of the stackdriver exporter), given https://github.com/mintel/gcp-quota-exporter/blob/master/main.go#L129, where we set the status to 503... so it might simply not retry on timeouts at all with this default configuration.
More investigation needed! But thanks for the work.
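For reference, a sketch of a rehttp configuration that would also retry on network timeouts, not only on a 5xx status (untested against our setup; the retry count and delays are illustrative):

```go
// Sketch: a rehttp transport that retries on any 5xx response *and* on network
// timeout errors, not just on a fixed status code.
package main

import (
	"net"
	"net/http"
	"time"

	"github.com/PuerkitoBio/rehttp"
)

func newRetryingClient() *http.Client {
	tr := rehttp.NewTransport(
		http.DefaultTransport,
		rehttp.RetryAll(
			rehttp.RetryMaxRetries(10),
			rehttp.RetryAny(
				rehttp.RetryStatusInterval(500, 600), // any 5xx response
				rehttp.RetryIsErr(func(err error) bool {
					// Retry when the underlying error is a network timeout.
					nerr, ok := err.(net.Error)
					return ok && nerr.Timeout()
				}),
			),
		),
		rehttp.ExpJitterDelay(time.Second, 30*time.Second),
	)
	// http.Client.Timeout caps the *total* time across all retries, so per-attempt
	// timeouts would have to live at the transport/dial level instead.
	return &http.Client{Transport: tr, Timeout: 2 * time.Minute}
}

func main() {
	_ = newRetryingClient() // would be handed to the GCP API client
}
```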
Given that defaultClient just expects an http.Client, we might try to use https://github.com/hashicorp/go-retryablehttp, which is much easier and cleaner and definitely retries on client errors.
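A sketch of what that swap could look like (the retry count and wait bounds are illustrative):

```go
// Sketch: go-retryablehttp exposes a plain *http.Client via StandardClient(),
// so it could be dropped in wherever defaultClient expects an http.Client.
package main

import (
	"net/http"
	"time"

	"github.com/hashicorp/go-retryablehttp"
)

func newRetryableClient() *http.Client {
	rc := retryablehttp.NewClient()
	rc.RetryMax = 10
	rc.RetryWaitMin = 1 * time.Second
	rc.RetryWaitMax = 30 * time.Second
	// The default retry policy retries on connection errors and 5xx responses.
	return rc.StandardClient()
}

func main() {
	_ = newRetryableClient() // would be handed to the GCP API client
}
```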
In our particular case, just setting a much higher timeout and higher retry count might already help
We need to be careful to have default values that match the default Prometheus scrape timeout, or the scraper might time out before the exporter does.
@beorn7 the issue with the up series should be handled in https://github.com/mintel/gcp-quota-exporter/pull/11
With our manually increased timeouts and scrape interval, we are faring quite well so far.
I guess it's best for now to not complicate things and refrain from caching for the time being.
@primeroz As beorn mentioned, we are faring quite well, so there is no reason for me to keep this issue open any longer. WDYT?
I wanted to have a better look at the retry logic, which I don't think is working as expected.
So let's keep this open and I will close it when I am back from holiday?
We monitor our Prometheus scraping and this one causes a bit of noise as the scrape fails every now and then.
The error observed shows that the API is just a bit unresponsive:
Solution with the least amount of work I can come up with: swap MustRegister for Register and don't panic if the collector fails. I think it should suffice for the purpose of collecting the quota data; we can be fairly sure that if the quota API is not available, the quotas are still there. :-) The only thing the Go docs were not clear on: if the collector fails with Register, does it expose the last known metric or nothing at all?
Another solution (would be more work, I use this pattern in another exporter): we could simply run a loop that gathers the metrics (no Register/MustRegister) and have http.ListenAndServe in a goroutine. This just keeps exposing the data as last seen instead of nothing (for the purpose of collecting metrics on quotas or events, this should suffice). It would also simplify the code a lot, as the framework for the Collector is not needed for this.
WDYT?
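A rough sketch of that second pattern (fetchQuotas, the metric name, the project label, and the port are placeholders, not actual exporter code):

```go
// Sketch of the second option: poll the quota API in a loop and update plain
// gauges, while the HTTP endpoint keeps serving the last-seen values.
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var quotaUsage = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "gcp_quota_usage",
	Help: "Last seen usage of a GCP quota.",
}, []string{"project", "metric"})

// fetchQuotas stands in for the real GCP quota API calls.
func fetchQuotas() (map[string]float64, error) {
	return map[string]float64{"compute.googleapis.com/cpus": 42}, nil
}

func main() {
	// The gauges are registered once up front; the API is never called from
	// inside a Collect, so a flaky API cannot fail a scrape.
	prometheus.MustRegister(quotaUsage)

	http.Handle("/metrics", promhttp.Handler())
	go func() {
		log.Fatal(http.ListenAndServe(":8080", nil))
	}()

	for {
		usage, err := fetchQuotas()
		if err != nil {
			log.Printf("quota fetch failed, keeping last seen values: %v", err)
		} else {
			for metric, v := range usage {
				quotaUsage.WithLabelValues("my-project", metric).Set(v)
			}
		}
		time.Sleep(5 * time.Minute)
	}
}
```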