mintel / gcp-quota-exporter

Exports quota limits and usage for Google Cloud Platform services
MIT License

Failure when querying region quotas #10

Open · Duologic opened this issue 4 years ago

Duologic commented 4 years ago

We monitor our Prometheus scraping, and this exporter causes a bit of noise as the scrape fails every now and then.

The error observed shows that the API is just a bit unresponsive:

time="2020-07-22T06:35:55Z" level=error msg="Failure when querying region quotas: Get https://www.googleapis.com/compute/v1/projects/<project>/regions?alt=json&prettyPrint=false: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" source="main.go:73"

The solution requiring the least amount of work that I can come up with: swap MustRegister for Register and don't panic if the collector fails. I think that should suffice for the purpose of collecting the quota data; we can be fairly sure that even if the quota API is not available, the quotas themselves are still there. :-) The one thing the Go docs were not clear on: if the collector fails with Register, does it expose the last known metric or nothing at all?
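A minimal sketch of that first option, with a stub collector and port standing in for whatever main.go actually uses:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// stubCollector stands in for the exporter's real quota collector.
type stubCollector struct{}

func (stubCollector) Describe(ch chan<- *prometheus.Desc) {}
func (stubCollector) Collect(ch chan<- prometheus.Metric) {}

func main() {
	// Register returns an error instead of panicking like MustRegister,
	// so a failed registration is logged and the exporter keeps serving.
	if err := prometheus.Register(stubCollector{}); err != nil {
		log.Printf("failed to register quota collector: %v", err)
	}

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```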

Another solution (more work, but I use this pattern in another exporter): simply run a loop that gathers the metrics (no Register/MustRegister) and have http.ListenAndServe in a goroutine. This keeps exposing the data as last seen instead of nothing (for the purpose of collecting metrics on quotas or events, this should suffice). It would also simplify the code a lot, as the Collector framework is not needed for this.
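And a rough sketch of that second pattern, with a hypothetical fetchQuotas function and an illustrative gauge name; /metrics keeps serving the last gathered values even when a refresh fails:

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// quotaUsage is an illustrative gauge; the real exporter exposes richer
// per-quota limit and usage metrics.
var quotaUsage = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "gcp_quota_usage",
	Help: "Current usage of a GCP quota.",
}, []string{"quota"})

// fetchQuotas is a stand-in for the real GCP API calls.
func fetchQuotas() (map[string]float64, error) {
	return map[string]float64{}, nil
}

func main() {
	prometheus.MustRegister(quotaUsage)

	// Serve whatever was gathered last, regardless of whether the most
	// recent GCP API call succeeded.
	go func() {
		http.Handle("/metrics", promhttp.Handler())
		log.Fatal(http.ListenAndServe(":8080", nil))
	}()

	for {
		if quotas, err := fetchQuotas(); err != nil {
			log.Printf("failed to refresh quotas, keeping last values: %v", err)
		} else {
			for name, usage := range quotas {
				quotaUsage.WithLabelValues(name).Set(usage)
			}
		}
		time.Sleep(5 * time.Minute)
	}
}
```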

WDYT?

beorn7 commented 4 years ago

I have looked at the code a bit and came to a few conclusions.

First of all, I think the MustRegister idea @Duologic has described above is a red herring. I don't think the registration ever fails (even if DescribeByCollect is used – if the collection that is triggered by the registration fails, it will just lead to incomplete registration, which will ultimately not cause any problems because the registry used is not a pedantic registry).

I think the actual problem is the following:

The gcp-quota-exporter exposes its own up metric (https://github.com/mintel/gcp-quota-exporter/blob/master/main.go#L28). This is super weird as it prevents Prometheus from adding the usual synthetic up metric. According to https://prometheus.io/docs/instrumenting/writing_exporters/#failed-scrapes, this should be a metric called gcp_quota_up rather than just up. That would already help to avoid triggering alerts about the target flapping. However, one might also argue that there are not many useful metrics to return if the GCP API times out, so arguably the exporter should return a 5xx and fail the entire scrape. This would again trigger the target-flapping alerts, but now in a "proper" way. In either case, the problem is that the GCP API seems quite unreliable in general, so flapping is inevitable and in any case creates weirdly interrupted time series about the quotas.

Which leads me to the solution I'd like to suggest here: Since quota information is changing only very very slowly, but retrieving it is so exceedingly unreliable, I think it is OK to cache the result and return a cached version if the retrieval is failing. Perhaps then add a metric that contains the timestamp of the last retrieval, so that you can detect overly stale quota information.
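A minimal sketch of that caching idea, with hypothetical names (the real exporter's collector is structured differently): keep the last successful result, fall back to it when a fetch fails, and export the timestamp of the last successful retrieval so staleness can be alerted on.

```go
package exporter

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var lastSuccessfulFetch = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "gcp_quota_last_successful_fetch_timestamp_seconds",
	Help: "Unix time of the last successful quota retrieval from the GCP API.",
})

func init() {
	prometheus.MustRegister(lastSuccessfulFetch)
}

// quotaCache keeps the last successfully retrieved quota data so the
// exporter can keep serving it when a fresh API call fails.
type quotaCache struct {
	mu        sync.Mutex
	last      map[string]float64 // quota name -> value; shape is illustrative
	lastFetch time.Time
}

// get runs fetch and returns its result; on error it falls back to the
// cached copy, so exported series are not interrupted by API hiccups.
// Staleness stays visible through the timestamp gauge above.
func (c *quotaCache) get(fetch func() (map[string]float64, error)) map[string]float64 {
	c.mu.Lock()
	defer c.mu.Unlock()

	data, err := fetch()
	if err != nil {
		return c.last
	}
	c.last = data
	c.lastFetch = time.Now()
	lastSuccessfulFetch.Set(float64(c.lastFetch.Unix()))
	return data
}
```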

https://prometheus.io/docs/instrumenting/writing_exporters/#scheduling has some best practices about this topic, but it doesn't really cover the scenario "usually fast enough for synchronously pulling the source metrics but quite often taking too long".

beorn7 commented 4 years ago

In our particular case, just setting a much higher timeout and higher retry count might already help, but in any case, exposing an up metric is very confusing and should be changed to gcp_quota_up at the very least.

Duologic commented 4 years ago

Thanks @beorn7.

> Which leads me to the solution I'd like to suggest here: Since quota information is changing only very very slowly, but retrieving it is so exceedingly unreliable, I think it is OK to cache the result and return a cached version if the retrieval is failing. Perhaps then add a metric that contains the timestamp of the last retrieval, so that you can detect overly stale quota information.

+1 on the caching, it should mitigate the target-flapping completely.

Duologic commented 4 years ago

Beorn has configured these values, which prevent it from flapping now:

GCP_EXPORTER_MAX_RETRIES=10
GCP_EXPORTER_HTTP_TIMEOUT=2m

primeroz commented 4 years ago

> The gcp-quota-exporter exposes its own up metric (https://github.com/mintel/gcp-quota-exporter/blob/master/main.go#L28). This is super weird as it prevents Prometheus from adding the usual synthetic up metric.

It makes sense. Since we scrape two different APIs (project and region), we could present two gcp_quota_XXX_up metrics and leave the main up one to the Prometheus client library.
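Something along those lines, with suggested (not current) metric names:

```go
package exporter

import "github.com/prometheus/client_golang/prometheus"

// Per-API health gauges, leaving the synthetic "up" metric to Prometheus
// itself. Metric names here are a suggestion only.
var (
	projectQuotaUp = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "gcp_quota_project_up",
		Help: "Whether the last call to the GCP project API succeeded (1) or failed (0).",
	})
	regionsQuotaUp = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "gcp_quota_regions_up",
		Help: "Whether the last call to the GCP regions API succeeded (1) or failed (0).",
	})
)

func init() {
	prometheus.MustRegister(projectQuotaUp, regionsQuotaUp)
}
```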

> Since quota information is changing only very very slowly, but retrieving it is so exceedingly unreliable, I think it is OK to cache the result and return a cached version if the retrieval is failing. Perhaps then add a metric that contains the timestamp of the last retrieval, so that you can detect overly stale quota information.

I am not so sure about this. An issue on the scraper side would look exactly like an issue on the Google API side, and it could potentially last for a long time. Sure, you could use the last-successful-retrieval timestamp metric to alert when it gets too stale ... but is that not just complicating things? We would be returning up=1 when the scrape has actually failed ... that does not seem right to me.

> In our particular case, just setting a much higher timeout and higher retry count might already help, but in any case, exposing an up metric is very confusing and should be changed to gcp_quota_up at the very least. Beorn has configured these values, which prevent it from flapping now:

This seems like a much better solution to me, and we could definitely update the default values to something more reasonable. We never really did too much investigation into the right values, to be honest, but I think we should probably aim for a higher retry count.

I also wonder whether our use of rehttp actually handles retrying on timeouts correctly (note this was mostly a copy of the stackdriver exporter). See https://github.com/mintel/gcp-quota-exporter/blob/master/main.go#L129, where we set the status to 503 ... so it might simply not retry on timeouts at all with this default configuration.
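For comparison, a hedged sketch of a rehttp setup that retries on temporary network errors as well as on 5xx responses; this is not the exporter's current wiring, just an illustration of the knobs rehttp exposes:

```go
package exporter

import (
	"net/http"
	"time"

	"github.com/PuerkitoBio/rehttp"
)

// newRetryingClient is illustrative only; the exporter's actual wiring in
// main.go may differ.
func newRetryingClient(maxRetries int, timeout time.Duration) *http.Client {
	tr := rehttp.NewTransport(
		nil, // nil falls back to http.DefaultTransport
		rehttp.RetryAll(
			rehttp.RetryMaxRetries(maxRetries),
			rehttp.RetryAny(
				rehttp.RetryTemporaryErr(),          // temporary network errors
				rehttp.RetryStatuses(500, 502, 503), // server-side failures
			),
		),
		rehttp.ExpJitterDelay(time.Second, 10*time.Second),
	)
	// http.Client.Timeout spans all retry attempts, so it has to be large
	// enough to cover the retries plus backoff; otherwise the client cancels
	// the request before a retry can happen.
	return &http.Client{Transport: tr, Timeout: timeout}
}
```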

More investigation needed! But thanks for the work.

primeroz commented 4 years ago

Given that defaultClient just expects an http.Client, we might try https://github.com/hashicorp/go-retryablehttp, which is much easier and cleaner and definitely retries on client errors.
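A hedged sketch of what that swap could look like, assuming the surrounding code only needs a plain *http.Client (function name and values are illustrative):

```go
package exporter

import (
	"net/http"
	"time"

	retryablehttp "github.com/hashicorp/go-retryablehttp"
)

// newRetryableClient returns a plain *http.Client backed by
// go-retryablehttp, which by default retries connection errors and most
// 5xx responses with exponential backoff.
func newRetryableClient(maxRetries int, timeout time.Duration) *http.Client {
	rc := retryablehttp.NewClient()
	rc.RetryMax = maxRetries
	rc.RetryWaitMin = 1 * time.Second
	rc.RetryWaitMax = 30 * time.Second

	// StandardClient wraps the retrying client so it can be dropped in
	// wherever an *http.Client is expected.
	client := rc.StandardClient()
	client.Timeout = timeout // overall budget across all attempts
	return client
}
```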

primeroz commented 4 years ago

> In our particular case, just setting a much higher timeout and higher retry count might already help

We need to be careful that the default values fit within Prometheus's default scrape timeout, or Prometheus might time out the scrape before the exporter does.

primeroz commented 4 years ago

@beorn7 the issue with the up series should be handled by https://github.com/mintel/gcp-quota-exporter/pull/11

beorn7 commented 4 years ago

With our manually increased timeouts and scrape interval, we are faring quite well so far.

I guess it's best for now to not complicate things and refrain from caching for the time being.

Duologic commented 4 years ago

@primeroz As beorn mentioned, we are faring quite well, so there is no reason for me to keep this issue open any longer. WDYT?

primeroz commented 4 years ago

I wanted to have a better look at the retry logic, which I don't think is working as expected.

So let's keep this open, and I will close it when I am back from holiday?