problame opened this issue 1 year ago (status: Open)
cc @vadim2404 @shanyp @lubennikovaav
Totally agree with adding timeouts and retries.
> To constrain blast radius of timeouts, have a per-tenant loop instead of one global loop.

Can you please elaborate? The reason to have a global loop was to send metrics in batches. IIRC, batching was a requirement from the console (cc @antonyc).

> Also, it would allow us to have per-tenant-configurable reporting intervals.

What is the use case for this?
> > To constrain blast radius of timeouts, have a per-tenant loop instead of one global loop.
>
> Can you please elaborate? The reason to have a global loop was to send metrics in batches. IIRC, batching was a requirement from the console (cc @antonyc).

Hm, I didn't know about the batching requirement. Do you have more background on that?

> What is the use case for this?

Some customers/partners ask for more frequent consumption metrics than other customers/partners.

In practice it does not seem that we would suffer from this.
The current `collect_metrics_iteration` impl has a few deficiencies in how it carries the metrics over to the reporting endpoint.

Background
Console code that implements the reporting endpoint in staging and prod (private repo):
https://github.com/neondatabase/cloud/blob/9d0fd03039449218eecfab63c800943472047db2/goapp/internal/presenters/billingv1prtr/req_usage.go#L11-L39
Pageserver code that calls that endpoint:
https://github.com/neondatabase/neon/blob/b3dd1ba352f035d5a796f389566716389aff0c3a/pageserver/src/consumption_metrics.rs#L259-L306
The gist of the pageserver-side code is:
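Roughly the following (a hedged sketch, not the actual pageserver code; `Event`, `send_metrics`, `CHUNK_SIZE`, and the use of `String` as the error type are stand-ins):

```rust
// Sketch of the reporting loop described above; not the real impl.
const CHUNK_SIZE: usize = 1000; // assumed batch size

#[derive(Debug, Clone)]
struct Event {
    idempotency_key: String, // lets the receiver deduplicate retries
    value: u64,
}

/// Split the collected metrics into chunks and hand each chunk to `post`.
/// In the real code, `post` is an HTTP POST to the console endpoint,
/// currently issued with no timeout and no retry -- the problems below.
fn send_metrics<F: FnMut(&[Event]) -> Result<(), String>>(
    events: &[Event],
    mut post: F,
) -> Result<(), String> {
    for chunk in events.chunks(CHUNK_SIZE) {
        post(chunk)?; // one stuck request blocks every later chunk
    }
    Ok(())
}
```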
Problems
There's no request timeout when sending a chunk. That means that if one request gets stuck, the whole reporting loop stalls and no further metrics get sent, for any tenant.
Constraints
The control plane endpoint doesn't buffer the metrics. It funnels them directly to our billing provider, and will not answer the pageserver's POST request until that has happened.
Apparently there's WIP to change this, but right now, this is where we are.
See https://neondb.slack.com/archives/C033RQ5SPDH/p1686923492898299?thread_ts=1686852148.466139&cid=C033RQ5SPDH .
Solutions
Having no timeout is bad. We should have one, but it should be relatively high because of the above constraints in the control plane. Say we need to send N chunks; then, for sending chunk i, the timeout for the POST request is

`min(1min, max(10s, time_remaining_until_next_iteration(now()) / chunks_remaining))`

or something less complicated. The point is: it needs to be high, because console is effectively just a proxy to an external API right now.

We should have retries as well. The idempotency key is there for a reason; let's use it.
To constrain blast radius of timeouts, have a per-tenant loop instead of one global loop. That way, a stuck POST request only affects one tenant. Also, it would allow us to have per-tenant-configurable reporting intervals.
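A per-tenant loop could look like this (a sketch using plain threads for illustration; the actual pageserver would use its async runtime, and the interval would come from per-tenant config):

```rust
use std::thread;
use std::time::Duration;

/// Sketch: one reporting loop per tenant, so a stuck POST only stalls
/// that tenant's thread. Each handle returns the tenant id it served.
fn spawn_tenant_reporters(
    tenants: Vec<String>,
    _interval: Duration, // per-tenant-configurable in the proposal
) -> Vec<thread::JoinHandle<String>> {
    tenants
        .into_iter()
        .map(|tenant| {
            thread::spawn(move || {
                // Real code: loop { collect and send this tenant's
                // metrics; sleep(_interval) }. The sketch runs once.
                tenant
            })
        })
        .collect()
}
```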
Finally, consumption metrics reporting needs to become a Pageserver SLO. We'll track this in a separate issue.
Just an idea.