neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

consumption metrics reporting worker: behavior with slow reporting endpoint #4528

Open · problame opened 1 year ago

problame commented 1 year ago

The current collect_metrics_iteration implementation has a few deficiencies in how it delivers the metrics to the reporting endpoint.

Background

Console code that implements the reporting endpoint in staging and prod (private repo):

https://github.com/neondatabase/cloud/blob/9d0fd03039449218eecfab63c800943472047db2/goapp/internal/presenters/billingv1prtr/req_usage.go#L11-L39

Pageserver code that calls that endpoint:

https://github.com/neondatabase/neon/blob/b3dd1ba352f035d5a796f389566716389aff0c3a/pageserver/src/consumption_metrics.rs#L259-L306

The gist of the pageserver-side code is: it collects consumption metrics for all tenants in one iteration, splits them into chunks, and POSTs each chunk (tagged with idempotency keys) to the endpoint sequentially from a single global loop, with no request timeout and no retries.

Problems

There's no request timeout when sending a chunk, which means that if one request gets stuck:

  1. we won't make progress in reporting the remaining consumption metrics computed in this iteration
  2. we won't start a new iteration, i.e., the new iteration will be late
  3. if we're stuck for long, we're violating billing SLOs

Constraints

The control plane endpoint doesn't buffer the metrics. It funnels them directly to our billing provider, and will not answer the pageserver's POST request until that has happened.

Apparently there's WIP to change this, but right now, this is where we are.

See https://neondb.slack.com/archives/C033RQ5SPDH/p1686923492898299?thread_ts=1686852148.466139&cid=C033RQ5SPDH .

Solutions

Having no timeout is bad. We should have one, but it should be relatively high because of the above constraints in the control plane. Say we need to send N chunks; then, for sending chunk i, the timeout for the POST request is min(1min, max(10s, time_remaining_until_next_iteration(now()) / chunks_remaining)). Or something less complicated. The point is: it needs to be high because console is effectively just a proxy to an external API right now.
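A minimal sketch of that timeout computation in Rust (the function and parameter names here are made up for illustration; the 10s/1min bounds are just the numbers from the formula above):

```rust
use std::time::{Duration, Instant};

/// Per-chunk POST timeout as proposed above: split the time left until the
/// next reporting iteration evenly across the remaining chunks, then clamp
/// the result into [10s, 1min].
fn chunk_timeout(next_iteration_at: Instant, chunks_remaining: u32) -> Duration {
    let remaining = next_iteration_at.saturating_duration_since(Instant::now());
    let per_chunk = remaining / chunks_remaining.max(1);
    per_chunk.clamp(Duration::from_secs(10), Duration::from_secs(60))
}
```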

We should have retries as well. The idempotency key is there for a reason; let's use it.
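A rough sketch of what that could look like with reqwest (the attempt count, the backoff, and the assumption that the idempotency keys live in an already-serialized chunk body are all illustrative, not the existing pageserver code):

```rust
use std::time::Duration;

/// Sketch: POST one chunk with a bounded number of retries, reusing the same
/// serialized body (and therefore the same idempotency keys) on every attempt.
async fn post_chunk_with_retries(
    client: &reqwest::Client,
    url: &str,
    body: String, // chunk already serialized, idempotency keys included
    timeout: Duration,
) -> Result<(), reqwest::Error> {
    let mut last_err = None;
    for attempt in 0..3u32 {
        let result = client
            .post(url)
            .header(reqwest::header::CONTENT_TYPE, "application/json")
            .body(body.clone())
            .timeout(timeout)
            .send()
            .await
            .and_then(|resp| resp.error_for_status());
        match result {
            Ok(_) => return Ok(()),
            Err(e) => {
                last_err = Some(e);
                // simple exponential backoff: 1s, 2s, 4s
                tokio::time::sleep(Duration::from_secs(1u64 << attempt)).await;
            }
        }
    }
    Err(last_err.expect("at least one attempt was made"))
}
```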

To constrain blast radius of timeouts, have a per-tenant loop instead of one global loop. That way, a stuck POST request only affects one tenant. Also, it would allow us to have per-tenant-configurable reporting intervals.
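To make the per-tenant idea concrete, a shape sketch (TenantId, collect_and_report, and reporting_interval are placeholders standing in for pageserver internals, not the actual APIs):

```rust
use std::time::Duration;

// Placeholders standing in for pageserver internals.
#[derive(Clone, Copy, Debug)]
struct TenantId(u64);

fn reporting_interval(_tenant: TenantId) -> Duration {
    // per-tenant-configurable; a default shown here
    Duration::from_secs(600)
}

async fn collect_and_report(_tenant: TenantId) -> anyhow::Result<()> {
    // collect this tenant's metrics and POST them (with timeout + retries)
    Ok(())
}

/// One reporting loop per tenant: a stuck POST request only delays that
/// tenant's metrics, and the interval can differ per tenant.
fn spawn_per_tenant_reporting(tenants: Vec<TenantId>) {
    for tenant in tenants {
        tokio::spawn(async move {
            let mut ticker = tokio::time::interval(reporting_interval(tenant));
            loop {
                ticker.tick().await;
                if let Err(e) = collect_and_report(tenant).await {
                    tracing::warn!("consumption metrics report failed for {tenant:?}: {e}");
                }
            }
        });
    }
}
```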

Finally, consumption metrics reporting needs to become a Pageserver SLO. We'll track this in a separate issue.

Just an idea.

problame commented 1 year ago

cc @vadim2404 @shanyp @lubennikovaav

lubennikovaav commented 1 year ago

Totally agree with adding timeouts and retries.

> To constrain blast radius of timeouts, have a per-tenant loop instead of one global loop.

Can you please elaborate? The reason to have a global loop was to send metrics in batches. IIRC, this was the requirement from console - to batch metrics (cc @antonyc )

> Also, it would allow us to have per-tenant-configurable reporting intervals.

What is the use case for this?

shanyp commented 1 year ago

@problame is this what you had in mind?

problame commented 1 year ago

> > To constrain blast radius of timeouts, have a per-tenant loop instead of one global loop.
>
> Can you please elaborate? The reason to have a global loop was to send metrics in batches. IIRC, this was the requirement from console - to batch metrics (cc @antonyc )

Hm, I didn't know about the batching requirement. Do you have more background on that?

problame commented 1 year ago

> What is the use case for this?

Some customers/partners ask for more frequent consumption metrics than other customers/partners.

koivunej commented 1 year ago

In practice it does not seem that we would suffer from this.