neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

LFC: using rolling hyperloglog for correct working set estimation #7466

Closed: skyzh closed this issue 3 months ago

skyzh commented 6 months ago

The current LFC uses HyperLogLog to estimate the working set size, which does not correctly deal with deletes and does not capture the working set used within a window of time. The estimation could be made more accurate by using an algorithm like rolling HyperLogLog.

(discussed briefly at the compute offsite @knizhnik, and cc @sharnoff)
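For context on why a plain HLL cannot do this: its entire state is a per-bucket maximum, which only ever grows, so a page counted once inflates the estimate forever. A minimal, self-contained sketch (not the LFC's actual code; bucket count and names are illustrative):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const B: u32 = 6; // 2^6 = 64 buckets; real sketches use more
const M: usize = 1 << B;

/// Minimal HyperLogLog: each bucket stores the maximum "rank"
/// (leading-zero count + 1) among the hashes routed to it.
struct Hll {
    registers: [u8; M],
}

impl Hll {
    fn new() -> Self {
        Hll { registers: [0; M] }
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        let mut h = DefaultHasher::new();
        item.hash(&mut h);
        let hash = h.finish();
        let bucket = (hash >> (64 - B)) as usize; // top B bits select the bucket
        let rank = (hash << B).leading_zeros() as u8 + 1;
        // The only mutation is a max(): registers never decrease, so a page
        // counted once stays counted forever. This is why a plain HLL can
        // neither handle deletes nor answer "distinct pages in the last hour".
        self.registers[bucket] = self.registers[bucket].max(rank);
    }

    /// Standard estimator (small/large-range corrections omitted for brevity).
    fn estimate(&self) -> f64 {
        let m = M as f64;
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        0.709 * m * m / sum // alpha constant for m = 64
    }
}

fn main() {
    let mut hll = Hll::new();
    for page_id in 0..10_000u64 {
        hll.insert(&page_id);
    }
    println!("~{:.0} distinct pages", hll.estimate());
}
```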

sharnoff commented 6 months ago

See also: the implementation that proxy uses, here: https://github.com/neondatabase/neon/blob/e69ff3fc00ab8be31e8f69eb3726da1b83d84180/libs/metrics/src/hll.rs

(cc @conradludgate)

conradludgate commented 6 months ago

I need to update my impl to use rolling. Currently it's sample+clear; rolling would protect against lost samples, although it smears the results, so they have less temporal resolution.
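To make the trade-off concrete (a hypothetical sketch, not proxy's actual hll.rs): sample+clear zeroes the registers on every scrape, so a missed scrape drops that interval entirely, while a rolling scheme clears only the oldest time slice, trading lost-sample protection for smeared window boundaries.

```rust
/// "Sample + clear": every scrape hands out the registers and resets them.
/// If one scrape is lost in transit, that interval's data is gone.
struct SampleClear {
    registers: Vec<u8>,
}

impl SampleClear {
    fn scrape(&mut self) -> Vec<u8> {
        std::mem::replace(&mut self.registers, vec![0; 64])
    }
}

/// "Rolling": keep K time slices; on each tick, clear and reuse only the
/// oldest slice. A lost scrape merely delays observation, but an item is
/// attributed to a whole slice rather than an exact time (less temporal
/// resolution, as noted above).
struct Rolling {
    slices: Vec<Vec<u8>>, // K register arrays
    current: usize,
}

impl Rolling {
    fn tick(&mut self) {
        self.current = (self.current + 1) % self.slices.len();
        self.slices[self.current].fill(0);
    }

    /// Readout: union (per-bucket max) of all live slices.
    fn scrape(&self) -> Vec<u8> {
        let m = self.slices[0].len();
        (0..m)
            .map(|i| self.slices.iter().map(|s| s[i]).max().unwrap_or(0))
            .collect()
    }
}

fn main() {
    let mut rolling = Rolling {
        slices: vec![vec![0; 64]; 4], // 4 slices => window of ~4 ticks
        current: 0,
    };
    rolling.slices[0][7] = 5; // pretend a hash with rank 5 hit bucket 7
    rolling.tick(); // the rank survives 3 more ticks, then ages out
    assert_eq!(rolling.scrape()[7], 5);

    let mut sc = SampleClear { registers: vec![0; 64] };
    sc.registers[7] = 5;
    assert_eq!(sc.scrape()[7], 5);
    assert_eq!(sc.scrape()[7], 0); // cleared: if the first scrape is lost, so is the data
}
```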

stradig commented 5 months ago

Here is a paper about rolling-window HLL: https://hal.science/hal-00465313/document
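The core trick in that paper (Chabchoub and Hébrail's sliding HyperLogLog), as I read it: each bucket keeps not a single maximum but a short list of (timestamp, rank) pairs, the "future possible maxima", pruned so that any pair dominated by a newer, larger rank is dropped. The maximum over any recent window can then be recovered; combining bucket maxima into an estimate works as in standard HLL. A rough single-bucket sketch under that reading:

```rust
use std::collections::VecDeque;

/// One HLL bucket in the sliding scheme: (timestamp, rank) pairs kept in
/// insertion order, with ranks strictly decreasing from oldest to newest.
/// A pair dominated by a newer, larger rank can never again be the
/// maximum of any window, so it is dropped immediately.
#[derive(Default)]
struct SlidingBucket {
    lpfm: VecDeque<(u64, u8)>, // "list of future possible maxima"
}

impl SlidingBucket {
    fn record(&mut self, now: u64, rank: u8) {
        // Evict pairs whose rank is <= the new one: the new pair is both
        // fresher and at least as large, so it dominates them.
        while matches!(self.lpfm.back(), Some(&(_, r)) if r <= rank) {
            self.lpfm.pop_back();
        }
        self.lpfm.push_back((now, rank));
    }

    /// Maximum rank observed within the last `window` time units.
    fn max_in_window(&self, now: u64, window: u64) -> u8 {
        self.lpfm
            .iter()
            .filter(|&&(t, _)| t + window >= now)
            .map(|&(_, r)| r)
            .max()
            .unwrap_or(0)
    }
}

fn main() {
    let mut b = SlidingBucket::default();
    b.record(1, 3); // list: [(1, 3)]
    b.record(5, 2); // kept: fresher, ranks still decreasing: [(1, 3), (5, 2)]
    b.record(9, 4); // dominates and evicts both: [(9, 4)]
    assert_eq!(b.max_in_window(10, 5), 4); // only pairs with t >= 5 considered
    assert_eq!(b.max_in_window(10, 100), 4); // older pairs were pruned anyway
}
```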

kelvich commented 5 months ago

I chatted yesterday with @knizhnik about this -- he is now working on https://github.com/neondatabase/neon/pull/7288 / https://github.com/neondatabase/cloud/issues/14202, and can take a look at this afterwards (~ next week)

knizhnik commented 5 months ago

I will look for sliding-hyperloglog implementations. So far I have found only one, in Java: sliding-hyperloglog

But first I want to clarify how we are going to use this rolling estimation of working set size. Do we need to know how many unique pages were accessed during, let's say, the last hour? And how can the answer to this question be interpreted? Assume that compute has performed N get_page requests during the last hour and ~M of them are unique. Does it mean that the working set size is M? Actually not. If we set the LFC size to M, there is absolutely no guarantee that we can avoid most requests to PS. Maybe we can draw some conclusion from the N/M ratio, but it is not quite clear to me how to do it.

My original assumption is that working set size estimation is needed to answer the question of whether we can fit the whole working set in the LFC or not. For most of our projects this is true. But there certainly can be projects whose working set does not fit on the local disk, no matter how large the disk is or how much space we can provide for the LFC.

knizhnik commented 5 months ago

Just want to explain my point of view more precisely.

The only case where an estimation of the working set size can help is choosing the LFC policy: should we try to keep the whole WS in the LFC, or is that impossible, in which case we would just limit the size of the LFC depending on the number of tenants per node and other factors.

sharnoff commented 5 months ago

@knizhnik I think I understand where you're coming from -- IMO the ideas you mention are out of scope for the time being. Specifically:

There is no need to estimate the working set to shrink the LFC. If nobody is claiming this space, we can keep cached data, hoping that it may be requested at some point. And once somebody needs more space (the LFC of another tenant, or pg_temp), we can shrink the LFC very quickly

This would be a substantial effort under the current architecture. Right now, we rely on LFC size being proportional to CU in order to keep disk usage low, and don't have the kind of node-level involvement that would make this more feasible.

To provide further context:

But first I want to clarify how we are going to use this rolling estimation of working set size. Do we need to know how many unique pages were accessed during, let's say, the last hour?

Broadly, we plan to incorporate it into the scaling algorithm. We haven't yet decided what exact usage will look like -- expect both (a) an RFC about it, and (b) testing to refine it.

After considering this for a bit:

We could expose the raw values of the buckets and let the autoscaler-agent aggregate them into a working set size estimate over an arbitrary time window (within however long it's been looking at the metrics). This is roughly how proxy's HLL-based metrics work.
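For readers unfamiliar with this property: raw HLL buckets union losslessly by taking the per-bucket maximum, which is what makes external aggregation possible at all, provided the aggregator retains every snapshot covering the window it cares about. A minimal sketch (hypothetical function, not proxy's code):

```rust
/// Union two HLL register snapshots of equal size. Because each register
/// is a maximum, the element-wise max is exactly the sketch of the union
/// of everything either snapshot saw; folding successive scrapes this way
/// yields an estimate over whatever window the aggregator has retained.
fn merge_registers(a: &[u8], b: &[u8]) -> Vec<u8> {
    assert_eq!(a.len(), b.len(), "sketches must use the same bucket count");
    a.iter().zip(b).map(|(&x, &y)| x.max(y)).collect()
}

fn main() {
    let scrape_1: Vec<u8> = vec![2, 0, 5, 1];
    let scrape_2: Vec<u8> = vec![1, 3, 4, 6];
    assert_eq!(merge_registers(&scrape_1, &scrape_2), vec![2, 3, 5, 6]);
}
```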

I think that approach will not work as well here -- it leaves the autoscaler-agent (or, some other component?) responsible for storing the metrics, which could be an issue for intervals longer than a few minutes. The lack of pre-aggregation would also make observability for these estimates more difficult (both in internal dashboards, and in customer-facing metrics).

So: IMO the LFC should expose metrics that are similar in spirit to load average -- basically, working set size over a small number of intervals of your choice. To keep it simple, we can start with just one (e.g., past 5 minutes?), and add more as we see fit, based on testing.
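To sketch what such a load-average-style interface could look like (purely illustrative; the names and the 5-minute default are assumptions, not a decided design): the LFC pre-aggregates estimates over a fixed handful of windows and exposes them directly, so no external component has to retain history.

```rust
use std::time::Duration;

/// Hypothetical shape of the pre-aggregated metric proposed above: like
/// the kernel's 1/5/15-minute load averages, the LFC would expose one
/// working-set estimate per fixed window.
struct WorkingSetMetrics {
    windows: Vec<Duration>,
}

impl WorkingSetMetrics {
    /// `query` stands in for a sliding-HLL estimate over a given window.
    fn estimates(&self, query: impl Fn(Duration) -> f64) -> Vec<(Duration, f64)> {
        self.windows.iter().map(|&w| (w, query(w))).collect()
    }
}

fn main() {
    let metrics = WorkingSetMetrics {
        // Start with a single window (e.g. the past 5 minutes), adding
        // more later as testing suggests.
        windows: vec![Duration::from_secs(300)],
    };
    // Stubbed query: pretend every window saw ~10k distinct pages.
    for (w, pages) in metrics.estimates(|_w: Duration| 10_000.0) {
        println!("working set over {:?}: ~{:.0} pages", w, pages);
    }
}
```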

MMeent commented 5 months ago

in a window of time

Do we need an arbitrary window of time, or is an open-ended window from some point in time good enough?

sharnoff commented 4 months ago

Now that #8068 has been merged, is this complete?

knizhnik commented 4 months ago

No, we need to wait until it is deployed, wait one week, and then I will create another PR changing the default version to 1.4

ololobus commented 4 months ago

This week:

knizhnik commented 4 months ago

https://github.com/neondatabase/neon/pull/8405