See also: the implementation that proxy uses, here: https://github.com/neondatabase/neon/blob/e69ff3fc00ab8be31e8f69eb3726da1b83d84180/libs/metrics/src/hll.rs
(cc @conradludgate)
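For context, here is a minimal sketch of the standard HyperLogLog register update and raw estimate in Rust. This is not the proxy's actual hll.rs code; the hash function, register count, and omitted range corrections are simplifications for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const B: u32 = 6;          // number of register-index bits
const M: usize = 1 << B;   // 2^B = 64 registers

/// Plain HyperLogLog: each register keeps the maximum "rank"
/// (1 + number of trailing zero bits) seen among items hashing into it.
struct Hll {
    registers: [u8; M],
}

impl Hll {
    fn new() -> Self {
        Hll { registers: [0; M] }
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        // DefaultHasher is only for illustration; a real implementation
        // would use a hash with better-distributed bits.
        let mut h = DefaultHasher::new();
        item.hash(&mut h);
        let hash = h.finish();

        let idx = (hash & (M as u64 - 1)) as usize; // low B bits pick the register
        let rank = ((hash >> B).trailing_zeros() + 1) as u8;
        if rank > self.registers[idx] {
            self.registers[idx] = rank;
        }
    }

    /// Raw estimate of the number of distinct items inserted so far
    /// (small/large-range corrections omitted for brevity).
    fn estimate(&self) -> f64 {
        let alpha = 0.709; // bias-correction constant for m = 64
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        alpha * (M as f64) * (M as f64) / sum
    }
}

fn main() {
    let mut hll = Hll::new();
    for page_id in 0..100_000u64 {
        hll.insert(&page_id);
    }
    println!("~distinct pages seen: {:.0}", hll.estimate());
}
```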
I need to update my impl to use rolling. Currently it's sample+clear, but rolling would protect from lost samples, although it smears the results so they have less temporal resolution
Here is a paper about rolling window HLL https://hal.science/hal-00465313/document
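To make the sliding-window idea from that paper concrete, here is a hedged toy sketch (my own, not the paper's algorithm verbatim and not anything in Neon): instead of a single max rank, each register keeps only the (timestamp, rank) pairs that could still be the maximum for some future query, so an estimate can be computed for any window up to a configured maximum.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::{Duration, Instant};

const B: u32 = 6;
const M: usize = 1 << B; // 64 registers

/// Sliding-window HyperLogLog sketch: each register stores the
/// (timestamp, rank) pairs that may still become a window maximum,
/// so the estimate can be answered for any window <= `max_window`.
struct SlidingHll {
    registers: Vec<Vec<(Instant, u8)>>,
    max_window: Duration,
}

impl SlidingHll {
    fn new(max_window: Duration) -> Self {
        SlidingHll { registers: vec![Vec::new(); M], max_window }
    }

    fn insert<T: Hash>(&mut self, item: &T, now: Instant) {
        let mut h = DefaultHasher::new();
        item.hash(&mut h);
        let hash = h.finish();
        let idx = (hash & (M as u64 - 1)) as usize;
        let rank = ((hash >> B).trailing_zeros() + 1) as u8;

        let max_window = self.max_window;
        let list = &mut self.registers[idx];
        // Drop pairs that can never again be a maximum: anything older than
        // the largest supported window, and anything dominated by the new,
        // more recent pair (rank <= new rank).
        list.retain(|&(t, r)| now.duration_since(t) <= max_window && r > rank);
        list.push((now, rank));
    }

    /// Estimate the number of distinct items inserted during the last
    /// `window` (range corrections again omitted for brevity).
    fn estimate(&self, window: Duration, now: Instant) -> f64 {
        let alpha = 0.709; // bias-correction constant for m = 64
        let sum: f64 = self
            .registers
            .iter()
            .map(|list| {
                let max_rank = list
                    .iter()
                    .filter(|&&(t, _)| now.duration_since(t) <= window)
                    .map(|&(_, r)| r)
                    .max()
                    .unwrap_or(0);
                2f64.powi(-(max_rank as i32))
            })
            .sum();
        alpha * (M as f64) * (M as f64) / sum
    }
}

fn main() {
    let now = Instant::now();
    let mut wss = SlidingHll::new(Duration::from_secs(3600));
    for page_id in 0..50_000u64 {
        wss.insert(&page_id, now);
    }
    println!("~distinct pages, 5 min window: {:.0}",
             wss.estimate(Duration::from_secs(300), now));
}
```

The appeal of this shape is that nothing is ever cleared wholesale, so no samples are lost between scrapes, and each per-register list can hold at most one entry per distinct rank value (ranks strictly decrease from oldest to newest), so it stays small.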
I chatted with @knizhnik about this yesterday -- he is now working on https://github.com/neondatabase/neon/pull/7288 / https://github.com/neondatabase/cloud/issues/14202, and can take a look at this afterwards (~ next week)
I will look for sliding-hyperloglog implementations. So far I have found only one, in Java: sliding-hyperloglog
But first I want to clarify how we are going to use this rolling estimation of working set size. Do we need to know how many unique pages were accessed during, say, the last hour? And how can the answer to that question be interpreted? Assume the compute has performed N get_page requests during the last hour and ~M of them are unique. Does that mean the working set size is M? Actually not: if we set the LFC size to M, there is absolutely no guarantee that we can avoid most of the requests to the PS. Maybe we can draw some conclusion from the N/M ratio, but it is not quite clear to me how to do that.
My original assumption is that working set size estimation is needed to answer the question of whether we can fit the whole working set in the LFC or not. For most of our projects this is true. But there certainly can be projects whose working set does not fit on the local disk, no matter how large the disk is and how much space we can provide for the LFC.
Just want to explain my point of view more precisely.
The only case where an estimation of working set size can help is in choosing the LFC policy: should we try to keep the whole working set in the LFC, or is that impossible, in which case we would just limit the LFC size depending on the number of tenants per node and other factors.
@knizhnik I think I understand where you're coming from -- IMO the ideas you mention are out of scope for the time being. Specifically:
There is no need to estimate the working set in order to shrink the LFC. If nobody is contending for this space, we can keep the cached data, hoping that it may be requested at some point. And once somebody needs more space (the LFC of another tenant, or pg_temp), we can shrink the LFC very quickly.
This would be a substantial effort under the current architecture. Right now, we rely on LFC size being proportional to CU in order to keep disk usage low, and don't have the kind of node-level involvement that would make this more feasible.
To provide further context:
But first I want to clarify how we are going to use this rolling estimation of working set size. Do we need to know how many unique pages were accessed during, say, the last hour?
Broadly, we plan to incorporate it into the scaling algorithm. We haven't yet decided what exact usage will look like -- expect both (a) an RFC about it, and (b) testing to refine it.
After considering this for a bit:
We could expose the raw values of the buckets and let the autoscaler-agent aggregate them into a working set size estimate over an arbitrary time window (within however long it's been looking at the metrics). This is roughly how proxy's HLL-based metrics work.
I think that approach will not work as well here -- it leaves the autoscaler-agent (or, some other component?) responsible for storing the metrics, which could be an issue for intervals longer than a few minutes. The lack of pre-aggregation would also make observability for these estimates more difficult (both in internal dashboards, and in customer-facing metrics).
So: IMO the LFC should expose metrics that are similar in spirit to load average -- basically, working set size over a small number of intervals of your choice. To keep it simple, we can start with just one (e.g., past 5 minutes?), and add more as we see fit, based on testing.
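To make the load-average analogy concrete, here is a hypothetical sketch of what the exported metrics could look like: one pre-aggregated estimate per fixed window. The metric name, label, and window set are invented for illustration, and `estimate` stands in for a sliding-HLL query like the one sketched earlier in this thread.

```rust
use std::time::Duration;

/// Hypothetical "load average"-style export: a small fixed set of windows,
/// each mapped to one pre-aggregated working set size estimate.
fn working_set_metrics(estimate: impl Fn(Duration) -> f64) -> String {
    let windows = [
        ("5m", Duration::from_secs(5 * 60)),
        ("15m", Duration::from_secs(15 * 60)),
        ("1h", Duration::from_secs(60 * 60)),
    ];
    windows
        .iter()
        .map(|(label, w)| {
            // Metric name and label are invented for this sketch.
            format!(
                "lfc_estimated_working_set_pages{{window=\"{}\"}} {:.0}\n",
                label,
                estimate(*w)
            )
        })
        .collect()
}

fn main() {
    // Stand-in estimator; a real one would query the sliding HLL.
    let fake_estimate = |w: Duration| (w.as_secs() as f64).sqrt() * 1_000.0;
    print!("{}", working_set_metrics(fake_estimate));
}
```

Starting with a single window, as proposed above, would just mean shrinking the `windows` array to one entry and adding more later.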
in a window of time
Do we need an arbitrary window of time, or is an open-ended window from some point in time good enough?
Now that #8068 has been merged, is this complete?
No, we need to wait until it is deployed, wait one week, and then I will create another PR changing the default version to 1.4.
This week:
The current LFC uses hyperloglog to estimate the working set size, which does not correctly deal with deletes and does not capture the working set size used in a window of time. The working set estimation should be made more accurate by using an algorithm such as rolling hyperloglog.
(discussed briefly at the compute offsite with @knizhnik, and cc @sharnoff)