sapcc / go-api-declarations

Reusable declarations for Go types appearing in our APIs
Apache License 2.0

liquid: add support for AZ-aware quotas #37

Closed majewsky closed 4 days ago

majewsky commented 1 week ago

This is becoming a requirement for liquid-ceph: We will have several storage classes grouped into a single resource as the resource's AZ slots. Using placeholder names for illustrative purposes, there might be several storage classes like:

I recommended against modelling those as separate AZ-unaware resources, so these will be grouped into a single resource "3 replicas in the same AZ". The main consequence of this is that LIQUID needs to add support for AZ-aware quotas (instead of just AZ-aware capacity and usage).

So an additional configuration is needed in type ResourceInfo. To avoid turning ResourceInfo into a sea of interconnected booleans, I'm introducing a new enum type ResourceTopology. The existing two behaviors are described by FlatResourceTopology (no AZ-awareness, everything is in AZ any) and AZAwareResourceTopology (AZ-awareness for usage and capacity, but not quota), and the new behavior is AZSeparatedResourceTopology (AZ-awareness for usage, capacity and also quota). As part of this, the previously implicit differentiation between flat and AZ-aware topology becomes explicit, so that Limes can act as a kind of linter for liquid behavior.
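To make the shape of the proposal concrete, here is a minimal sketch of what such an enum could look like. The three constant names are taken from the description above; the string values, the helper methods and the trimmed-down ResourceInfo are assumptions for illustration only and may not match the final declarations.

```go
package main

import "fmt"

// ResourceTopology describes how a resource relates to availability zones.
type ResourceTopology string

const (
	// NOTE: the string values below are assumptions for this sketch.
	FlatResourceTopology        ResourceTopology = "flat"
	AZAwareResourceTopology     ResourceTopology = "az-aware"
	AZSeparatedResourceTopology ResourceTopology = "az-separated"
)

// IsValid reports whether t is one of the three topologies named above.
// (Hypothetical helper, not confirmed by the issue text.)
func (t ResourceTopology) IsValid() bool {
	switch t {
	case FlatResourceTopology, AZAwareResourceTopology, AZSeparatedResourceTopology:
		return true
	default:
		return false
	}
}

// HasAZAwareQuota reports whether quota is tracked per AZ.
// Per the description, only the AZ-separated topology does this.
func (t ResourceTopology) HasAZAwareQuota() bool {
	return t == AZSeparatedResourceTopology
}

// ResourceInfo is a trimmed-down stand-in for the real type;
// only the fields needed for this sketch are shown.
type ResourceInfo struct {
	HasQuota bool
	Topology ResourceTopology
}

func main() {
	info := ResourceInfo{HasQuota: true, Topology: AZSeparatedResourceTopology}
	fmt.Println(info.Topology, info.Topology.IsValid(), info.Topology.HasAZAwareQuota())
}
```

A single enum field keeps the three behaviors mutually exclusive, which a set of independent booleans would not guarantee.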

To give a preview of the implementation scope for Limes, this means that:

  1. When reading any response from a liquid, we need to validate the PerAZ fields against the declared topology. For example, if a resource is declared with AZAwareResourceTopology, a capacity report for AZ any shall be rejected because only known AZs and unknown are allowed for this topology. This is an easy change that is localized to the LIQUID plugin bridge. (In the future, we can think about using the topology to optimize algorithms inside Limes, but that probably won't be in scope for the initial work package. For example, the commitment_is_az_aware config flag can be replaced by a check for the selected topology.)
  2. When writing quota, we need to break down quota values by AZ for resources with AZSeparatedResourceTopology. This is another easy change: We already have the AZ quotas in our DB, we just need to put them in the request.
  3. When collecting usage data, we need to read quota values broken down by AZ for AZSeparatedResourceTopology. This is a slightly larger change that involves adding project_az_resources.backend_quota to the DB schema, but not too bad, either.
  4. The trouble happens when calculating AZ quotas. When applying a base quota, we currently put it in AZ any. But this does not make sense for the storage class scenario described above: Which distinct storage class would you give the quota for storage class any? We cannot divide it between storage classes because SetQuota only sees the numbers for one specific project and does not have any information about capacity distribution. My best idea right now is to treat the configured base quota as applying to each AZ separately for AZSeparatedResourceTopology, but I'll let this problem simmer in my head a bit longer before deciding on a solution. What's clear is that we need some solution because base quota is going to be desirable for Ceph resources eventually.
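The validation from point 1 could be sketched roughly as follows. This is not the actual Limes code: validateReportedAZs is a hypothetical helper, the types are redeclared locally with assumed string values, and treating AZSeparatedResourceTopology the same as AZAwareResourceTopology is an assumption. Checking reported names against the list of known AZs is left out.

```go
package main

import (
	"errors"
	"fmt"
)

type AvailabilityZone string

const (
	AvailabilityZoneAny     AvailabilityZone = "any"
	AvailabilityZoneUnknown AvailabilityZone = "unknown"
)

type ResourceTopology string

const (
	// NOTE: string values are assumptions for this sketch.
	FlatResourceTopology        ResourceTopology = "flat"
	AZAwareResourceTopology     ResourceTopology = "az-aware"
	AZSeparatedResourceTopology ResourceTopology = "az-separated"
)

// validateReportedAZs checks the AZ keys of a PerAZ report against the
// declared topology and returns all violations joined into one error.
func validateReportedAZs(topology ResourceTopology, reported []AvailabilityZone) error {
	var errs []error
	for _, az := range reported {
		switch topology {
		case FlatResourceTopology:
			// Flat resources keep everything in AZ "any".
			if az != AvailabilityZoneAny {
				errs = append(errs, fmt.Errorf("AZ %q is not allowed for topology %q", az, topology))
			}
		case AZAwareResourceTopology, AZSeparatedResourceTopology:
			// Only known AZs and "unknown" are allowed, so "any" is rejected.
			if az == AvailabilityZoneAny {
				errs = append(errs, fmt.Errorf("AZ %q is not allowed for topology %q", az, topology))
			}
		default:
			return fmt.Errorf("unknown topology %q", topology)
		}
	}
	return errors.Join(errs...) // nil when there are no violations
}

func main() {
	err := validateReportedAZs(AZAwareResourceTopology, []AvailabilityZone{"az-one", AvailabilityZoneAny})
	fmt.Println(err)
}
```

Returning an error instead of silently dropping the entry is what would let Limes act as a linter for the liquid.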
majewsky commented 4 days ago

The quota set retrieves its information from the resource table, so the Ceph endpoint should have some awareness to differentiate between AZs. Is that the case? Otherwise I agree on the base quota assignment problem that you stated.

I don't think I understand the question. Let's follow up on this on the phone.

On your first point, you mentioned a capacity report (I presume the request to the liquid and therefore to the service).

Reports are from the liquid to Limes; the opposite direction is a request.

Then, just to clarify: what needs to be done here is simply to check during the scrape whether the resource is truly AZ-aware, right?

Yes, the intended change is that if the resource is AZ-aware, Limes ought to ignore the report for AZ any, but not as a silent error.

I'm not quite sure what you mean by the DB field mentioned in point 3, project_az_resources.backend_quota. Why would this approach be necessary? Can you elaborate a bit further?

Right now, liquids only report quota per project, not per project and AZ. This information is persisted in project_resources.backend_quota, and used to decide when to force a sync of quota from Limes to the liquid. If AZ-aware quota is introduced, the same data needs to be retained on the AZ level.
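As a rough illustration of why the per-AZ value is needed, the sync decision could look something like the sketch below. This is not the actual Limes logic: azsNeedingSync is a hypothetical helper, the maps stand in for the desired AZ quotas and the values that would be persisted in project_az_resources.backend_quota, and treating a missing backend value as "needs sync" is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

type AvailabilityZone string

// azsNeedingSync returns, in sorted order, the AZs whose quota as last
// reported by the liquid (backend) differs from the quota that Limes
// wants to have applied (desired). An AZ without a backend value is
// treated as needing a sync (assumption of this sketch).
func azsNeedingSync(desired, backend map[AvailabilityZone]uint64) []AvailabilityZone {
	var result []AvailabilityZone
	for az, want := range desired {
		got, reported := backend[az]
		if !reported || got != want {
			result = append(result, az)
		}
	}
	sort.Slice(result, func(i, j int) bool { return result[i] < result[j] })
	return result
}

func main() {
	desired := map[AvailabilityZone]uint64{"az-one": 10, "az-two": 20}
	backend := map[AvailabilityZone]uint64{"az-one": 10, "az-two": 15}
	fmt.Println(azsNeedingSync(desired, backend))
}
```

With only a project-level backend_quota, a mismatch confined to a single AZ could go unnoticed whenever the totals happen to agree.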