lukasboettcher closed this 1 month ago
| Name | Link |
|---|---|
| Latest commit | c58989483bc80dd5294465dd8be74081f0594679 |
| Latest deploy log | https://app.netlify.com/sites/capsule-documentation/deploys/664789c665c5f900081158f5 |
This looks smart and elegant; I'm wondering why we didn't think of it before.
@oliverbaehler I remember Resource Quotas at the Tenant scope have been our Achilles' heel; if the proposed changes prevent the overflow, it's a huge +1 from me.
However, it does not entirely fix the problem described in the issue.
May I ask you to elaborate a bit more? I suspect we have a small time window where the hard quota is not enforced, don't we?
It does not fix it directly; at least I am still able to overflow namespaces with the consecutive execution of two scale commands (see the example after the quota output below). Tenant Quota:
```yaml
---
apiVersion: capsule.clastix.io/v1beta2
kind: Tenant
metadata:
  name: oil
spec:
  owners:
    - name: alice
      kind: User
  resourceQuotas:
    scope: Tenant
    items:
      - hard:
          limits.cpu: "4"
          limits.memory: 4Gi
          requests.cpu: "2"
          requests.memory: 2Gi
      - hard:
          pods: "10"
```
oil-dev:
```
$ kubectl get pod -n oil-dev
NAME                               READY   STATUS    RESTARTS   AGE
nginx-deployment-7784999db-4r6j4   1/1     Running   0          10h
nginx-deployment-7784999db-b7xvr   1/1     Running   0          10h
nginx-deployment-7784999db-hf66h   1/1     Running   0          10h
nginx-deployment-7784999db-qzlrq   1/1     Running   0          10h
nginx-deployment-7784999db-scl84   1/1     Running   0          10h
nginx-deployment-7784999db-zm7ws   1/1     Running   0          10h
```

```yaml
- apiVersion: v1
  kind: ResourceQuota
  metadata:
    annotations:
      quota.capsule.clastix.io/hard-pods: "10"
      quota.capsule.clastix.io/used-pods: "12"
    creationTimestamp: "2024-05-17T21:21:42Z"
    labels:
      capsule.clastix.io/resource-quota: "1"
      capsule.clastix.io/tenant: oil
    name: capsule-oil-1
    namespace: oil-dev
    ownerReferences:
      - apiVersion: capsule.clastix.io/v1beta2
        blockOwnerDeletion: true
        controller: true
        kind: Tenant
        name: oil
        uid: a62d02e8-561d-4a8d-a7de-3e7a78f44d3e
    resourceVersion: "3624"
    uid: 1f03034d-a918-4981-961f-08ffb279e742
  spec:
    hard:
      pods: "6"
  status:
    hard:
      pods: "6"
    used:
      pods: "6"
```
oil-test:
```
$ kubectl get pod -n oil-test
NAME                               READY   STATUS    RESTARTS   AGE
nginx-deployment-7784999db-54j74   1/1     Running   0          10h
nginx-deployment-7784999db-5l2mw   1/1     Running   0          10h
nginx-deployment-7784999db-p4zxn   1/1     Running   0          10h
nginx-deployment-7784999db-pt688   1/1     Running   0          10h
nginx-deployment-7784999db-wm4qt   1/1     Running   0          10h
nginx-deployment-7784999db-x5vc7   1/1     Running   0          10h
```

```yaml
- apiVersion: v1
  kind: ResourceQuota
  metadata:
    annotations:
      quota.capsule.clastix.io/hard-pods: "10"
      quota.capsule.clastix.io/used-pods: "12"
    creationTimestamp: "2024-05-17T21:21:44Z"
    labels:
      capsule.clastix.io/resource-quota: "1"
      capsule.clastix.io/tenant: oil
    name: capsule-oil-1
    namespace: oil-test
    ownerReferences:
      - apiVersion: capsule.clastix.io/v1beta2
        blockOwnerDeletion: true
        controller: true
        kind: Tenant
        name: oil
        uid: a62d02e8-561d-4a8d-a7de-3e7a78f44d3e
    resourceVersion: "3625"
    uid: f79557c9-5bda-4772-8918-28b4b63b2a2f
  spec:
    hard:
      pods: "6"
  status:
    hard:
      pods: "6"
    used:
      pods: "6"
```
But the calculation used here is more consistent, so it's a step in the right direction.
So we somehow need to cover the case where a lot of scaling or resource updates happen at the same time and our controller is too slow to update.
One way to prevent this race condition is to have a webhook that calculates the resources requested for any resource being created or updated, as you already pointed out in earlier comments. Essentially we need this function here:
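(The function link didn't survive in this quote. Purely as an illustration of the idea, and not the actual Capsule code, here is a sketch with hypothetical names, assuming a controller-runtime client: list every pod in the tenant's namespaces, total the requests, and deny admission when the tenant-scope hard value would be exceeded.)

```go
package webhook

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// usedCPURequests lists every pod in the given namespaces and totals
// spec.containers[].resources.requests["cpu"]. Hypothetical helper.
func usedCPURequests(ctx context.Context, c client.Client, namespaces []string) (resource.Quantity, error) {
	var total resource.Quantity
	for _, ns := range namespaces {
		pods := &corev1.PodList{}
		if err := c.List(ctx, pods, client.InNamespace(ns)); err != nil {
			return total, err
		}
		for _, pod := range pods.Items {
			for _, ctr := range pod.Spec.Containers {
				if req, ok := ctr.Resources.Requests[corev1.ResourceCPU]; ok {
					total.Add(req)
				}
			}
		}
	}
	return total, nil
}

// denyIfOverflow is what the validating webhook would call on each CREATE or
// UPDATE: admit only if used + incoming stays within the tenant's hard quota.
func denyIfOverflow(used, incoming, hard resource.Quantity) error {
	used.Add(incoming)
	if used.Cmp(hard) > 0 {
		return fmt.Errorf("tenant quota exceeded: %s requested, hard is %s", used.String(), hard.String())
	}
	return nil
}
```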
Big question mark on the performance impact if we implement it like this.
But I don't think we get around the fact that we need a ValidatingWebhook that validates all the objects; the question is what happens inside the webhook function. I was also thinking about whether we should implement something like a locking mechanism, so that resource quotas are locked while they are being updated or synced.
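For the locking idea, a minimal sketch (plain Go, names hypothetical) of a per-tenant mutex that both the webhook and the sync controller would take around their read-validate-update of the tenant's ResourceQuota objects:

```go
package quota

import "sync"

// tenantLocks hands out one mutex per tenant name, created lazily.
type tenantLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newTenantLocks() *tenantLocks {
	return &tenantLocks{locks: map[string]*sync.Mutex{}}
}

// lock returns the mutex for the given tenant, creating it on first use.
func (t *tenantLocks) lock(tenant string) *sync.Mutex {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.locks[tenant]; !ok {
		t.locks[tenant] = &sync.Mutex{}
	}
	return t.locks[tenant]
}

// Usage in both the webhook and the sync loop:
//   m := locks.lock("oil")
//   m.Lock()
//   defer m.Unlock()
//   // read used values, validate, update ResourceQuota status
```

Note that an in-process mutex only serializes a single controller replica; with several replicas we'd still need optimistic concurrency via resourceVersion on the ResourceQuota objects.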
I can reproduce the problem with quick consecutive scaling, or when creating resources in separate namespaces in a single request, for object-count quotas. For compute resource quotas this seems to hold up. However, I believe that is only because of the order/timing in which these quotas are updated and evaluated, so there is definitely a race condition.
Intends to fix #49