inveniosoftware / helm-invenio

Helm charts for deploying an Invenio instance
https://helm-invenio.readthedocs.io

Resource requests should be set for ephemeral storage #95

Open · lindhe opened this issue 10 months ago

lindhe commented 10 months ago

At least the web pod and the worker-beat pod use emptyDir volumes. These consume ephemeral storage on the node the pod is scheduled on. Since we have not specified any resource requests or limits for ephemeral storage on the containers, we risk that the pod gets evicted, crashes, or causes resource exhaustion on the node.
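For context, this is roughly what an unconstrained `emptyDir` mount looks like in a pod spec (the names here are illustrative, not the chart's actual volumes):

```yaml
# Illustrative sketch, not taken from the chart: an emptyDir volume with no
# sizeLimit and no ephemeral-storage requests/limits on the container, so
# whatever gets written to it counts against the node's ephemeral storage
# until the kubelet starts evicting pods.
spec:
  containers:
    - name: app
      image: example-image   # placeholder image
      volumeMounts:
        - name: scratch      # placeholder volume name
          mountPath: /scratch
  volumes:
    - name: scratch
      emptyDir: {}
```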

Currently, my pods are getting evicted, and I get a warning when a pod is scheduled on a node with too little ephemeral storage available:

```
$ kubectl get events --field-selector involvedObject.name=worker-beat-7898d974fc-sb9xz
LAST SEEN   TYPE      REASON                   OBJECT                             MESSAGE
46m         Warning   FailedScheduling         pod/worker-beat-7898d974fc-sb9xz   0/6 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 No preemption victims found for incoming pod..
46m         Warning   FailedScheduling         pod/worker-beat-7898d974fc-sb9xz   0/6 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 No preemption victims found for incoming pod..
45m         Normal    Scheduled                pod/worker-beat-7898d974fc-sb9xz   Successfully assigned invenio-dev/worker-beat-7898d974fc-sb9xz to kth-prod-1-worker-7a7516d2-v8vbc
45m         Normal    SuccessfulAttachVolume   pod/worker-beat-7898d974fc-sb9xz   AttachVolume.Attach succeeded for volume "pvc-801c874c-37a9-4520-a0e8-c59606c9d09a"
45m         Normal    Pulling                  pod/worker-beat-7898d974fc-sb9xz   Pulling image "ghcr.io/inveniosoftware/demo-inveniordm/demo-inveniordm@sha256:2193abc2caec9bc599061d6a5874fd2d7d201f55d1673a545af0a0406690e8a4"
44m         Warning   Evicted                  pod/worker-beat-7898d974fc-sb9xz   The node was low on resource: ephemeral-storage. Threshold quantity: 994154920, available: 759960Ki.
44m         Normal    Pulled                   pod/worker-beat-7898d974fc-sb9xz   Successfully pulled image "ghcr.io/inveniosoftware/demo-inveniordm/demo-inveniordm@sha256:2193abc2caec9bc599061d6a5874fd2d7d201f55d1673a545af0a0406690e8a4" in 1m2.20910036s (1m2.209116986s including waiting)
44m         Normal    Created                  pod/worker-beat-7898d974fc-sb9xz   Created container worker-beat
44m         Normal    Started                  pod/worker-beat-7898d974fc-sb9xz   Started container worker-beat
44m         Normal    Killing                  pod/worker-beat-7898d974fc-sb9xz   Stopping container worker-beat
44m         Warning   ExceededGracePeriod      pod/worker-beat-7898d974fc-sb9xz   Container runtime did not kill the pod within specified grace period.
```

I suggest we add resource requests and limits for ephemeral-storage on all containers that use emptyDir. I can whip up a PR for it, but I need your help to identify reasonable sizes to set as the request and limit.
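For illustration, a container-level `resources` block for ephemeral storage would look roughly like this; the sizes below are placeholders, since picking the real values is exactly what I need input on:

```yaml
# Illustrative values only: the actual request/limit sizes still need to be decided.
resources:
  requests:
    ephemeral-storage: "512Mi"
  limits:
    ephemeral-storage: "2Gi"
```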

### Instances of `emptyDir` in deployments
- [ ] https://github.com/inveniosoftware/helm-invenio/blob/bcb4ce1240ec022438407f2a8ae1f5854dc476c5/charts/invenio/templates/web-deployment.yaml#L254-L259
- [ ] https://github.com/inveniosoftware/helm-invenio/blob/bcb4ce1240ec022438407f2a8ae1f5854dc476c5/charts/invenio/templates/web-deployment.yaml#L264-L265
- [ ] https://github.com/inveniosoftware/helm-invenio/blob/bcb4ce1240ec022438407f2a8ae1f5854dc476c5/charts/invenio/templates/web-deployment.yaml#L272-L273
- [ ] https://github.com/inveniosoftware/helm-invenio/blob/bcb4ce1240ec022438407f2a8ae1f5854dc476c5/charts/invenio/templates/worker-deployment.yaml#L291-L293
- [ ] https://github.com/inveniosoftware/helm-invenio/blob/bcb4ce1240ec022438407f2a8ae1f5854dc476c5/charts/invenio/templates/worker-deployment.yaml#L163-L164
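Complementary to the container-level requests and limits, each of the `emptyDir` volumes listed above could also be given a `sizeLimit`, which caps the volume itself and evicts the pod if it is exceeded. A minimal sketch (the size is a placeholder):

```yaml
# Illustrative only: sizeLimit bounds this specific emptyDir volume; the
# appropriate size per volume still needs to be agreed on.
volumes:
  - name: scratch            # placeholder name, not one of the chart's volumes
    emptyDir:
      sizeLimit: "1Gi"
```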
lindhe commented 10 months ago

Here's another example of what it can look like when pods are evicted because they use more resources than are available:

*(screenshot of pod eviction events)*