After deploying REANA on a small cluster and testing it, we discovered the following. After running some workflows, we ran a workflow requiring CVMFS and the cluster started to misbehave.
Firstly, we saw that the cluster was not accessible:
$ reana-client ping
Could not connect to the selected REANA cluster server at https://reana-dev.cern.ch/.
Then we saw that this was caused by the Traefik pod being evicted:
$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
...
ingress-traefik-n5xgl 0/1 Evicted 0 34s
...
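As a side note, evicted pods across all namespaces can be spotted in one go; evictions leave the pod in phase Failed, so a rough filter like the following works (it will also match pods that failed for other reasons):
$ kubectl get pods --all-namespaces | grep Evicted
$ kubectl get pods --all-namespaces --field-selector=status.phase=Failed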
Then, once Traefik recovered, CephFS went down, causing the following error:
$ reana-client list
Workflow list could not be retrieved:
root path '/var/reana' does not exist
All of these evictions happened on one of the Kubernetes minions, and the pods reported being evicted because of the following. In the case of the Traefik pods:
$ kubectl describe pod ingress-traefik-n5xgl -n kube-system
...
Message: Pod The node had condition: [DiskPressure].
...
And in the case of the CVMFS provisioners:
$ kubectl describe pod -n kube-system csi-cvmfsplugin-provisioner-5849479885-f9rzr
...
Message: The node was low on resource: ephemeral-storage. Container csi-cvmfsplugin was using 32Ki, which exceeds its request of 0. Container csi-cvmfsplugin-attacher was using 3296Ki, which exceeds its request of 0. Container csi-provisioner was using 56736Ki, which exceeds its request of 0.
...
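To confirm which minion was actually under pressure, the node conditions can be inspected directly; <node-name> below is a placeholder for the affected minion:
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" DiskPressure="}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'
$ kubectl describe node <node-name> | grep -A 8 'Conditions:'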
Possible causes
CVMFS cache causing the node to be under disk pressure.
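This can be checked on the affected minion itself; /var/lib/cvmfs is only the default client cache location (CVMFS_CACHE_BASE), and the csi-cvmfsplugin deployment may place the cache elsewhere:
$ df -h /
$ sudo du -sh /var/lib/cvmfs
Note that the kubelet starts evicting pods once the node filesystem falls below its eviction threshold (by default nodefs.available < 10%), so on very small VMs even a modest CVMFS cache is enough to trigger DiskPressure.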
Possible solutions
There are two ways to look at this:
Outside REANA perspective: the underlying system shouldn't crash like this; a throttling mechanism should stop REANA from causing such a major failure in the Kubernetes cluster (follow-up in INC2249437). See the sketch after this list.
Inside REANA perspective: as a system, we have to accept that certain external conditions will eventually be met, so we should check for them. We already have cluster checks in place which could be extended to cover these needs.
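For the first point, the eviction message above shows that the CSI containers request no ephemeral storage at all, so one possible mitigation (the deployment name is inferred from the pod name, the values are illustrative) is to give them explicit requests and limits, for example:
$ kubectl -n kube-system patch deployment csi-cvmfsplugin-provisioner -p '{"spec":{"template":{"spec":{"containers":[{"name":"csi-provisioner","resources":{"requests":{"ephemeral-storage":"128Mi"},"limits":{"ephemeral-storage":"1Gi"}}}]}}}}'
With a request in place the scheduler accounts for the disk the container will use, and with a limit the offending pod is evicted on its own when it exceeds it, instead of dragging the whole node into DiskPressure. For the second point, the DiskPressure check shown above could be added to the existing cluster checks.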
Closing, as we have discovered that this happens on reana-dev.cern.ch because we use very small VMs. The source of the problem has been identified and it has not happened in either QA or production.