reanahub / reana

REANA: Reusable research data analysis platform
https://docs.reana.io
MIT License

deployment: pod evictions on kube-system for cephfs/cvmfs break REANA #259

Closed diegodelemos closed 4 years ago

diegodelemos commented 4 years ago

After deploying REANA on a small cluster and testing it, we discovered the following: after running some workflows, we ran a workflow requiring CVMFS and the cluster started to misbehave.

Firstly, we saw that the cluster was not accessible:

$ reana-client ping         
Could not connect to the selected REANA cluster server at https://reana-dev.cern.ch/. 

Then we saw that this was caused by the Traefik pod being evicted:

$ kubectl  get pods -n kube-system
NAME                                                      READY   STATUS    RESTARTS   AGE
...
ingress-traefik-n5xgl                                     0/1     Evicted   0          34s
...
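
For reference, evicted pods stay in the Failed phase, so a quick way to spot them cluster-wide is to filter by pod phase or by event reason (a sketch; exact selectors may need adjusting):

$ kubectl get pods --all-namespaces --field-selector=status.phase=Failed
$ kubectl get events -n kube-system --field-selector reason=Evicted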

Then, once Traefik recovered, CephFS failed, causing the following error:

$ reana-client list                                                                                        
Workflow list could not be retrieved:                                                            
root path '/var/reana' does not exist 

This again represents a system-wide failure.

Once more, the cause is system pods being evicted:

$  kubectl get pods -n kube-system
NAME                                                      READY   STATUS    RESTARTS   AGE
...
csi-cephfsplugin-72kxs                                    3/3     Running   0          14d
csi-cephfsplugin-provisioner-5478f6cf9b-ff4rq             4/4     Running   0          134m
csi-cephfsplugin-provisioner-5478f6cf9b-fvbxx             0/4     Evicted   0          14d
csi-cephfsplugin-provisioner-5478f6cf9b-hcpvh             0/4     Evicted   0          14d
csi-cephfsplugin-provisioner-5478f6cf9b-md7kd             0/4     Evicted   0          14d
csi-cephfsplugin-provisioner-5478f6cf9b-mr9sk             4/4     Running   0          134m
csi-cephfsplugin-provisioner-5478f6cf9b-q4m8n             4/4     Running   0          135m
csi-cephfsplugin-qmn2s                                    3/3     Running   0          124m
csi-cvmfsplugin-96t7t                                     2/2     Running   0          14d
csi-cvmfsplugin-ffrlw                                     2/2     Running   0          14d
csi-cvmfsplugin-provisioner-5849479885-288wd              3/3     Running   0          135m
csi-cvmfsplugin-provisioner-5849479885-457wg              0/3     Evicted   0          14d
csi-cvmfsplugin-provisioner-5849479885-4rslp              0/3     Evicted   0          14d
csi-cvmfsplugin-provisioner-5849479885-5xj7t              3/3     Running   0          134m
csi-cvmfsplugin-provisioner-5849479885-f9rzr              0/3     Evicted   0          14d
csi-cvmfsplugin-provisioner-5849479885-xmmjz              3/3     Running   0          135m
eosxd-ld9jf                                               1/1     Running   0          12d
eosxd-vqx5p                                               1/1     Running   0          12d
...
monitoring-grafana-8546f5f777-g2bbj                       1/1     Running   0          14d
monitoring-influxdb-5d95b8747f-th9cl                      1/1     Running   0          135m
monitoring-influxdb-5d95b8747f-whjdh                      0/1     Evicted   0          14d
...

All of these evictions happened on one of the Kubernetes minions, and the pods reported being evicted for the following reasons. In the case of the Traefik pod:

$ kubectl describe pod  ingress-traefik-n5xgl -n kube-system
...
Message:            Pod The node had condition: [DiskPressure]. 
...

And in the case of the CVMFS provisioners:

$ kubectl describe pod -n kube-system csi-cvmfsplugin-provisioner-5849479885-f9rzr
...
Message:            The node was low on resource: ephemeral-storage. Container csi-cvmfsplugin was using 32Ki, which exceeds its request of 0. Container csi-cvmfsplugin-attacher was using 3296Ki, which exceeds its request of 0. Container csi-provisioner was using 56736Ki, which exceeds its request of 0.
...
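
In other words, these containers declared no ephemeral-storage request at all, so they were among the first eviction candidates once the node reported DiskPressure. A rough sketch of how one could confirm the node condition and give the provisioner an explicit request (the node name, deployment name, container index and 100Mi value are placeholders, not tested values):

$ kubectl get node <minion-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
$ kubectl -n kube-system patch deployment csi-cvmfsplugin-provisioner --type=json \
    -p='[{"op": "add", "path": "/spec/template/spec/containers/0/resources", "value": {"requests": {"ephemeral-storage": "100Mi"}}}]'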

Possible causes

CVMFS cache causing the node to be under disk pressure.
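
Assuming the default CVMFS cache location (/var/lib/cvmfs) and the standard kubelet directory, something along these lines run on the affected minion should show whether the cache is what fills the disk (the paths are assumptions, adjust to the actual setup):

$ df -h /var/lib/cvmfs /var/lib/kubelet
$ du -sh /var/lib/cvmfs/* 2>/dev/null | sort -h | tail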

Possible solutions

There are two ways to look at this:

diegodelemos commented 4 years ago

Closing, as we have discovered that this happens on reana-dev.cern.ch because we use very small VMs. The source of the problem has been identified, and it has not happened in either QA or production.