After deploying REANA on a small cluster and testing it, we discovered the following. After running some workflows, we ran a workflow requiring CVMFS and the cluster started to misbehave.
Firstly, we saw that the cluster was not accessible:
$ reana-client ping
Could not connect to the selected REANA cluster server at https://reana-dev.cern.ch/.
Then we saw that this was caused by the Traefik pod being evicted:
$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
...
ingress-traefik-n5xgl 0/1 Evicted 0 34s
...
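As a side note, evicted pods across all namespaces can be spotted in one go; evictions leave the pod in phase Failed, so a rough filter like the following works (it will also match pods that failed for other reasons):
$ kubectl get pods --all-namespaces | grep Evicted
$ kubectl get pods --all-namespaces --field-selector=status.phase=Failed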
Then, once Traefik recovered, CephFS went down, causing the following error:
$ reana-client list
Workflow list could not be retrieved:
root path '/var/reana' does not exist
All of these evictions happened on one of the Kubernetes minions, and the pods reported being evicted because of the following. In the case of the Traefik pods:
$ kubectl describe pod ingress-traefik-n5xgl -n kube-system
...
Message: Pod The node had condition: [DiskPressure].
...
And in the case of the CVMFS provisioners:
$ kubectl describe pod -n kube-system csi-cvmfsplugin-provisioner-5849479885-f9rzr
...
Message: The node was low on resource: ephemeral-storage. Container csi-cvmfsplugin was using 32Ki, which exceeds its request of 0. Container csi-cvmfsplugin-attacher was using 3296Ki, which exceeds its request of 0. Container csi-provisioner was using 56736Ki, which exceeds its request of 0.
...
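To confirm which minion was actually under pressure, the node conditions can be inspected directly; <node-name> below is a placeholder for the affected minion:
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" DiskPressure="}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'
$ kubectl describe node <node-name> | grep -A 8 'Conditions:'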
Possible causes
CVMFS cache causing the node to be under disk pressure.
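This can be checked on the affected minion itself; /var/lib/cvmfs is only the default client cache location (CVMFS_CACHE_BASE), and the csi-cvmfsplugin deployment may place the cache elsewhere:
$ df -h /
$ sudo du -sh /var/lib/cvmfs
Note that the kubelet starts evicting pods once the node filesystem falls below its eviction threshold (by default nodefs.available < 10%), so on very small VMs even a modest CVMFS cache is enough to trigger DiskPressure.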
Possible solutions
There are two ways to look at this:
Outside REANA perspective: the underlying system shouldn't crash like this; a throttling mechanism should stop REANA from causing such a major failure in the Kubernetes cluster (follow-up in INC2249437). See the sketch after this list.
Inside REANA perspective: as a system, we have to accept that certain external conditions will eventually be met, so we should check for them. We already have cluster checks in place which could be extended to cover these needs.
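For the first point, the eviction message above shows that the CSI containers request no ephemeral storage at all, so one possible mitigation (the deployment name is inferred from the pod name, the values are illustrative) is to give them explicit requests and limits, for example:
$ kubectl -n kube-system patch deployment csi-cvmfsplugin-provisioner -p '{"spec":{"template":{"spec":{"containers":[{"name":"csi-provisioner","resources":{"requests":{"ephemeral-storage":"128Mi"},"limits":{"ephemeral-storage":"1Gi"}}}]}}}}'
With a request in place the scheduler accounts for the disk the container will use, and with a limit the offending pod is evicted on its own when it exceeds it, instead of dragging the whole node into DiskPressure. For the second point, the DiskPressure check shown above could be added to the existing cluster checks.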
Closing, as we have discovered that this happens on reana-dev.cern.ch because we use very small VMs. The source of the problem has been identified and it has not happened in either QA or production.