kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Expand petset volume zone spreading #27809

Closed bprashanth closed 6 years ago

bprashanth commented 8 years ago

We got petset disk zone spreading in and it's really useful. However, we left a few TODOs to follow up on:

  1. Don't embed a zone scheduler in the pv provisioner (https://github.com/kubernetes/kubernetes/pull/27553/files#diff-b3d75e3586a2c9a5140cd549861da9c0R2094)
  2. Write a unit test that protects zone spreading from petset implementation changes (https://github.com/kubernetes/kubernetes/pull/27553#discussion_r67625074)
  3. Maybe a multi-az e2e with petset before it goes beta? (a rough manual version is sketched below)
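
A rough manual version of (3), assuming a cluster spread across three zones and the standard zone label being applied to dynamically provisioned PVs, would be to create a petset with volumeClaimTemplates and then eyeball that its PVs landed in distinct zones:

$ # ZONE column should show each PV's zone if the label was applied at provision time
$ kubectl get pv -L failure-domain.beta.kubernetes.io/zone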

@justinsb

chrislovecnm commented 8 years ago

I am having scaling issues with multi-zone PVCs on GCE. We are getting timeout errors with the PVCs. We have 1000 nodes spread across 3 AZs in us-central1.

Here is an example of how long things are taking:

$ kubectl get po
NAME                    READY     STATUS              RESTARTS   AGE
cassandra-analytics-0   1/1       Running             0          12m
cassandra-analytics-1   1/1       Running             0          7m
cassandra-analytics-2   1/1       Running             0          2m
cassandra-analytics-3   0/1       ContainerCreating   0          28s
cassandra-data-0        1/1       Running             0          12m
cassandra-data-1        1/1       Running             0          5m
cassandra-data-2        0/1       ContainerCreating   0          1m

I have seen pod creation take 1-2m when everything is in the same zone, but now across multiple zones it is crawling.

$ kubectl describe po cassandra-data-1
Events:
  FirstSeen LastSeen    Count   From                    SubobjectPath           Type        Reason      Message
  --------- --------    -----   ----                    -------------           --------    ------      -------
  7m        7m      1   {default-scheduler }                            Normal      Scheduled   Successfully assigned cassandra-data-1 to kubernetes-minion-group-x8ev
  5m        5m      1   {kubelet kubernetes-minion-group-x8ev}                  Warning     FailedMount Unable to mount volumes for pod "cassandra-data-1_default(5acd02c7-38ea-11e6-b7d0-42010a800002)": timeout expired waiting for volumes to attach/mount for pod "cassandra-data-1"/"default". list of unattached/unmounted volumes=[cassandra-data]
  5m        5m      1   {kubelet kubernetes-minion-group-x8ev}                  Warning     FailedSync  Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "cassandra-data-1"/"default". list of unattached/unmounted volumes=[cassandra-data]
  3m        3m      1   {kubelet kubernetes-minion-group-x8ev}  spec.containers{cassandra}  Normal      Pulling     pulling image "gcr.io/aronchick-apollobit/cassandra-debian:v1.1"
  3m        3m      1   {kubelet kubernetes-minion-group-x8ev}  spec.containers{cassandra}  Normal      Pulled      Successfully pulled image "gcr.io/aronchick-apollobit/cassandra-debian:v1.1"
  3m        3m      1   {kubelet kubernetes-minion-group-x8ev}  spec.containers{cassandra}  Normal      Created     Created container with docker id 05c294ddd491
  3m        3m      1   {kubelet kubernetes-minion-group-x8ev}  spec.containers{cassandra}  Normal      Started     Started container with docker id 05c294ddd491

I am running the latest 1.3 with https://github.com/kubernetes/kubernetes/pull/27553 patched in. I am also provisioning SSD PVCs, but the volumes are created long before the timeouts hit.

cc: @justinsb @bgrant0607 @saad-ali @erictune

chrislovecnm commented 8 years ago

@saad-ali / @bprashanth

This seems to happen specifically when the Pet / Pod is being deployed in a zone the master is not in. Some spot checking has shown that pods in us-central1-b, the same zone as the master, get deployed quickly, while pods in us-central1-c or us-central1-f hit the Unable to mount volumes for pod error from above.
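
For anyone spot-checking the same thing, something like this shows which zone a given pet landed in (the pod name is just an example, and it assumes the standard failure-domain zone label on nodes):

$ # node the pet was scheduled onto
$ NODE=$(kubectl get po cassandra-data-1 -o jsonpath='{.spec.nodeName}')
$ # that node's zone, via the standard zone label (assumed to be populated)
$ kubectl get node "$NODE" -L failure-domain.beta.kubernetes.io/zone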

chrislovecnm commented 8 years ago

The sync loop in reconciler.go seems to keep retrying for a couple of minutes before the attach shows up:

I0623 05:10:59.821836    3599 kubelet.go:2528] SyncLoop (ADD, "api"): "cassandra-analytics-31_default(dfbe550a-3900-11e6-b7d0-42010a800002)"
I0623 05:10:59.978531    3599 reconciler.go:179] VerifyControllerAttachedVolume operation started for volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002")
I0623 05:10:59.978599    3599 reconciler.go:179] VerifyControllerAttachedVolume operation started for volume "kubernetes.io/secret/default-token-mu6y5" (spec.Name: "default-token-mu6y5") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002")
E0623 05:10:59.980959    3599 goroutinemap.go:155] Operation for "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" failed. No retries permitted until 2016-06-23 05:11:00.480952365 +0000 UTC (durationBeforeRetry 500ms). error: Volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002") is not yet attached according to node status.
I0623 05:11:00.078995    3599 reconciler.go:253] MountVolume operation started for volume "kubernetes.io/secret/default-token-mu6y5" (spec.Name: "default-token-mu6y5") to pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002").
I0623 05:11:00.084887    3599 operation_executor.go:673] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/default-token-mu6y5" (spec.Name: "default-token-mu6y5") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002").
I0623 05:11:00.580234    3599 reconciler.go:179] VerifyControllerAttachedVolume operation started for volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002")
E0623 05:11:00.582425    3599 goroutinemap.go:155] Operation for "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" failed. No retries permitted until 2016-06-23 05:11:01.582420585 +0000 UTC (durationBeforeRetry 1s). error: Volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002") is not yet attached according to node status.

And then it works

I0623 05:15:07.929521    3599 reconciler.go:179] VerifyControllerAttachedVolume operation started for volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002")
I0623 05:15:07.932210    3599 operation_executor.go:897] Controller successfully attached volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002")
I0623 05:15:08.029886    3599 reconciler.go:253] MountVolume operation started for volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") to pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002").

Here is a grep for PVC out of the kubelet.log https://gist.github.com/chrislovecnm/50d722b04f73dcfd92800be48c584efa

This process is taking about 30 secs when the master is in the same zone as the node, otherwise it is taking 2-4 min.

This is a big scaling issue. Compare 30 seconds to 2-4 minutes when you are looking at 1000 PVCs to mount.
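
For what it's worth, the backoff doubling is visible straight from the kubelet log (same log as the gist above). Assuming the default log location on a GCE node, something like this prints the growing retry intervals (500ms, 1s, ... up to the 2m0s cap) for one PVC:

$ # log path is an assumption based on the standard GCE node setup
$ grep "No retries permitted" /var/log/kubelet.log | grep pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002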

chrislovecnm commented 8 years ago

@saad-ali / @bprashanth do you want me to open a separate issue for the scaling challenges?

cc: @bgrant0607

chrislovecnm commented 8 years ago

Now I ran into another issue with this:

I have 1008 minions (I added 8 more because of this problem, which did not help). I am unable to deploy my last two C* instances and am stuck at 998 :( The last pod gets stuck in Pending and I am getting the following errors:

$ kubectl describe  po cassandra-data-498
<redacted for brevity>
fit failure on node (kubernetes-minion-group-cxmy): Insufficient CPU
fit failure on node (kubernetes-minion-group-zc10): Insufficient CPU
fit failure on node (kubernetes-minion-group-iyit): Insufficient CPU
fit failure on node (kubernetes-minion-group-nnhm): Insufficient CPU
fit failure on node (kubernetes-minion-group-2k68): NoVolumeZoneConflict
fit failure on node (kubernetes-minion-group-xfnx): Insufficient CPU
fit failure on node (kubernetes-minion-group-hbxt): Insufficient CPU
fit failure on node (kubernetes-minion-group-srdl): Insufficient CPU
fit failure on node (kubernetes-minion-group-fx7b): Insufficient CPU
fit failure on node (kubernetes-minion-group-26wv): Insufficient CPU
fit failure on node (kubernetes-minion-group-nd2g): Insufficient CPU
fit failure on node (kubernetes-minion-group-n2px): NoVolumeZoneConflict
fit failure on node (kubernetes-minion-group-4ndb): Insufficient CPU
fit failure on node (kubernetes-minion-group-7zf8): Insufficient CPU

I have headroom, as I have 8 new nodes in us-central1-a that have nothing on them.

Ideas? Kinda urgent, as my demo is on Tuesday :( I don't think I have the time to recreate all of the C* instances. I have scaled the petset down by one and back up by one, and K8s keeps putting the volume into zone c. When that happens the pod just stays in Pending.

bprashanth commented 8 years ago

I'm going to guess it's complaining that the pod is asking for a volume in a zone without capacity. Zone spreading is pretty dumb, so you probably have the nodes in the wrong zone: once a volume is created in a zone, the pod can't be scheduled anywhere else. Get the PVs for that pet (it looks like they're already created because you scaled up and down -- kubectl get pv | awk '{print $5}' | grep 498 should show you the PV, and kubectl get pv PVNAME -o yaml | grep -i zone should show you the zone if it exists), then try creating nodes in that zone.
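
Spelled out, and assuming the standard zone label is set on both PVs and nodes (PVNAME is a placeholder for whatever the grep above returns):

$ # which zone the stuck pet's volume was provisioned in
$ kubectl get pv PVNAME -L failure-domain.beta.kubernetes.io/zone
$ # nodes per zone, to see where there is actually schedulable capacity
$ kubectl get nodes -L failure-domain.beta.kubernetes.io/zone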

chrislovecnm commented 8 years ago

Yep ... added another node in the same zone and I got to 1k. I am losing pets because of app problems, and one went into a CrashLoopBackOff state. I deleted the pet, and now I am getting:

$ kubectl describe po cassandra-analytics-112

  FirstSeen LastSeen    Count   From                    SubobjectPath   Type        Reason      Message
  --------- --------    -----   ----                    -------------   --------    ------      -------
  8m        8m      1   {default-scheduler }                    Normal      Scheduled   Successfully assigned cassandra-analytics-112 to kubernetes-minion-group-0jg9
  6m        17s     4   {kubelet kubernetes-minion-group-0jg9}          Warning     FailedMount Unable to mount volumes for pod "cassandra-analytics-112_default(13171b3b-3a50-11e6-b7d0-42010a800002)": timeout expired waiting for volumes to attach/mount for pod "cassandra-analytics-112"/"default". list of unattached/unmounted volumes=[cassandra-analytics]
  6m        17s     4   {kubelet kubernetes-minion-group-0jg9}          Warning     FailedSync  Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "cassandra-analytics-112"/"default". list of unattached/unmounted volumes=[cassandra-analytics]

kubelet.log

E0624 21:23:22.929674    3622 goroutinemap.go:155] Operation for "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-a5f45f53-38f0-11e6-b7d0-42010a800002" failed. No retries permitted until 2016-06-24 21:25:22.92967033 +0000 UTC (durationBeforeRetry 2m0s). error: UnmountVolume.TearDown failed for volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-a5f45f53-38f0-11e6-b7d0-42010a800002" (volume.spec.Name: "cassandra-analytics") pod "6a577c2c-3a4b-11e6-b7d0-42010a800002" (UID: "6a577c2c-3a4b-11e6-b7d0-42010a800002") with: remove /var/lib/kubelet/pods/6a577c2c-3a4b-11e6-b7d0-42010a800002/volumes/kubernetes.io~gce-pd/pvc-a5f45f53-38f0-11e6-b7d0-42010a800002: directory not empty

So I assume that it is behaving as expected, but I am a dork and probably should not have deleted the pet.
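
The path in that error lives on the node itself, so something like this on the node that logged it (path copied from the error above) should show what is still sitting in the volume directory:

$ ls -la /var/lib/kubelet/pods/6a577c2c-3a4b-11e6-b7d0-42010a800002/volumes/kubernetes.io~gce-pd/pvc-a5f45f53-38f0-11e6-b7d0-42010a800002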

Options??

irfanurrehman commented 6 years ago

This issue was labelled only for sig/multicluster and is thus moved over to kubernetes/federation#50. If this does not seem to be right, please reopen this and notify us @kubernetes/sig-multicluster-misc. /close