kubernetes-sigs / sig-storage-lib-external-provisioner

Apache License 2.0
547 stars 173 forks source link

Failing unit test: TestTopologyParams #160

Closed jsafrane closed 7 months ago

jsafrane commented 11 months ago

Unit test TestTopologyParams/provision_with_selected_node,_but_node_does_not_exist fails frequently with log:

W1221 13:41:19.810378 1289631 shared_informer.go:544] resyncPeriod 100ms is too small. Changing it to the minimum allowed value of 1s
W1221 13:41:19.810457 1289631 shared_informer.go:544] resyncPeriod 100ms is too small. Changing it to the minimum allowed value of 1s
I1221 13:41:19.810509 1289631 leaderelection.go:248] attempting to acquire leader lease default/foo.bar-baz...
I1221 13:41:19.810621 1289631 leaderelection.go:258] successfully acquired lease default/foo.bar-baz
I1221 13:41:19.810940 1289631 event.go:285] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"default", Name:"foo.bar-baz", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' dhcp129-178.brq.redhat.com_aaaa6be1-6281-4dd1-af24-e5928ad20806 became leader
I1221 13:41:19.810976 1289631 controller.go:811] Starting provisioner controller foo.bar/baz_dhcp129-178.brq.redhat.com_aaaa6be1-6281-4dd1-af24-e5928ad20806!
I1221 13:41:19.810985 1289631 event.go:285] Event(v1.ObjectReference{Kind:"Lease", Namespace:"default", Name:"foo.bar-baz", UID:"", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' dhcp129-178.brq.redhat.com_aaaa6be1-6281-4dd1-af24-e5928ad20806 became leader
I1221 13:41:19.911085 1289631 controller.go:860] Started provisioner controller foo.bar/baz_dhcp129-178.brq.redhat.com_aaaa6be1-6281-4dd1-af24-e5928ad20806!
I1221 13:41:19.911127 1289631 controller.go:1366] provision "default/claim-1" class "class-1": started
I1221 13:41:19.911180 1289631 controller.go:1491] provision "default/claim-1" class "class-1": volume rescheduled because: nodes "node-1" not found
I1221 13:41:19.911183 1289631 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"claim-1", UID:"uid-1-1", APIVersion:"v1", ResourceVersion:"0", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' nodes "node-1" not found
I1221 13:41:19.911204 1289631 controller.go:1366] provision "default/claim-1" class "class-1": started
I1221 13:41:19.911213 1289631 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"claim-1", UID:"uid-1-1", APIVersion:"v1", ResourceVersion:"0", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/claim-1"
I1221 13:41:19.961387 1289631 controller.go:1449] provision "default/claim-1" class "class-1": volume "pvc-uid-1-1" provisioned
I1221 13:41:19.961415 1289631 controller.go:1462] provision "default/claim-1" class "class-1": succeeded
I1221 13:41:19.961440 1289631 volume_store.go:212] Trying to save persistentvolume "pvc-uid-1-1"
I1221 13:41:19.961473 1289631 volume_store.go:219] persistentvolume "pvc-uid-1-1" saved
I1221 13:41:19.961524 1289631 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"claim-1", UID:"uid-1-1", APIVersion:"v1", ResourceVersion:"0", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-uid-1-1
--- FAIL: TestTopologyParams (0.20s)
    --- FAIL: TestTopologyParams/provision_with_selected_node,_but_node_does_not_exist (0.20s)
        controller_test.go:971: did not expect a Provision() call but got at least 1
FAIL

The test should check that no provisioning is called when volume.kubernetes.io/selected-node refers to a node that does not exist. However, during the test the annotation is (correctly) removed and then syncClaim is called again, now with the PVC without any selected-node annotation and provisioning proceeds.

This may be actually a serious bug, the library may provision a topology-unaware volume, while topology was required.

git bisect shows that it's most probably caused by commit 438f6650928e0d957b3a330733cea9c45768a527, @sunnylovestiramisu

jsafrane commented 11 months ago

A failed CI run: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_sig-storage-lib-external-provisioner/158/pull-sig-storage-lib-external-provisioner-unit/1718183622202101760

mowangdk commented 11 months ago

A failed CI run: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_sig-storage-lib-external-provisioner/158/pull-sig-storage-lib-external-provisioner-unit/1718183622202101760

Seems like an concurrency issue? It looks like this error only occurs when the first request updates the cache after the second request has passed shouldProvisioner check

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

xing-yang commented 7 months ago

/remove-lifecycle stale