vmware / govmomi

Go library for the VMware vSphere API
Apache License 2.0
2.27k stars 897 forks source link

[BUG] Regression when moving from Task.Wait (now deprecated) to Task.WaitEx #3394

Closed fabriziopandini closed 3 months ago

fabriziopandini commented 3 months ago

Describe the bug After upgrading govmomi to v0.36.1 and moving from Task.Wait (now deprecated) to Task.WaitEx we are seeing multiple deadlocks when creating more than 20 machines in parallel, with concurrency 10.

All the deadlock looks like:

snippet of a thread in deadlock from a goroutine dump ``` goroutine 334 [select, 5 minutes]: net/http.(*persistConn).roundTrip(0x40013e46c0, 0x400079c840) /usr/local/go/src/net/http/transport.go:2652 +0x8ec net/http.(*Transport).roundTrip(0x400057f680, 0x40020dbd00) /usr/local/go/src/net/http/transport.go:604 +0xaa8 net/http.(*Transport).RoundTrip(0x400057f680, 0x40020dbc00) /usr/local/go/src/net/http/roundtrip.go:17 +0x3c net/http.send(0x40020dbc00, {0x325d6c0, 0x400057f680}, {0x0, 0x0, 0x0}) /usr/local/go/src/net/http/client.go:260 +0x3f4 net/http.(*Client).send(0x40006c62c0, 0x40020dbc00, {0x0, 0x0, 0x0}) /usr/local/go/src/net/http/client.go:181 +0xe0 net/http.(*Client).do(0x40006c62c0, 0x40020dbc00) /usr/local/go/src/net/http/client.go:724 +0xeac net/http.(*Client).Do(0x40006c62c0, 0x40020dbc00) /usr/local/go/src/net/http/client.go:590 +0x3c github.com/vmware/govmomi/vim25/soap.(*Client).Do(0x40006c62c0, {0x3285cc0, 0x40021130e0}, 0x40020dbb00, 0x4001131818) /Users/fpandini/go/pkg/mod/github.com/vmware/govmomi@v0.36.1/vim25/soap/client.go:590 +0x214 github.com/vmware/govmomi/vim25/soap.(*Client).soapRoundTrip(0x40006c62c0, {0x3285cc0, 0x40021130e0}, {0x325f2e0, 0x4001d4f0f8}, {0x325f2e0, 0x4001d4f110}) /Users/fpandini/go/pkg/mod/github.com/vmware/govmomi@v0.36.1/vim25/soap/client.go:704 +0xb70 github.com/vmware/govmomi/vim25/soap.(*Client).RoundTrip(0x40006c62c0, {0x3285cc0, 0x40021130e0}, {0x325f2e0, 0x4001d4f0f8}, {0x325f2e0, 0x4001d4f110}) /Users/fpandini/go/pkg/mod/github.com/vmware/govmomi@v0.36.1/vim25/soap/client.go:651 +0xac github.com/vmware/govmomi/session/keepalive.(*HandlerSOAP).RoundTrip(0x4000a9bc20, {0x3285cc0, 0x40021130e0}, {0x325f2e0, 0x4001d4f0f8}, {0x325f2e0, 0x4001d4f110}) /Users/fpandini/go/pkg/mod/github.com/vmware/govmomi@v0.36.1/session/keepalive/handler.go:163 +0xe8 github.com/vmware/govmomi/vim25.(*Client).RoundTrip(0x4000682600, {0x3285cc0, 0x40021130e0}, {0x325f2e0, 0x4001d4f0f8}, {0x325f2e0, 0x4001d4f110}) /Users/fpandini/go/pkg/mod/github.com/vmware/govmomi@v0.36.1/vim25/client.go:129 +0x6c github.com/vmware/govmomi/vim25/methods.WaitForUpdatesEx({0x3285cc0, 0x40021130e0}, {0x325e000, 0x4000682600}, 0x400164d810) /Users/fpandini/go/pkg/mod/github.com/vmware/govmomi@v0.36.1/vim25/methods/methods.go:18599 +0x144 github.com/vmware/govmomi/property.(*Collector).WaitForUpdatesEx(0x400164d400, {0x3285cc0, 0x40021130e0}, {0x0, 0x1, 0x0}, 0x4001132228) /Users/fpandini/go/pkg/mod/github.com/vmware/govmomi@v0.36.1/property/collector.go:273 +0x274 github.com/vmware/govmomi/property.WaitForUpdatesEx({0x3285cc0, 0x40021130e0}, 0x400164d400, 0x4001132268, 0x4001132228) /Users/fpandini/go/pkg/mod/github.com/vmware/govmomi@v0.36.1/property/wait.go:173 +0x1cc github.com/vmware/govmomi/task.WaitEx({0x3285cc0, 0x40021130e0}, {{0x4001d48a00, 0x4}, {0x4001d48a20, 0x8}, {0x0, 0x0}}, 0x400164d400, {0x0, ...}) /Users/fpandini/go/pkg/mod/github.com/vmware/govmomi@v0.36.1/task/wait.go:137 +0x38c github.com/vmware/govmomi/object.(*Task).WaitForResultEx(0x400164d3b0, {0x3285cc0, 0x40021130e0}, {0x40011324f0, 0x1, 0x1}) /Users/fpandini/go/pkg/mod/github.com/vmware/govmomi@v0.36.1/object/task.go:93 +0x15c github.com/vmware/govmomi/object.(*Task).WaitEx(0x400164d3b0, {0x3285cc0, 0x40021130e0}) /Users/fpandini/go/pkg/mod/github.com/vmware/govmomi@v0.36.1/object/task.go:83 +0x9c sigs.k8s.io/cluster-api-provider-vsphere/test/infrastructure/vcsim/controllers.(*vmIPReconciler).ReconcileIP(0x4000c36bc0, {0x3285cc0, 0x40021130e0}) /Users/fpandini/go/src/sigs.k8s.io/cluster-api-provider-vsphere/test/infrastructure/vcsim/controllers/vmip_controller.go:82 +0x73c sigs.k8s.io/cluster-api-provider-vsphere/test/infrastructure/vcsim/controllers.(*VirtualMachineReconciler).reconcileNormal(0x40001b5860, {0x3285cc0, 0x40021130e0}, 0x4001b8cd00, 0x4001a62f00, 0x4000c11500, 0x4000b38a80) /Users/fpandini/go/src/sigs.k8s.io/cluster-api-provider-vsphere/test/infrastructure/vcsim/controllers/virtualmachine_controller.go:232 +0x70 sigs.k8s.io/cluster-api-provider-vsphere/test/infrastructure/vcsim/controllers.(*VirtualMachineReconciler).Reconcile(0x40001b5860, {0x3285cc0, 0x40021130e0}, {{{0x4001ebaf10, 0x7}, {0x4001edd340, 0x19}}}) /Users/fpandini/go/src/sigs.k8s.io/cluster-api-provider-vsphere/test/infrastructure/vcsim/controllers/virtualmachine_controller.go:227 +0x14cc sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x400089d180, {0x3285cc0, 0x4002112de0}, {{{0x4001ebaf10, 0x7}, {0x4001edd340, 0x19}}}) /Users/fpandini/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:119 +0x168 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0x400089d180, {0x3285cc0, 0x4002112de0}, {0x2c55b40, 0x40009e2460}) /Users/fpandini/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:316 +0x3bc sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0x400089d180, {0x3285cf8, 0x4000467540}) /Users/fpandini/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266 +0x33c sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2() /Users/fpandini/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227 +0xd0 created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 201 /Users/fpandini/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:223 +0x7bc ```

I can provide the full goroutine dump if necessary

To Reproduce We detected this while testing locally CAPV in supervisor mode, with vcsim as a target environment. From an empiric observation, the deadlock happens on an average of 10 Machines over 20 (50% of cases), in some cases up to 15/20.

We also tested without keepalive, but it doesn't seems having impact on the issue

Expected behavior No deadlock to happen

Affected version govmomi to v0.36.1

Screenshots/Debug Output If applicable, add screenshots or debug output to help explain your problem.

Additional context From a quick search we assume that the regression has been introduced by this commit cc @akutz

If this is correct, then we would like to know what is the best course of action, considering that the commit above seems to be changing also the implementation of task.WaitForUpdates.

a. Rollback to govimomi v0.34.2 (latest release in use before the change) b. Keep using v0.36.1 but rollback from Task.WaitEx/WaitForUpdatesEx to Task.Wait/WaitForUpdates (assuming we got confirmation that the changes from the commit above do not introduce deadlock) c. Wait for a fix - in case it could be interesting an ETA to verify impacts on upcoming CAPV releases -

cc @chrischdi @sbueringer

dougm commented 3 months ago

Rollback to Task.Wait/WaitForUpdates should avoid the deadlock, as those methods create a thread-safe PropertyCollector instance per-call, as they had always done. The new Ex versions use the same PropertyCollector instance in each call, which may require client-side changes to be thread-safe. (Goal was reducing resource usage on the vCenter side) Did CAPV just change to the Ex method names? We may want to reconsider the deprecation, if this change was just prompted by linter warnings.

sbueringer commented 3 months ago

Did CAPV just change to the Ex method names? We may want to reconsider the deprecation, if this change was just prompted by linter warnings.

Yes we moved to the *Ex methods because of linter findings. Otherwise we probably wouldn't have noticed (https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/pull/2735)

We didn't see godoc on the methods, just // Deprecated: Please use WaitEx instead.. So we assumed it's safe to just use the new methods instead.

fabriziopandini commented 3 months ago

@dougm thanks for the quick answer! I think removing the deprecation and adding some comments helping in understanding when to use one or the other method will be great. I will give a try to it ASAP if someone doesn't get to it before me

fabriziopandini commented 3 months ago

/assign