kubernetes-csi / external-attacher

Sidecar container that watches Kubernetes VolumeAttachment objects and triggers ControllerPublish/Unpublish against a CSI endpoint
Apache License 2.0
169 stars 187 forks source link

failed to ListVolumes with DeadlineExceeded #599

Closed okozachenko1203 closed 2 weeks ago

okozachenko1203 commented 2 weeks ago

I run cinder-csi plugin. I can see this error log every 5mins in csi-attacher container.

E1108 08:48:44.536465       1 controller.go:146] Failed to reconcile volume attachments: failed to ListVolumes: failed to list volumes: rpc error: code = DeadlineExceeded desc = context deadline exceeded

But in csi driver container, I cannot find any error messages at that time.

I1108 08:44:43.382332      11 utils.go:91] [ID:436] GRPC call: /csi.v1.Controller/ListVolumes
I1108 08:44:43.382397      11 controllerserver.go:283] ListVolumes: called with &csi.ListVolumesRequest{MaxEntries:0, StartingToken:"", XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0} request
I1108 08:44:44.529063      11 controllerserver.go:319] ListVolumes: completed with 83 entries and "" next token
I1108 08:45:44.534424      11 utils.go:91] [ID:438] GRPC call: /csi.v1.Controller/ListVolumes
I1108 08:45:44.534482      11 controllerserver.go:283] ListVolumes: called with &csi.ListVolumesRequest{MaxEntries:0, StartingToken:"", XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0} request
I1108 08:46:19.441046      11 controllerserver.go:319] ListVolumes: completed with 83 entries and "" next token
I1108 08:49:44.537303      11 utils.go:91] [ID:443] GRPC call: /csi.v1.Controller/ListVolumes
I1108 08:49:44.537335      11 controllerserver.go:283] ListVolumes: called with &csi.ListVolumesRequest{MaxEntries:0, StartingToken:"", XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0} request
I1108 08:49:45.585384      11 controllerserver.go:319] ListVolumes: completed with 83 entries and "" next token
I1108 08:50:45.593632      11 utils.go:91] [ID:445] GRPC call: /csi.v1.Controller/ListVolumes
I1108 08:50:45.593711      11 controllerserver.go:283] ListVolumes: called with &csi.ListVolumesRequest{MaxEntries:0, StartingToken:"", XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0} request
I1108 08:51:20.522579      11 controllerserver.go:319] ListVolumes: completed with 83 entries and "" next token

Is this false negative?

okozachenko1203 commented 2 weeks ago

We experience DeadlineExceeded for detach operation also. But there is no errors on external controller side and even on openstack side.

I am not sure what this error message means and which part is the problem.

Error processing "csi-ceada84ed1b630ff93671879d8f088378d2dd699437e906cf2ea78ce4cf48262": failed to detach: rpc error: code = DeadlineExceeded desc = context deadline exceeded
jsafrane commented 2 weeks ago

The operations apparently have different timeout in the attacher and in the CSI driver.

I don't know what timeouts you use, the default value of --timeout is 15 seconds. So the attacher issues a ListVolumes / ControllerUnpublish call and waits for 15 seconds for a response and then times out.

On the driver side, it looks like there is no timeout, and in your logs I can see the ListVolumes completes in 35 seconds. That's long past the attacher timeout. Our docs don't list ListVolumes call explicitly, but it behaves the same as ControllerPublish / ControllerUnpublish. The attacher timeout should be long enough to fit most ListVolume calls.

okozachenko1203 commented 2 weeks ago
I1108 08:46:19.441046      11 controllerserver.go:319] ListVolumes: completed with 83 entries and "" next token

@jsafrane here are the args for csi-attacher container in my deployment.

  containers:
  - args:
    - --csi-address=$(ADDRESS)
    - --timeout=3m
    - --leader-election=true
    - --default-fstype=ext4
    - -v=4
    env:
    - name: ADDRESS
      value: /var/lib/csi/sockets/pluginproxy/csi.sock
    image: registry.k8s.io/sig-storage/csi-attacher:v4.2.0

I think this timeout will be applied for all actions in attacher.

jsafrane commented 2 weeks ago

Can you please post complete attacher and corresponding driver logs with at least few ListVolume requests?

The code looks like it should use your 3 minute timeout for ListVolumes. It creates a new 3 min context at the beginning and passes it to ListVolume at the end:

https://github.com/kubernetes-csi/external-attacher/blob/5555d42f2a8dfb3117b690a33dcfd9b53d85bdde/pkg/controller/csi_handler.go#L122-L138

okozachenko1203 commented 2 weeks ago

Can you please post complete attacher and corresponding driver logs with at least few ListVolume requests?

The code looks like it should use your 3 minute timeout for ListVolumes. It creates a new 3 min context at the beginning and passes it to ListVolume at the end:

https://github.com/kubernetes-csi/external-attacher/blob/5555d42f2a8dfb3117b690a33dcfd9b53d85bdde/pkg/controller/csi_handler.go#L122-L138

I pasted csi-attacher and cinder-plugin logs. I filtered Found NodeID xxx in CSINode yyy logs in attacher log.

https://paste.opendev.org/show/blkDBjTauGVIQmtjfozT/ https://paste.opendev.org/show/bBh1zd9lzaaQu4qsKpSa/

okozachenko1203 commented 2 weeks ago

i can see that it was timed out after 3mins exactly in csi-attacher

I1108 12:06:03.004767       1 csi_handler.go:123] Reconciling VolumeAttachments with driver backend state
I1108 12:08:58.096259       1 reflector.go:559] k8s.io/client-go/informers/factory.go:150: Watch close - *v1.CSINode total 8 items received
E1108 12:09:03.005889       1 controller.go:146] Failed to reconcile volume attachments: failed to ListVolumes: failed to list volumes: rpc error: code = DeadlineExceeded desc = context deadline exceeded

In driver, ListVolumes completed in1m30s at 12:07:41.

I1108 12:06:03.005245      11 utils.go:91] [ID:716] GRPC call: /csi.v1.Controller/ListVolumes
I1108 12:06:03.005291      11 controllerserver.go:283] ListVolumes: called with &csi.ListVolumesRequest{MaxEntries:0, StartingToken:"", XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0} request
I1108 12:06:13.692517      11 utils.go:91] [ID:717] GRPC call: /csi.v1.Identity/Probe
I1108 12:07:13.691793      11 utils.go:91] [ID:718] GRPC call: /csi.v1.Identity/Probe
I1108 12:07:41.437185      11 controllerserver.go:319] ListVolumes: completed with 83 entries and "" next token
I1108 12:08:13.694518      11 utils.go:91] [ID:719] GRPC call: /csi.v1.Identity/Probe
I1108 12:09:13.693673      11 utils.go:91] [ID:720] GRPC call: /csi.v1.Identity/Probe
I1108 12:10:03.008628      11 utils.go:91] [ID:721] GRPC call: /csi.v1.Controller/ListVolumes
I1108 12:10:03.008671      11 controllerserver.go:283] ListVolumes: called with &csi.ListVolumesRequest{MaxEntries:0, StartingToken:"", XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0} request
I1108 12:10:03.520247      11 controllerserver.go:319] ListVolumes: completed with 83 entries and "" next token

2 strange things for me

jsafrane commented 2 weeks ago

There is at least one ListVolumes request that took >5 minutes on the CSI driver side:

The attacher:

I1108 12:02:03.001065       1 csi_handler.go:123] Reconciling VolumeAttachments with driver backend state

The driver:

I1108 12:02:03.001581      11 controllerserver.go:283] ListVolumes: called with &csi.ListVolumesRequest{MaxEntries:0, StartingToken:"", XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0} request
...
I1108 12:07:41.437185      11 controllerserver.go:319] ListVolumes: completed with 83 entries and "" next token
okozachenko1203 commented 2 weeks ago

it turned out that attacher works as expected but csi driver has some issue. need to investigate more.

Thanks @jsafrane