kubernetes-csi / csi-test

CSI test frameworks
Apache License 2.0

Disable gRPC keepalive dial option by default #238

Closed wnxn closed 4 years ago

wnxn commented 4 years ago

What happened:

  1. When I use the csi-sanity test to test my CSI driver over UDS, the gRPC transport is closed during the DeleteSnapshot test case and the sanity test reports a failure for that case.

    ------------------------------
    • Failure [80.008 seconds]
    DeleteSnapshot [Controller Server]
    /Users/wangxin/mygo/src/github.com/kubernetes-csi/csi-test/pkg/sanity/tests.go:44
    should return appropriate values (no optional values added) [It]
    /Users/wangxin/mygo/src/github.com/kubernetes-csi/csi-test/pkg/sanity/controller.go:1909
    
    Unexpected error:
      <*status.statusError | 0xc00001e2d0>: {
          Code: 14,
          Message: "transport is closing",
          Details: nil,
          XXX_NoUnkeyedLiteral: {},
          XXX_unrecognized: nil,
          XXX_sizecache: 0,
      }
      rpc error: code = Unavailable desc = transport is closing
    occurred
    
    /Users/wangxin/mygo/src/github.com/kubernetes-csi/csi-test/pkg/sanity/controller.go:1927
    ------------------------------
  2. After the problem above occurs, the subsequent test cases fail with TransientFailure errors.

    Unexpected error:
      <*status.statusError | 0xc00001e820>: {
          Code: 14,
          Message: "all SubConns are in TransientFailure, latest connection error: <nil>",
          Details: nil,
          XXX_NoUnkeyedLiteral: {},
          XXX_unrecognized: nil,
          XXX_sizecache: 0,
      }
      rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>
    occurred

What you expected to happen:

For the first problem, the test should not report that the transport is closing. For the second problem, the transport closure should not affect the subsequent test cases.

How to reproduce it (as minimally and precisely as possible):

  1. Add time.Sleep(100*time.Second) in ControllerDeleteSnapshot (sketched after this list): https://github.com/kubernetes-csi/csi-test/blob/d305288bda09cadb1b0008958757ac56722bf60a/mock/service/controller.go#L504

  2. To save time, we can use FIt to run only the focused test case: https://github.com/kubernetes-csi/csi-test/blob/d305288bda09cadb1b0008958757ac56722bf60a/pkg/sanity/controller.go#L1909

  3. Run ./hack/e2e.sh
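For reference, a rough sketch of the step 1 change; the struct stub below stands in for the real mock service in mock/service/controller.go, whose deletion logic is elided:

```go
// A sketch of step 1, assuming the mock service's handler has the standard
// CSI ControllerServer signature; the actual deletion logic in
// mock/service/controller.go is elided.
package service

import (
	"context"
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// service stands in for the mock driver's service struct (details elided).
type service struct{}

func (s *service) DeleteSnapshot(ctx context.Context,
	req *csi.DeleteSnapshotRequest) (*csi.DeleteSnapshotResponse, error) {

	// Hold the RPC open much longer than the client's keepalive window,
	// so keepalive pings accumulate while no response data flows.
	time.Sleep(100 * time.Second)

	// ... original deletion logic ...
	return &csi.DeleteSnapshotResponse{}, nil
}
```

For step 2, changing the spec's It(...) to FIt(...) in pkg/sanity/controller.go makes Ginkgo run only that focused spec.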

Output:

...
gRPCCall: {"Method":"/csi.v1.Controller/DeleteVolume","Request":{"volume_id":"14"},"Response":{},"Error":""}
STEP: reusing connection to CSI driver at /tmp/e2e-csi-sanity.sock
STEP: creating mount and staging directories
STEP: creating a volume
STEP: creating a snapshot
STEP: creating a volume from source snapshot
STEP: cleaning up deleting the volume created from snapshot
STEP: cleaning up deleting the snapshot

------------------------------
• Failure [79.925 seconds]
Controller Service [Controller Server]
/root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/tests.go:44
  CreateVolume
  /root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:390
    should create volume from an existing source snapshot [It]
    /root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:722

    Unexpected error:
        <*status.statusError | 0xc000357540>: {
            Code: 14,
            Message: "transport is closing",
            Details: nil,
            XXX_NoUnkeyedLiteral: {},
            XXX_unrecognized: nil,
            XXX_sizecache: 0,
        }
        rpc error: code = Unavailable desc = transport is closing
    occurred

    /root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:762
------------------------------
STEP: reusing connection to CSI driver at /tmp/e2e-csi-sanity.sock
STEP: creating mount and staging directories
• Failure in Spec Setup (BeforeEach) [0.001 seconds]
Controller Service [Controller Server]
/root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/tests.go:44
  CreateVolume
  /root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:390
    should fail when the volume source snapshot is not found [BeforeEach]
    /root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:770

    Unexpected error:
        <*status.statusError | 0xc000357680>: {
            Code: 14,
            Message: "all SubConns are in TransientFailure, latest connection error: <nil>",
            Details: nil,
            XXX_NoUnkeyedLiteral: {},
            XXX_unrecognized: nil,
            XXX_sizecache: 0,
        }
        rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>
    occurred

    /root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:82
------------------------------

Anything else we need to know?:

After my experiments, I believe it is caused by two things:

  1. My CSI driver's DeleteSnapshot RPC call takes a long time, about 100 seconds, under certain circumstances. In my opinion, that is reasonable to allow in an automated test; for example, a storage system may have to ensure that leasing info is ready before deleting a snapshot.

  2. The gRPC keepalive mechanism (see the sketch after the logs below).

    • gRPC server side logs
      ERROR: 2019/11/09 07:11:46 transport: Got too many pings from the client, closing the connection.
      ERROR: 2019/11/09 07:11:46 transport: loopyWriter.run returning. Err: transport: Connection closing
      WARNING: 2019/11/09 07:11:46 transport: http2Server.HandleStreams failed to read frame: read unix //root/csi.sock->@: use of closed network connection
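That server-side error comes from gRPC's keepalive enforcement. A sketch of the mechanism, spelling out grpc-go's documented defaults explicitly (the values illustrate the defaults, not csi-test's actual configuration):

```go
// grpc-go servers enforce a keepalive policy even when none is configured;
// the default is roughly the explicit policy below. While a slow RPC such as
// the 100s DeleteSnapshot is in flight, a client pinging more often than
// MinTime is sent a GOAWAY ("too many pings"), which the client reports as
// "transport is closing".
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func newServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             5 * time.Minute, // default: reject more frequent pings
			PermitWithoutStream: false,           // default: reject pings with no active RPC
		}),
	)
}
```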

For the first problem, I tried removing the keepalive dial option and the transport no longer closed. https://github.com/kubernetes-csi/csi-test/blob/d305288bda09cadb1b0008958757ac56722bf60a/utils/grpcutil.go#L47-L48 So we could add a flag that adds keepalive to the dial options and disable it by default.
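A sketch of what such an opt-in flag could look like; the flag name "enable-keepalive" and its wiring are hypothetical, not an existing csi-test option:

```go
// Hypothetical opt-in flag for keepalive; name and wiring are illustrative.
package sanity

import (
	"flag"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

var enableKeepalive = flag.Bool("enable-keepalive", false,
	"send gRPC keepalive pings on the test connection (disabled by default)")

func dialOptions() []grpc.DialOption {
	opts := []grpc.DialOption{
		grpc.WithInsecure(),
		grpc.WithBlock(),
	}
	if *enableKeepalive {
		opts = append(opts, grpc.WithKeepaliveParams(keepalive.ClientParameters{
			PermitWithoutStream: true, // ping even while no RPC is active
		}))
	}
	return opts
}
```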

For the second problem, we can update the gRPC dependency to the latest released version. https://github.com/grpc/grpc-go/pull/2669

Environment:

wnxn commented 4 years ago

/assign @pohly

pohly commented 4 years ago

Thanks for the detailed analysis, that helped a lot. The keepalive feature was enabled for testing over TCP because some tests in a custom test suite for the PMEM-CSI driver intentionally reboot the target system, and gRPC did not detect that without keepalive.

As the comment says (https://github.com/kubernetes-csi/csi-test/blob/d305288bda09cadb1b0008958757ac56722bf60a/utils/grpcutil.go#L44-L46), that seemed to work fine for Unix domain sockets as well, but as you pointed out here, that isn't quite the case. Let's change this so that keepalive is not enabled for UDS, okay?
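A minimal sketch of that change, assuming the endpoint has already been parsed into a network name the way utils/grpcutil.go does; the helper name is hypothetical:

```go
// Keepalive stays on for TCP, where it detects rebooted peers, but is
// skipped for Unix domain sockets. The helper name is hypothetical.
package utils

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func dialOptionsFor(network string) []grpc.DialOption {
	opts := []grpc.DialOption{
		grpc.WithInsecure(),
		grpc.WithBlock(),
	}
	if network != "unix" {
		opts = append(opts, grpc.WithKeepaliveParams(keepalive.ClientParameters{
			PermitWithoutStream: true,
		}))
	}
	return opts
}
```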

pohly commented 4 years ago

/assign

pohly commented 4 years ago

> Let's change this so that keepalive is not enabled for UDS, okay?

Hmm, but then it still fails for TCP when the driver is slow.

wnxn commented 4 years ago

I did not test it over TCP. Can you reproduce it over TCP?

wnxn commented 4 years ago

I wrote a small program to test this issue and have just reproduced it over TCP: https://github.com/wnxn/gostudy/blob/master/pkg/grpc/csi-client/main.go

  • With the keepalive dial option, the DeleteSnapshot call fails and the program aborts:

    goroutine 1 [running]:
    main.DeleteSnapshot(0x9d4040, 0xc0000100a0)
            /root/mygo/src/github.com/wnxn/gostudy/pkg/grpc/csi-client/main.go:49 +0x1e1
    main.main()
            /root/mygo/src/github.com/wnxn/gostudy/pkg/grpc/csi-client/main.go:28 +0x239


  • After I remove the keepalive dial option, it succeeds:

./main -endpoint "localhost:8000"

I1111 17:14:36.125973 23093 main.go:63]
I1111 17:14:47.833281 23093 main.go:71]
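For context, a minimal sketch of what such a reproducer presumably does; the flag name, snapshot ID, and keepalive values are assumptions, and the linked program is the authoritative version:

```go
// Sketch of a standalone client that reproduces the failure over TCP against
// a driver whose DeleteSnapshot sleeps ~100s. With the keepalive dial option
// the call fails with "transport is closing"; without it, it succeeds.
package main

import (
	"context"
	"flag"
	"log"
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	endpoint := flag.String("endpoint", "localhost:8000", "CSI controller endpoint")
	flag.Parse()

	conn, err := grpc.Dial(*endpoint,
		grpc.WithInsecure(),
		grpc.WithBlock(),
		// Removing this dial option makes the slow call below succeed.
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second,
			PermitWithoutStream: true,
		}),
	)
	if err != nil {
		log.Fatalf("failed to dial: %v", err)
	}
	defer conn.Close()

	client := csi.NewControllerClient(conn)
	log.Print("calling DeleteSnapshot")
	if _, err := client.DeleteSnapshot(context.Background(),
		&csi.DeleteSnapshotRequest{SnapshotId: "snap-1"}); err != nil {
		log.Fatalf("DeleteSnapshot failed: %v", err)
	}
	log.Print("DeleteSnapshot succeeded")
}
```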

pohly commented 4 years ago

> I did not test it over TCP. Can you reproduce it over TCP?

Yes. I first had to enhance the mock driver so that it also allows TCP, but then your reproducer fails in that mode, too.

Re-reading https://godoc.org/google.golang.org/grpc/keepalive#ClientParameters I found a comment (again) about matching it with the server options. I think when I read that the first time, I interpreted it as "timeout values must match" and assumed that the defaults should be fine on the client, as we don't change those on the server. My second reading made me wonder whether it means that keepalive must be enabled explicitly on the server. But I just tried that and it doesn't fix the issue.
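Presumably "enabled explicitly on the server" means something like the permissive settings below; the values are illustrative, and per the comment above, even this did not avoid the failure:

```go
// A permissive server-side keepalive configuration: accept frequent client
// pings, pings without active RPCs, and send server-initiated pings as well.
// Values are illustrative; per the comment above, this did not fix the issue.
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func newPermissiveServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             time.Second, // accept very frequent pings
			PermitWithoutStream: true,        // accept pings with no active RPC
		}),
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    10 * time.Second, // server pings an idle client after 10s
			Timeout: 20 * time.Second, // wait 20s for the ping ack
		}),
	)
}
```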

At this point I am tempted to just turn the canonical client/server example from gRPC into a reproducer and report this upstream.... done: https://github.com/grpc/grpc-go/issues/3171

As a hot-fix we could make keepalive optional, but I'd prefer to fully understand the problem first.

wnxn commented 4 years ago

Thanks for reporting this upstream. I agree with you that we should fully understand the problem. But it seems urgent to release v3.0.0. Shall I fix it by making keepalive optional for now?

pohly commented 4 years ago

I can do that after https://github.com/kubernetes-csi/csi-test/pull/233 is merged. Otherwise we'll just have merge conflicts.