/assign @pohly
Thanks for the detailed analysis, that helped a lot. The keepalive feature was enabled for testing over TCP because we have some tests in a custom test suite in the PMEM-CSI driver where we intentionally reboot the target system, and gRPC did not detect that without keepalive.
As the comment says (https://github.com/kubernetes-csi/csi-test/blob/d305288bda09cadb1b0008958757ac56722bf60a/utils/grpcutil.go#L44-L46), that seemed to work fine also for Unix Domain Sockets, but as you pointed out here, that isn't quite the case. Let's change this so that keepalive is not enabled for UDS, okay?
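Concretely, I am thinking of something like this (a minimal sketch, not the actual utils/grpcutil.go code; the helper name and the assumption that the endpoint scheme has already been parsed are mine):

```go
package utils

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// dialOptions is a hypothetical helper: it assumes the caller has already
// split the endpoint into a scheme ("tcp" or "unix") and an address.
func dialOptions(scheme string) []grpc.DialOption {
	opts := []grpc.DialOption{
		grpc.WithInsecure(),
		grpc.WithBlock(),
	}
	if scheme == "tcp" {
		// Keep keepalive for TCP so that a rebooted target system is
		// detected; skip it entirely for Unix Domain Sockets.
		opts = append(opts, grpc.WithKeepaliveParams(keepalive.ClientParameters{
			PermitWithoutStream: true,
		}))
	}
	return opts
}
```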
/assign
Let's change this so that keepalive is not enabled for UDS, okay?
Hmm, but then it still fails for TCP when the driver is slow.
I did not test it for TCP. Can you reproduce it over TCP?
I wrote a small program to test this issue and have just reproduced it over TCP: https://github.com/wnxn/gostudy/blob/master/pkg/grpc/csi-client/main.go
# ./main -endpoint "tcp://localhost:8000"
E1111 17:05:28.727709 12508 main.go:46] rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: address tcp://localhost:8000: too many colons in address"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x814ac1]
goroutine 1 [running]:
main.DeleteSnapshot(0x9d4040, 0xc0000100a0)
	/root/mygo/src/github.com/wnxn/gostudy/pkg/grpc/csi-client/main.go:49 +0x1e1
main.main()
	/root/mygo/src/github.com/wnxn/gostudy/pkg/grpc/csi-client/main.go:28 +0x239
- After I remove the keepalive dial option, it succeeds.
I1111 17:14:36.125973 23093 main.go:63]
I1111 17:14:47.833281 23093 main.go:71]
I did not test it for TCP. Can you reproduce it over TCP?
Yes. I had to enhance the mock driver first such that it also allows TCP, but then your reproducer also fails in that mode.
Re-reading https://godoc.org/google.golang.org/grpc/keepalive#ClientParameters I found a comment (again) about matching it with the server options. I think when I read that the first time, I interpreted it as "timeout values must match" and assumed that the defaults should be fine on the client, as we don't change those on the server. My second reading made me wonder whether it means that keepalive must be enabled explicitly on the server. But I just tried that and it doesn't fix the issue.
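For reference, enabling it on the server side looked roughly like this (a sketch; the concrete durations are guesses on my part, and as said, it did not fix the issue):

```go
package sketch

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func newServer() *grpc.Server {
	return grpc.NewServer(
		// Accept client pings even without active streams and allow them
		// as often as every 10 seconds instead of the 5 minute default.
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             10 * time.Second,
			PermitWithoutStream: true,
		}),
		// Also ping from the server side, mirroring the client settings.
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    10 * time.Second,
			Timeout: 5 * time.Second,
		}),
	)
}
```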
At this point I am tempted to just turn the canonical client/server example from gRPC into a reproducer and report this upstream.... done: https://github.com/grpc/grpc-go/issues/3171
As a hot-fix we could make keepalive optional, but I'd prefer to fully understand the problem first.
Thanks for reporting this upstream. I agree with you that we should fully understand the problem first, but releasing v3.0.0 seems urgent. Shall I fix it by making keepalive optional for now?
I can do that after https://github.com/kubernetes-csi/csi-test/pull/233 got merged. Otherwise we'll just have merge conflicts.
What happened:
When I use the csi-sanity test to test my CSI driver over UDS, the gRPC transport is closed during the DeleteSnapshot test case and the sanity test reports that test case as failed.
After this problem occurs, the subsequent test cases fail because the connection stays in TransientFailure.
What you expected to happen:
For the first problem, the test should not report that the transport is closing. For the second problem, a closed transport should not affect subsequent test cases.
How to reproduce it (as minimally and precisely as possible):
1. Add time.Sleep(100 * time.Second) in ControllerDeleteSnapshot: https://github.com/kubernetes-csi/csi-test/blob/d305288bda09cadb1b0008958757ac56722bf60a/mock/service/controller.go#L504
2. To save time, use FIt to run a focused test case (see the sketch after these steps): https://github.com/kubernetes-csi/csi-test/blob/d305288bda09cadb1b0008958757ac56722bf60a/pkg/sanity/controller.go#L1909
3. Run ./hack/e2e.sh
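For reference, the two edits look roughly like this (a sketch with simplified types and a made-up test description; the real code in mock/service/controller.go and pkg/sanity/controller.go uses the CSI request/response types and the existing It() texts):

```go
package sketch

import (
	"context"
	"time"

	. "github.com/onsi/ginkgo"
)

// (1) Make the mock driver's DeleteSnapshot artificially slow so that the
// RPC is still pending when the client-side keepalive fires.
type deleteSnapshotRequest struct{ SnapshotId string }
type deleteSnapshotResponse struct{}

func deleteSnapshot(ctx context.Context, req *deleteSnapshotRequest) (*deleteSnapshotResponse, error) {
	time.Sleep(100 * time.Second) // simulate a slow storage backend
	return &deleteSnapshotResponse{}, nil
}

// (2) Change It(...) to FIt(...) in the sanity test so that only the
// interesting case runs.
var _ = Describe("DeleteSnapshot [Controller Server]", func() {
	FIt("should delete the snapshot even when the driver is slow", func() {
		// existing test body unchanged
	})
})
```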
Output:
Anything else we need to know?:
After my experiments, I believe this is caused by two things:
1. My CSI driver's DeleteSnapshot RPC call takes a long time (about 100 seconds) in some circumstances. In my opinion, that is reasonable for an automated test; for example, the storage system must ensure the leasing info is ready before deleting a snapshot.
2. The gRPC keepalive mechanism.
For the first problem, I tried removing the keepalive dial option and the transport is no longer closed: https://github.com/kubernetes-csi/csi-test/blob/d305288bda09cadb1b0008958757ac56722bf60a/utils/grpcutil.go#L47-L48 So we could add a flag that controls whether keepalive is added to the dial options and disable it by default (see the sketch below).
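A sketch of what I mean (the flag name and defaults are made up, not an existing csi-sanity option):

```go
package sanity

import (
	"flag"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// Hypothetical flag: keepalive stays off by default so a slow but otherwise
// healthy driver does not get its connection closed mid-call.
var enableKeepalive = flag.Bool("csi.enablekeepalive", false,
	"send gRPC keepalive pings to detect dead connections")

func dialOptions() []grpc.DialOption {
	opts := []grpc.DialOption{
		grpc.WithInsecure(),
		grpc.WithBlock(),
	}
	if *enableKeepalive {
		opts = append(opts, grpc.WithKeepaliveParams(keepalive.ClientParameters{
			PermitWithoutStream: true,
		}))
	}
	return opts
}
```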
For the second problem, we can update the gRPC dependency to the latest released version: https://github.com/grpc/grpc-go/pull/2669
Environment:
OS (e.g. cat /etc/os-release): Ubuntu 16.04