openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0

etcd - All grpc_code for grpc_method "Watch" is "Unavailable" #20311

Open · Reamer opened 6 years ago

Reamer commented 6 years ago

Hi, I noticed that every grpc_code for grpc_method "Watch" is "Unavailable" in my OKD cluster. My plan is to monitor the etcd instances with the default Prometheus alerts from the etcd project. Maybe the watch connection is not closed correctly and runs into a timeout.
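
For reference, the alert that fires here is etcdHighNumberOfFailedGRPCRequests from the etcd project's example alert rules. The expression below is only a sketch of the rule's shape, not the exact upstream rule; thresholds and label matchers differ between versions:

100 * sum(rate(grpc_server_handled_total{grpc_code!="OK", job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)
  /
sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)
  > 1

Since the Watch completions are counted almost exclusively under grpc_code="Unavailable", a ratio like this presumably crosses its threshold whenever enough watch streams terminate, even though the cluster looks healthy.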

Version
Client Version: 4.7.18
Server Version: 4.7.0-0.okd-2021-08-22-163618
Kubernetes Version: v1.20.0-1093+4593a24e8fd58d-dirty
Steps To Reproduce
  1. Install OKD 4.7
  2. Switch to the etcd project: oc project openshift-etcd
  3. Log in to the first etcd member: oc rsh etcd-master1.mycompany.com
  4. Query the local metrics endpoint (a grep variant is shown under Additional Information below): curl -s --cacert "/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt" --cert "/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-master1.mycompany.com.crt" --key "/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-master1.mycompany.com.key" https://localhost:2379/metrics
Current Result
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 1434
Expected Result
grpc_server_handled_total{grpc_code="OK",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 1434

Additional Information

If that behavior is already fixed or it's a false positive, let me know.
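
To pull out just the Watch series from the metrics dump, the curl invocation from step 4 can be piped through grep (a convenience sketch, same certificate paths as above):

curl -s --cacert "/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt" --cert "/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-master1.mycompany.com.crt" --key "/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-master1.mycompany.com.key" https://localhost:2379/metrics | grep 'grpc_method="Watch"'

This prints every series labelled with the Watch method, which makes it easy to check whether any grpc_code="OK" sample exists at all.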

jwforres commented 6 years ago

@openshift/sig-master

Reamer commented 6 years ago

Still present with 3.10

oc v3.10.0+0c4577e-1
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://s-cp-lb-01.cloud.example.de:443
openshift v3.10.0+7eee6f8-2
kubernetes v1.10.0+b81c8f8
openshift-bot commented 6 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

vsliouniaev commented 6 years ago

+1 on this. We've disabled this alert on our setup because it's just flapping and not indicating any failures.
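
Rather than disabling the alert completely, one workaround (a sketch, assuming the stock etcdHighNumberOfFailedGRPCRequests expression) is to exclude the Watch method from the failed-request ratio with an extra label matcher:

100 * sum(rate(grpc_server_handled_total{grpc_code!="OK", grpc_method!="Watch", job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)
  /
sum(rate(grpc_server_handled_total{grpc_method!="Watch", job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)
  > 1

That keeps the alert for every other gRPC method while ignoring the always-"Unavailable" Watch completions.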

Reamer commented 6 years ago

/remove-lifecycle stale

gaopeiliang commented 6 years ago

+1 on this. I also see it on the etcd cluster master node after adding etcd3_alert.rules.

[screenshot attached]

The alert cycles roughly every five minutes, but we can't find anything wrong with the Kubernetes cluster itself.

gaopeiliang commented 6 years ago

/remove-lifecycle stale

arslanbekov commented 5 years ago

+1. I ran etcd with debug log level and found this error:

etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 71; CANCEL")

The error appears roughly once every 5 minutes, each time with a unique stream ID.

etcd 3.2.24 / 3.2.25 / 3.3.10, monitored with Prometheus (which is where I'm getting this alert).
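
To get a feel for the frequency, the debug log can be grepped for that message (a sketch, assuming etcd runs under systemd as the etcd unit; adjust for containerized setups):

journalctl -u etcd --since "1 hour ago" | grep -c 'failed to receive watch request'

The resulting count should line up with the roughly one-per-5-minutes rate described above.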

Any updates?

judexzhu commented 5 years ago

+1, etcd 3.3.10 with Prometheus Operator on Kubernetes 1.11.5.

I have 5 nodes, but only one node is firing the alert; the others seem fine.

The etcd cluster itself runs without issues.
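
A per-instance breakdown makes that comparison easy (a sketch; the exact job label depends on the scrape config):

sum by (instance) (rate(grpc_server_handled_total{grpc_code="Unavailable", grpc_method="Watch"}[5m]))

If only one instance shows a non-zero rate, that member is presumably the one the API server's watch connections land on.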

[screenshot attached]

zqyangchn commented 5 years ago

[screenshot attached]

zqyangchn commented 5 years ago

/remove-lifecycle stale

openshift-bot commented 5 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Reamer commented 5 years ago

Still reproducible on Origin 3.11

Reamer commented 5 years ago

/remove-lifecycle stale

Reamer commented 5 years ago

Relates to https://github.com/openshift/cluster-monitoring-operator/pull/340 and https://github.com/etcd-io/etcd/issues/10289

openshift-bot commented 5 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Reamer commented 5 years ago

/remove-lifecycle stale

Still present.

openshift-bot commented 4 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Reamer commented 4 years ago

/lifecycle frozen
/remove-lifecycle stale

hexfusion commented 4 years ago

/assign

Joseph94m commented 3 years ago

Any news about this?

Reamer commented 3 years ago

At the moment I am using OKD 4.7 and this bug is still present. Prometheus query:

grpc_server_handled_total{grpc_code="Unavailable",grpc_service="etcdserverpb.Watch"}