Open haoqing0110 opened 5 years ago
@RayStoner @RobertJBarron @rafal-szypulka can you help with this?
@haoqing0110 it looks like the source of the problem is a still-unresolved etcd bug: https://github.com/etcd-io/etcd/issues/10289. This problem exists not only in ICP but also in OpenShift: https://github.com/etcd-io/etcd/pull/10629
In my opinion, this alert should be disabled until the bug is resolved in etcd. Another option would be to filter with grpc_code!="Unavailable", grpc_method!="LeaseKeepAlive" as you did, but I am not completely sure it would give us meaningful results. I would just disable this alert rule for now.
@rafal-szypulka Thanks! Disabling it works for our case.
Hi, all,
I'm from the IBM Cloud Private team. In our environment, we found that the alert ICPetcdHighNumberOfFailedGRPCRequests is triggered frequently (every 5 minutes). In my investigation, I found that the alert is triggered by a normal operation. Every time
etcdctl lease keep-alive $lease
interacts with the etcd cluster, a log entry is produced, which then triggers the ICPetcdHighNumberOfFailedGRPCRequests alert. The alert rule does not seem meaningful when grpc_code="Unavailable" or grpc_method="LeaseKeepAlive", so we would like to change the rule at https://github.com/ibm-cloud-architecture/CSMO-ICP/blob/master/prometheus/alerts_icp_2.1.0.2-3.1.1/alert-rules-icp311.yaml#L34 to filter out these cases. I am submitting this issue to ask for your opinion. We hope to make this change to avoid meaningless alerts.
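For illustration only, here is a rough sketch of what such a filtered rule might look like. It is not the content of the linked file: the group name, the job="etcd" selector, the 5m rate window, the 0.01 threshold, the `for` duration, and the severity label are all assumptions. Only the alert name, the grpc_code and grpc_method filters, and the use of a failed-request ratio come from this discussion; grpc_server_handled_total is the standard gRPC server metric that etcd exposes.

```yaml
groups:
  - name: etcd-grpc-sketch          # hypothetical group name
    rules:
      - alert: ICPetcdHighNumberOfFailedGRPCRequests
        # Assumed expression: share of non-OK gRPC responses per service/method,
        # ignoring the Unavailable code and the LeaseKeepAlive method.
        expr: |
          sum(rate(grpc_server_handled_total{job="etcd", grpc_code!="OK", grpc_code!="Unavailable", grpc_method!="LeaseKeepAlive"}[5m])) by (grpc_service, grpc_method)
            /
          sum(rate(grpc_server_handled_total{job="etcd", grpc_method!="LeaseKeepAlive"}[5m])) by (grpc_service, grpc_method)
            > 0.01
        for: 10m                    # assumed duration
        labels:
          severity: warning         # assumed severity
```

Because both matchers are applied together, a series is dropped if it has either the Unavailable code or the LeaseKeepAlive method, which matches the "either condition" wording above. Excluding Unavailable everywhere may also hide genuine availability problems, so the thresholds and filters would need checking against real traffic.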
Someone ran into a similar issue in: https://github.com/openshift/cluster-monitoring-operator/issues/248