ICPetcdHighNumberOfFailedGRPCRequests triggers meaningless alert

haoqing0110 commented 5 years ago

Hi, all,

I'm from the IBM Cloud Private team. In our environment, the alert ICPetcdHighNumberOfFailedGRPCRequests fires frequently (every 5 minutes).

In my investigation, I found that the alert is triggered by a normal operation. Every time etcdctl lease keep-alive $lease interacts with the etcd cluster, etcd writes the log line below, and the ICPetcdHighNumberOfFailedGRPCRequests alert then fires.

{"log":"2019-07-05 08:20:02.426997 D | etcdserver/api/v3rpc: failed to receive lease keepalive request from gRPC stream (\"rpc error: code = Unavailable desc = client disconnected\")\n","stream":"stderr","time":"2019-07-05T08:20:02.427136464Z"}

The alert rule does not seem meaningful when grpc_code="Unavailable" or grpc_method="LeaseKeepAlive", so we would like to change https://github.com/ibm-cloud-architecture/CSMO-ICP/blob/master/prometheus/alerts_icp_2.1.0.2-3.1.1/alert-rules-icp311.yaml#L34 to the content below.

     - alert: ICPetcdHighNumberOfFailedGRPCRequests
       annotations:
         message: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{
           $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
       expr: |
         100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK", grpc_code!="Unavailable", grpc_method!="LeaseKeepAlive"}[5m])) BY (job, instance, grpc_service, grpc_method)
           /
         sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) BY (job, instance, grpc_service, grpc_method)
           > 1
       for: 10m
       labels:
         severity: warning
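
To make the effect of the extra matchers concrete, here is a minimal Python sketch of the ratio the expression computes for a single grpc_method on one instance. The per-second rates are invented sample values, not measurements from our cluster, and failure_percent is a hypothetical helper that only mimics the arithmetic; it is not part of Prometheus.

```python
# Hypothetical per-second request rates for one etcd instance, keyed by
# (grpc_method, grpc_code). The numbers are invented for illustration only.
SAMPLE_RATES = {
    ("LeaseKeepAlive", "OK"): 0.0,
    ("LeaseKeepAlive", "Unavailable"): 0.25,
    ("Range", "OK"): 49.0,
    ("Range", "Unavailable"): 0.5,
    ("Range", "Internal"): 0.5,
}


def failure_percent(rates, method, excluded_codes=(), excluded_methods=()):
    """Mimic 100 * sum(rate(failed)) / sum(rate(all)) for one grpc_method.

    Returns None when the numerator selects no series, which is roughly how
    PromQL drops a group that has no matching left-hand side.
    """
    if method in excluded_methods:
        return None
    failed = [
        rate
        for (m, code), rate in rates.items()
        if m == method and code != "OK" and code not in excluded_codes
    ]
    if not failed:
        return None
    total = sum(rate for (m, _), rate in rates.items() if m == method)
    if total == 0:
        return None
    return 100 * sum(failed) / total


# Current rule: only grpc_code!="OK" in the numerator.
current = {m: failure_percent(SAMPLE_RATES, m) for m in ("LeaseKeepAlive", "Range")}

# Proposed rule: also drop grpc_code="Unavailable" and grpc_method="LeaseKeepAlive".
proposed = {
    m: failure_percent(
        SAMPLE_RATES,
        m,
        excluded_codes=("Unavailable",),
        excluded_methods=("LeaseKeepAlive",),
    )
    for m in ("LeaseKeepAlive", "Range")
}

print(current)
print(proposed)
```

With these sample numbers, the current rule reports every LeaseKeepAlive request as failed, while the proposed rule produces no value for that method at all. The Range percentage also drops because its Unavailable failures are no longer counted.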

I am submitting this issue to ask for your opinion. We hope to make this change to avoid meaningless alerts.

Someone hit a similar issue in: https://github.com/openshift/cluster-monitoring-operator/issues/248

haoqing0110 commented 5 years ago

@RayStoner @RobertJBarron @rafal-szypulka can you help with this?

rafal-szypulka commented 5 years ago

@haoqing0110 it looks like the source of the problem is a still-unresolved etcd bug: https://github.com/etcd-io/etcd/issues/10289. The problem exists not only in ICP but also in OpenShift: https://github.com/etcd-io/etcd/pull/10629. In my opinion, this alert should be disabled until the bug is resolved in etcd. Another option would be to filter with grpc_code!="Unavailable", grpc_method!="LeaseKeepAlive" as you did, but I am not completely sure it would give us meaningful results. I would just disable this alert rule for now.
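
One reason for my doubt, sketched in Python: the two matchers are applied independently, so together they hide more than the noisy LeaseKeepAlive/Unavailable pair. The label pairs below are made-up examples (the code names are standard gRPC status codes), and kept_by_pair_only is a hypothetical narrower filter for comparison, not something that exists in the current rule file.

```python
# Made-up (grpc_method, grpc_code) pairs standing in for failed-request series.
FAILED_SERIES = [
    ("LeaseKeepAlive", "Unavailable"),  # the noisy pair from the log above
    ("LeaseKeepAlive", "Internal"),
    ("Range", "Unavailable"),
    ("Range", "DeadlineExceeded"),
]


def kept_by_two_matchers(method, code):
    """Mirror grpc_code!="OK", grpc_code!="Unavailable", grpc_method!="LeaseKeepAlive"."""
    return code != "OK" and code != "Unavailable" and method != "LeaseKeepAlive"


def kept_by_pair_only(method, code):
    """Hypothetical narrower filter: drop only LeaseKeepAlive + Unavailable."""
    return code != "OK" and not (method == "LeaseKeepAlive" and code == "Unavailable")


# Series that a pair-only filter would still count but the two matchers hide.
additionally_hidden = [
    series
    for series in FAILED_SERIES
    if kept_by_pair_only(*series) and not kept_by_two_matchers(*series)
]

print(additionally_hidden)
```

In this sketch the two matchers also drop Unavailable errors on other methods and every other error code on LeaseKeepAlive, which may or may not be what we want from the alert.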

haoqing0110 commented 5 years ago

@rafal-szypulka Thanks! Disabling it works for our case.

ibm-cloud-architecture / CSMO-ICP

ICPetcdHighNumberOfFailedGRPCRequests triggers meaningless alert #8