kubernetes-sigs / aws-ebs-csi-driver

CSI driver for Amazon EBS https://aws.amazon.com/ebs/
Apache License 2.0

Request for Information: PVC Creation Time Metrics in ebs-csi-driver #1873

Closed atilsensalduz closed 11 months ago

atilsensalduz commented 11 months ago

I am currently exploring options for monitoring PVC creation times in the ebs-csi-driver, with the goal of setting up alerts if the process exceeds a certain duration, such as 5 minutes.

Upon reviewing the available metrics, I noticed the existence of the cloudprovider_aws_api_request_duration_seconds_bucket metric, and I'm wondering if this metric can be utilized to measure the time taken for PVC creation. Could you please provide more details on this metric and clarify if it can be used for tracking PVC creation times?

Additionally, I'm open to alternative metrics or approaches that you may recommend for monitoring PVC creation times, as well as any other metrics that are useful for tracking the health of the ebs-csi-driver and of the infrastructure that manages PVs.

torredil commented 11 months ago

Hi @atilsensalduz 👋

The cloudprovider_aws_api_request_duration_seconds_bucket metric is useful when measuring the latency (time it takes for AWS to acknowledge the request) of AWS API calls made by the driver, such as AttachVolume. The latency does not account for the full lifecycle of the operation - after the request is ack'd, it will take some amount of time for the volume to transition to attached and so on. See https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency for more details.

To accurately measure the time it takes to create a volume, you'll want to look at csi_sidecar_operations_seconds_sum, for example:

csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="0.1"} 0
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="0.25"} 0
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="0.5"} 0
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="1"} 0
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="2.5"} 0
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="5"} 1
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="10"} 1
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="15"} 1
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="25"} 1
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="50"} 1
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="120"} 1
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="300"} 1
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="600"} 1
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="+Inf"} 1
csi_sidecar_operations_seconds_sum{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false"} 3.303456169
csi_sidecar_operations_seconds_count{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false"} 1
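The sample above can be turned into numbers with a small script. The following is a minimal sketch in Python, not something shipped with the driver: the function name create_volume_stats and the embedded SAMPLE string are assumptions made for illustration, and only three of the lines from the scrape are copied in. It sums the _sum and _count series to get a mean CreateVolume duration, and uses the le="300" bucket to count operations that took longer than 5 minutes.

```python
import re

# A few lines copied from the scrape above (only what this sketch needs).
SAMPLE = '''\
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="300"} 1
csi_sidecar_operations_seconds_sum{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false"} 3.303456169
csi_sidecar_operations_seconds_count{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false"} 1
'''

# metric_name{labels} value
LINE_RE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def create_volume_stats(text, threshold="300"):
    """Hypothetical helper: summarize CreateVolume timings from scraped text.

    Values are added up across every matching series (for example, several
    grpc_status_code values), so the result covers all CreateVolume calls.
    Returns None when no CreateVolume observations are present.
    """
    total = count = within = 0.0
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skips comments, blank lines, and label-less metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        if not labels.get("method_name", "").endswith("/CreateVolume"):
            continue
        name = match.group("name")
        value = float(match.group("value"))
        if name == "csi_sidecar_operations_seconds_sum":
            total += value
        elif name == "csi_sidecar_operations_seconds_count":
            count += value
        elif name == "csi_sidecar_operations_seconds_bucket" and labels.get("le") == threshold:
            within += value  # operations that finished within `threshold` seconds
    if count == 0:
        return None
    return {
        "mean_seconds": total / count,
        "slower_than_threshold": int(count - within),
    }


stats = create_volume_stats(SAMPLE)
print(stats)
```

Note that these series are cumulative since the sidecar started, so an alert would normally look at how they change over a time window rather than at the raw totals.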

To enable this metric, --http-endpoint needs to be set for the external-provisioner sidecar. Currently, you can do this via the additionalArgs Helm parameter: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/09f742f7a545ea4d1d5fef333715e25cd2064c0d/charts/aws-ebs-csi-driver/values.yaml#L25

atilsensalduz commented 11 months ago

Wow, that's fantastic! Thanks a lot @torredil! I'm currently managing the EBS CSI driver as an EKS add-on using Terraform. Could you please review the following configuration? Let me know if there are any corrections needed:

{
  "sidecars": {
    "provisioner": {
      "additionalArgs": [
        "--http-endpoint=0.0.0.0:8080"
      ]
    }
  }
}

torredil commented 11 months ago

Could you please review the following configuration? Let me know if there are any corrections needed

You got it mate, no corrections needed 👍

As a quick sanity check, you should be able to see the relevant metrics by going through this exercise:

1. Grab controller pod that has an active csi-provisioner and port-forward:

    $ export ebs_csi_controller=$(kubectl get pods -n kube-system -o custom-columns=NAME:.metadata.name | grep ebs-csi-controller | while read podname; do if kubectl logs $podname -n kube-system -c csi-provisioner | grep -q "successfully acquired lease"; then echo $podname; fi; done) && kubectl port-forward $ebs_csi_controller 8080:8080 -n kube-system

    Forwarding from 127.0.0.1:8080 -> 8080
    Forwarding from [::1]:8080 -> 8080
    Handling connection for 8080
    Handling connection for 8080


2. Grab the metrics:

    $ curl 0.0.0.0:8080/metrics | grep "CreateVolume"

    % Total % Received % Xferd Average Speed Time Time Time Current
    csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="0.1"} 0
    csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="0.25"} 0
    csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="0.5"} 0
    csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="1"} 0
    csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="2.5"} 0
    csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="5"} 2
    csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="10"} 2
    csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="15"} 2
    csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="25"} 2
    csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="50"} 2
    csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="120"} 2
    csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="300"} 2
    csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="600"} 2
    csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="+Inf"} 2
    csi_sidecar_operations_seconds_sum{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false"} 8.772878436
    csi_sidecar_operations_seconds_count{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false"} 2
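
If you would rather script this sanity check than run curl by hand, the same scrape-and-filter step could look roughly like the Python sketch below. It assumes the port-forward from step 1 is still running on local port 8080. The function names fetch_metrics and filter_method are made up for this example and are not part of the driver or its sidecars.

```python
import urllib.request


def fetch_metrics(url="http://localhost:8080/metrics", timeout=5.0):
    """Download the metrics page as text.

    Assumption: the kubectl port-forward from step 1 is active, so the
    csi-provisioner metrics endpoint is reachable on local port 8080.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def filter_method(text, method="CreateVolume"):
    """Keep metric lines that mention `method`, similar to `grep "CreateVolume"`.

    Unlike plain grep, comment lines (# HELP / # TYPE) are dropped.
    """
    return [
        line
        for line in text.splitlines()
        if line and not line.startswith("#") and method in line
    ]


# Usage, with the port-forward running:
#     for line in filter_method(fetch_metrics()):
#         print(line)
```

The network call is kept separate from the filtering so the filter can be exercised on saved output without a cluster.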

atilsensalduz commented 11 months ago

Hey @torredil

Just wanted to drop a quick note to say thanks for your awesome help with the issue.

Really appreciate your quick response and expertise. You rock! 🚀

Cheers,