knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0
5.57k stars 1.16k forks source link

gRPC Status in queue proxy metrics #13353

Open pradithya opened 2 years ago

pradithya commented 2 years ago

/area monitoring

Ask your question here:

Hi, I am exploring metrics exposed by queue-proxy for gRPC service. As of now, I noticed that in my current knative-serving (1.0.1) the gRPC status is not part of any metrics emitted by queue-proxy. response_code and response_code_class are there, however the value is always successful even for failing gRPC request.

For example, following query

round(sum(rate(revision_request_count[1m])) by (response_code), 0.001)

Produces below graph even though 100% of the request are actually failing.

Screenshot 2022-10-04 at 12 54 52 PM

Is gRPC metrics supported in more recent version? Or is there any plan to expose it?

jwcesign commented 2 years ago

I will check this.

jwcesign commented 2 years ago

I check the code:

func (h *requestMetricsHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    rr := pkghttp.NewResponseRecorder(w, http.StatusOK)
    startTime := time.Now()

    defer func() {
        // Filter probe requests for revision metrics.
        if network.IsProbe(r) {
            return
        }

        // If ServeHTTP panics, recover, record the failure and panic again.
        err := recover()
        latency := time.Since(startTime)
        routeTag := GetRouteTagNameFromRequest(r)
        if err != nil {
            ctx := metrics.AugmentWithResponseAndRouteTag(h.statsCtx,
                http.StatusInternalServerError, routeTag)
            pkgmetrics.RecordBatch(ctx, requestCountM.M(1),
                responseTimeInMsecM.M(float64(latency.Milliseconds())))
            panic(err)
        }
        ctx := metrics.AugmentWithResponseAndRouteTag(h.statsCtx,
            rr.ResponseCode, routeTag)
        pkgmetrics.RecordBatch(ctx, requestCountM.M(1),
            responseTimeInMsecM.M(float64(latency.Milliseconds())))
    }()

    h.next.ServeHTTP(rr, r)
}

It will give all 200, u can't group by this.

BTW, what's the usecase? why need this info?

pradithya commented 1 year ago

BTW, what's the usecase? why need this info?

We are currently using revision_request_count to monitor the knative service (mainly monitoring error rate, throughput, and latency), but recently we added gRPC support and encountered this issue.