google / knative-gcp

GCP event implementations to use with Knative Eventing.
https://github.com/knative/eventing
Apache License 2.0

Metrics around Trigger Replies #2117

Status: Closed (Harwayne closed this issue 3 years ago)

Harwayne commented 3 years ago

Problem: We have metrics for Triggers sending events to the subscriber (event-count and dispatch-latency), but we don't have any metrics around sending replies back into the Broker. After #2107, if the Trigger's reply request does not succeed, the overall delivery is considered a failure and the event is sent to the subscriber again. We need metrics so that users can understand what went wrong.

Exit Criteria: Triggers expose metrics around sending replies to the Broker's ingress.
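A minimal sketch of what such a reply metric could record, written in Go. It assumes a hypothetical reply-event-count counter labeled by trigger, response code, and response code class, mirroring how event-count is described in this thread. The type and function names are illustrative only and are not knative-gcp's actual metrics API.

```go
package main

import "fmt"

// replyKey holds the labels of one hypothetical reply-event-count series.
// The label names are assumptions for illustration.
type replyKey struct {
	Trigger           string
	ResponseCode      int
	ResponseCodeClass string
}

// replyCounter is an in-memory stand-in for a real metrics backend counter.
type replyCounter struct {
	counts map[replyKey]int
}

func newReplyCounter() *replyCounter {
	return &replyCounter{counts: make(map[replyKey]int)}
}

// codeClass maps a status code such as 503 to its class label, e.g. "5xx".
func codeClass(code int) string {
	return fmt.Sprintf("%dxx", code/100)
}

// Record counts one reply sent to the Broker ingress that got the given status code.
func (c *replyCounter) Record(trigger string, code int) {
	c.counts[replyKey{trigger, code, codeClass(code)}]++
}

// Count returns how many replies from trigger received the given status code.
func (c *replyCounter) Count(trigger string, code int) int {
	return c.counts[replyKey{trigger, code, codeClass(code)}]
}

func main() {
	c := newReplyCounter()
	c.Record("my-trigger", 202)
	c.Record("my-trigger", 503)
	c.Record("my-trigger", 503)
	fmt.Println(c.Count("my-trigger", 503)) // 2
}
```

A real implementation would record through the project's metrics library rather than an in-memory map; the map only shows which labels the metric would need to carry.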


bharattkukreja commented 3 years ago

Doesn't this correspond to filtering out 2xx responses and summing the values grouped by all labels except response_code/response_code_class?
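The aggregation suggested above can be sketched over in-memory samples as follows. The `sample` type and its label set (broker, trigger, response code, response code class) are assumptions for illustration; a metrics backend would express the same thing as a filter followed by a sum grouped by the remaining labels.

```go
package main

import "fmt"

// sample is one series of a hypothetical event-count metric.
// The label set is assumed for illustration.
type sample struct {
	Broker            string
	Trigger           string
	ResponseCode      int
	ResponseCodeClass string
	Value             int
}

// groupKey is every label except ResponseCode and ResponseCodeClass.
type groupKey struct {
	Broker  string
	Trigger string
}

// non2xxByTrigger drops 2xx series and sums the rest per remaining label set,
// the equivalent of "filter out 2xx, then sum grouped by the other labels".
func non2xxByTrigger(samples []sample) map[groupKey]int {
	out := make(map[groupKey]int)
	for _, s := range samples {
		if s.ResponseCodeClass == "2xx" {
			continue
		}
		out[groupKey{s.Broker, s.Trigger}] += s.Value
	}
	return out
}

func main() {
	samples := []sample{
		{"default", "t1", 200, "2xx", 10},
		{"default", "t1", 500, "5xx", 3},
		{"default", "t1", 404, "4xx", 2},
		{"default", "t2", 202, "2xx", 7},
	}
	fmt.Println(non2xxByTrigger(samples)[groupKey{"default", "t1"}]) // 5
}
```

As Harwayne's reply below points out, this counts failed deliveries to the subscriber, not failed replies, so it cannot stand in for a reply metric.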

bharattkukreja commented 3 years ago

Created a similar issue in knative/eventing, based on Adam's suggestion that it could be worthwhile to expose these metrics in OSS: https://github.com/knative/eventing/issues/4831

Harwayne commented 3 years ago

Quoting bharattkukreja: "Doesn't this correspond to filtering out 2xx responses and summing the values in a group by clause containing all the labels apart from response_code/response_code_class?"

I don't think so. The only events that would affect reply-event-count are those already counted under a 2xx response code in event-count, because a reply is only attempted when the request to the subscriber succeeds and the subscriber replies with an event.

Essentially, the flow of the event is something like:

  1. An event is sent to the Broker and goes to this Trigger.
  2. This Trigger sends the event to the subscriber.
  3. The subscriber responds with a 200 and a cloud event.
    • This is added to the event-count and dispatch-latency metrics.
  4. The Trigger sends the replied event to the reply address.
  5. The reply address responds with some status code.
    • This is added to the reply-event-count metric suggested in this issue (the name is only a placeholder, not a naming proposal).
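The flow above can be sketched as a single delivery function. Everything here is hypothetical: the `response` and `metrics` types, the `deliver` function, and the in-memory counters stand in for the real dispatcher and metrics pipeline, and dispatch-latency is omitted. The sketch shows why a reply metric is needed: event-count records a 200, yet the delivery as a whole fails because the reply to the Broker ingress fails.

```go
package main

import (
	"errors"
	"fmt"
)

// response is what the subscriber sends back; Reply is nil when it
// answers without an event.
type response struct {
	Code  int
	Reply *string
}

// metrics holds simplified counters. replyEventCount stands in for the
// placeholder reply-event-count metric named in this issue.
type metrics struct {
	eventCount      map[int]int // keyed by subscriber response code
	replyEventCount map[int]int // keyed by reply-address response code
}

func is2xx(code int) bool { return code >= 200 && code < 300 }

// deliver walks steps 2-5 of the flow. A non-nil error means the overall
// delivery failed, so (per the behavior described after #2107) the event
// would be sent to the subscriber again.
func deliver(m *metrics, event string, subscriber func(string) response, replyAddr func(string) int) error {
	// Steps 2-3: send to the subscriber and record event-count.
	resp := subscriber(event)
	m.eventCount[resp.Code]++
	if !is2xx(resp.Code) {
		return fmt.Errorf("subscriber returned %d", resp.Code)
	}
	if resp.Reply == nil {
		return nil // no reply event, so no reply is attempted
	}
	// Steps 4-5: send the reply to the reply address and record the reply metric.
	code := replyAddr(*resp.Reply)
	m.replyEventCount[code]++
	if !is2xx(code) {
		return errors.New("reply to Broker ingress failed")
	}
	return nil
}

func main() {
	m := &metrics{eventCount: map[int]int{}, replyEventCount: map[int]int{}}
	reply := "reply-event"
	sub := func(string) response { return response{Code: 200, Reply: &reply} }
	ingress := func(string) int { return 503 }
	err := deliver(m, "event", sub, ingress)
	fmt.Println(m.eventCount[200], m.replyEventCount[503], err != nil) // 1 1 true
}
```

In the example `main`, event-count records a success while the reply metric records the 503 that caused the failure. Without a reply metric, the only visible symptom would be the redelivery.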
github-actions[bot] commented 3 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.