Showing metrics deviation(p99, errors) while listing services/endpoints

subintp commented 3 years ago

Use Case

In the microservice world, when a customer reports an issue related to errors, degradation, or latency, we start debugging by asking the following questions:

Which service's error rate spiked in the given timeline?
Which endpoint degraded in the given timeline?

We can identify the deviation in errors or latency by going to the respective service/endpoint overview dashboard and checking the patterns in the errors or latency graph. This workflow does not scale to a large number of services and dependencies.

Proposal

Show metric (p99, error) deviation when listing services and endpoints.

Screenshot 2021-09-01 at 1 10 30 AM

kotharironak commented 2 years ago

I think the requirement here is to have a column showing the change in latency (or errors) with respect to the prior hour if the time range selected in the dropdown is 1 hour.

Currently, most of the attributes are calculated at ingestion time. Doing this at ingestion time will be complex as we need the information of the prior hour (predefined window) and currently, our view-gen is stateless. Secondly, it will be limited to a set of pre-defined time windows used for comparison (say 15 mins or 30 mins).

So this seems better suited to query time. Here, I think we will need to fire two queries, one per time window (one for the current hour and one for the prior hour), and calculate the value for that attribute from the two results. Where should we do this: in the query service or the gateway service?
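To make the query-time idea concrete, here is a minimal sketch in TypeScript of combining the two query results into a per-service delta. The `MetricByService` shape, the `computeDeltas` function, and the sample values are assumptions for illustration only; they are not existing Hypertrace APIs or real data.

```typescript
// Hypothetical shape: one metric value (e.g. p99 latency in ms) keyed by service name.
type MetricByService = Record<string, number>;

interface MetricDelta {
  service: string;
  current: number;
  prior: number | undefined; // undefined when the service had no data in the prior window
  percentChange: number | undefined; // undefined when prior is missing or zero
}

// Combine the results of the two queries (current window and prior window)
// into one row per service, with the relative change as a percentage.
function computeDeltas(current: MetricByService, prior: MetricByService): MetricDelta[] {
  return Object.keys(current).map(service => {
    const currentValue = current[service];
    const priorValue: number | undefined = prior[service];
    const percentChange =
      priorValue === undefined || priorValue === 0
        ? undefined
        : ((currentValue - priorValue) / priorValue) * 100;

    return { service: service, current: currentValue, prior: priorValue, percentChange: percentChange };
  });
}

// Made-up sample values, standing in for the two query responses.
const currentHour: MetricByService = { checkout: 300, cart: 100, search: 50 };
const priorHour: MetricByService = { checkout: 200, cart: 100 };
const deltas = computeDeltas(currentHour, priorHour);
```

A service with no data in the prior window (here `search`) gets an undefined change rather than a misleading number; how the UI should render that case is an open question.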

Do we also have to support order by on such a column? @Jayesh, can you think of any other way to capture this requirement in the UI?

@aaron-steinfeld do you have any thoughts on this?

aaron-steinfeld commented 2 years ago

> Currently, most of the attributes are calculated at ingestion time. Doing this at ingestion time will be complex as we need the information of the prior hour (predefined window) and currently, our view-gen is stateless.

Metrics are calculated at read time at a service (or any aggregate) level. Only individual span values are calculated at ingestion time.

The tricky bit is basically what you said: any delta is defined by two time ranges, the current and the comparison (I think there might be some work going on for deltas elsewhere; @jake-bassett, are you aware of any?). Sometimes the previous window makes sense, but that's really use-case driven. For example, if I'm looking at the past hour and this issue has been happening for 2 hours, the prior hour is far less useful to me than the same hour yesterday. So new controls would likely be needed, which introduces more complexity - one of the reasons we've abandoned efforts like this in the past.
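As a rough illustration of the two comparison choices mentioned above, the sketch below derives the comparison range from the current one. `TimeRange`, `ComparisonMode`, and `comparisonRange` are hypothetical names, and the fixed 24-hour offset is a simplifying assumption.

```typescript
interface TimeRange {
  startMs: number; // inclusive, epoch milliseconds
  endMs: number; // exclusive, epoch milliseconds
}

// Hypothetical set of comparison choices a new UI control might expose.
type ComparisonMode = 'previous-window' | 'same-window-yesterday';

const DAY_MS = 24 * 60 * 60 * 1000;

// Shift the current range back by its own length, or by a fixed 24 hours.
// Assumption: a fixed 24h offset is acceptable; it ignores local-time DST shifts.
function comparisonRange(current: TimeRange, mode: ComparisonMode): TimeRange {
  const offsetMs = mode === 'previous-window' ? current.endMs - current.startMs : DAY_MS;

  return { startMs: current.startMs - offsetMs, endMs: current.endMs - offsetMs };
}

const lastHour: TimeRange = {
  startMs: Date.UTC(2021, 8, 1, 10, 0), // 2021-09-01 10:00 UTC
  endMs: Date.UTC(2021, 8, 1, 11, 0) // 2021-09-01 11:00 UTC
};
const previous = comparisonRange(lastHour, 'previous-window'); // 09:00 to 10:00 the same day
const yesterday = comparisonRange(lastHour, 'same-window-yesterday'); // 10:00 to 11:00 on 2021-08-31
```

Each additional mode is another option the user has to understand and pick, which is the added complexity referred to above.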

As far as order by support - if we compute the delta client side, like I was assuming, we wouldn't have support for order by (we could probably hack it in for the current page of data, but I'd argue against the inconsistency). If we compute the delta server side, that's a more significant change, and I guess the answer there would be - depends on how we introduce that support.
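The inconsistency with sorting only the current page can be shown with a small TypeScript sketch. The rows, page size, and `sortByDeltaDesc` helper are made up for illustration and assume the server returns pages in some other order (here, by name).

```typescript
interface Row {
  service: string;
  percentChange: number;
}

// Hypothetical full result set in server order (by name); values are made up.
const allRows: Row[] = [
  { service: 'auth', percentChange: 5 },
  { service: 'cart', percentChange: 20 },
  { service: 'checkout', percentChange: -3 },
  { service: 'search', percentChange: 80 }
];
const pageSize = 2;

// Sort a copy of the rows by delta, largest increase first.
function sortByDeltaDesc(rows: Row[]): Row[] {
  return rows.slice().sort((a, b) => b.percentChange - a.percentChange);
}

// Client-side "hack": sort only the rows of the page already fetched.
const firstPageSortedLocally = sortByDeltaDesc(allRows.slice(0, pageSize));

// What a true order by would return: sort everything, then take the first page.
const firstPageSortedGlobally = sortByDeltaDesc(allRows).slice(0, pageSize);
```

With these values the locally sorted page starts with `cart`, while a real order by would start with `search`, which sits on the second page and would never surface in the first one.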

hypertrace / hypertrace-ui

Showing metrics deviation(p99, errors) while listing services/endpoints #1102
