prebid / prebid-server

Open-source solution for running real-time advertising auctions in the cloud.
https://prebid.org/product-suite/prebid-server/
Apache License 2.0

Metrics Discussion #2211

Status: Open. Opened by bretg 2 years ago.

bretg commented 2 years ago

Prebid Server has lots of operational metrics. Some would say too many. PBS-Java's metrics are at https://github.com/prebid/prebid-server-java/blob/master/docs/metrics.md

Towards rationalizing the set of metrics, here's a proposed framework that divides them into three types:

A key issue with metrics is the load on the metrics database: tracking metrics at a granular level can be expensive. There is a large number of account × adapter combinations, and with high traffic volume, keeping metrics for every combination quickly becomes costly. We've addressed part of this combinatorial explosion by turning account-level metrics off by default.
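To make the combinatorial explosion concrete, here is a small Go sketch that counts distinct time series by multiplying the number of values each label can take. The account, adapter, and per-pair metric counts are hypothetical round numbers chosen for illustration, not measurements from any host company.

```go
package main

import "fmt"

// seriesCount returns how many distinct time series a metric family
// produces when every combination of its label values is tracked.
// The result is simply the product of the label cardinalities.
func seriesCount(labelCardinalities ...int) int {
	total := 1
	for _, c := range labelCardinalities {
		total *= c
	}
	return total
}

func main() {
	// Hypothetical sizes, for illustration only.
	const accounts, adapters, metricsPerPair = 1000, 100, 10

	// Every account × adapter pair tracked individually.
	perAccountAdapter := seriesCount(accounts, adapters, metricsPerPair)
	fmt.Println("account x adapter series:", perAccountAdapter) // 1000000

	// Account-level metrics turned off: only the adapter dimension remains.
	adapterOnly := seriesCount(adapters, metricsPerPair)
	fmt.Println("adapter-only series:", adapterOnly) // 1000
}
```

Because the total is a product, each extra label multiplies it again, which is why removing a single dimension such as account has such a large effect on database load.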

For this thread, I'd like to propose that 'data quality' metrics don't need to be detailed. Data quality issues should be in logs because they often require several fields to provide the info necessary for debugging. So really all we need is a general alert that lets operational staff know that it's time to go look in the logs. In fact, host companies with advanced log systems wouldn't even need metrics.
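A minimal Go sketch of that split, under stated assumptions: the bucket names alert.general and alert.request, the AlertCounter type, and its Raise method are all hypothetical and not part of PBS. The counter has no account or adapter label, so its series count stays fixed, while the fields needed for debugging go only to the log.

```go
package main

import (
	"log"
	"sync"
)

// Hypothetical coarse alert buckets. They carry no account or adapter
// label, so the number of time series does not grow with traffic.
const (
	AlertGeneral = "alert.general"
	AlertRequest = "alert.request"
)

// AlertCounter is a hypothetical in-memory stand-in for a metrics backend.
type AlertCounter struct {
	mu     sync.Mutex
	counts map[string]int64
}

func NewAlertCounter() *AlertCounter {
	return &AlertCounter{counts: make(map[string]int64)}
}

// Raise increments the coarse bucket and writes the detailed context to
// the log, where operational staff look once the bucket trends upward.
func (a *AlertCounter) Raise(bucket, msg string, fields map[string]string) {
	a.mu.Lock()
	a.counts[bucket]++
	a.mu.Unlock()
	log.Printf("%s: %s %v", bucket, msg, fields)
}

// Count reports the current value of a bucket.
func (a *AlertCounter) Count(bucket string) int64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.counts[bucket]
}

func main() {
	alerts := NewAlertCounter()
	// A hypothetical data-quality failure: the account ID and field name
	// are logged, but only the shared alert.general counter is bumped.
	alerts.Raise(AlertGeneral, "floor data missing required field",
		map[string]string{"account": "acct-123", "field": "currency"})
	alerts.Raise(AlertGeneral, "floor data missing required field",
		map[string]string{"account": "acct-456", "field": "currency"})
	log.Println("alert.general count:", alerts.Count(AlertGeneral))
}
```

Two different accounts hit the same problem here, yet the metrics database sees only one series; distinguishing them is left to the log.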

So as a matter of general error reporting, I'd propose that we start placing data-quality metrics in a small number of buckets:

Looking forward to community input.

SyntaxNode commented 2 years ago

> So really all we need is a general alert that lets operational staff know that it's time to go look in the logs.

I like the idea of a general health trend. Host companies shouldn't try to drive these down to 0, since that won't be possible; instead, they should use them as an indicator of patterns. This would be a use case for control charts or AI-based anomaly detection (not provided by Prebid :) ).

> I'd propose that we start placing data-quality metrics in a small number of buckets:

I'd like to see a more specific idea of what you have in mind for general and request alerts. For example, we already have request errors by endpoint; how would this be different? Might slightly more detailed buckets be more useful, since they would give a better idea of the source of the error? We can add more buckets so long as there is no account or adapter cardinality.

I also like the idea of giving guidance on how long to keep metrics, but that's purely up to the host company to configure. None of the metrics systems supported by PBS-X allow for a TTL.

bretg commented 2 years ago

> I'd like to see a more specific idea of what you have in mind for general and request alerts.

I was thinking that we wouldn't start out by moving existing metrics so much as by having a place to put new alert metrics. For example, several of the recent PRDs define edge cases for data validation. The last thing we need is a separate alert for "floor vendor's JSON doesn't contain a required field". Here are some recent mentions of metrics in PRDs:

It was pointed out in the last meeting that we already have places to put errors:

So to flesh out the proposal further, I propose:

I would move some existing metrics into alert.general: