Metrics Discussion - Githubissues

bretg commented 2 years ago

Prebid Server has lots of operational metrics. Some would say too many. PBS-Java's metrics are at https://github.com/prebid/prebid-server-java/blob/master/docs/metrics.md

Towards rationalizing the set of metrics, here's a proposed framework that divides them into three types:

operational: metrics covering things the host company has direct control over -- the hardware, config values, connections to PBC/geo-lookups. These metrics are useful to store for 30-90 days.
business reporting: metrics about things the business needs to understand for cost and revenue reasons -- how many requests to bid adapters and from accounts. These metrics may be desired for longer term, perhaps a year.
data quality: metrics covering things that can go wrong with input from clients or responses from adapters. It's assumed that dramatic problems will be caught by either client or SSP/DSP, but the host company is in a position to help everyone detect edge-case problems. These metrics are of interest only in the short-term... less than 30 days.
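
As a rough illustration of the three types, the suggested retention windows can be written down as data. This is only a sketch: the class, the dictionary keys, and the helper name are hypothetical, the day counts are upper bounds read from the ranges above ("perhaps a year" is taken as 365 days), and actual retention is whatever the host company configures in its metrics system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricCategory:
    name: str
    max_retention_days: int  # upper bound suggested in the proposal, not enforced by PBS


# Hypothetical keys; figures come from the ranges in the list above.
CATEGORIES = {
    "operational": MetricCategory("operational", 90),
    "business": MetricCategory("business reporting", 365),
    "data_quality": MetricCategory("data quality", 30),
}


def is_past_retention(category: str, age_days: int) -> bool:
    """Return True when a data point is older than the suggested window."""
    return age_days > CATEGORIES[category].max_retention_days
```

For example, a 45-day-old data point would already be past the suggested window for data quality, but not for operational metrics.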

A key issue with metrics is the load on the metrics database: tracking metrics at a granular level can be expensive. There are a large number of account × adapter combinations, and with a high volume of traffic, keeping metrics for all of them can become expensive. We've addressed part of this combinatorial explosion by turning account-level metrics off by default.
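
To make the combinatorial explosion concrete, here is a small arithmetic sketch. The account, adapter, and metric counts are made-up numbers for illustration, not figures from any real deployment.

```python
def series_count(accounts: int, adapters: int, metrics_per_pair: int) -> int:
    """Distinct time series when each metric is keyed by both account and adapter."""
    return accounts * adapters * metrics_per_pair


# Hypothetical host: 500 accounts, 100 adapters, 5 metrics tracked per combination.
per_account_and_adapter = series_count(accounts=500, adapters=100, metrics_per_pair=5)

# Same metrics with the account dimension dropped (account-level metrics off).
per_adapter_only = series_count(accounts=1, adapters=100, metrics_per_pair=5)

print(per_account_and_adapter, per_adapter_only)
```

With these assumed numbers, dropping the account dimension cuts the series count by a factor of 500, which is why keeping account and adapter cardinality out of new metrics matters.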

For this thread, I'd like to propose that 'data quality' metrics don't need to be detailed. Data quality issues should be in logs because they often require several fields to provide the info necessary for debugging. So really all we need is a general alert that lets operational staff know that it's time to go look in the logs. In fact, host companies with advanced log systems wouldn't even need metrics.

So as a matter of general error-reporting, I'd propose that we start placing data-quality metrics in a small number of buckets:

alerts.general - this may be enough? If not, then perhaps a high level grouping like:
alerts.request
alerts.response
alerts.modules
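
A minimal sketch of the idea, assuming an in-memory counter stands in for the real metrics backend: the metric only says "something happened" in a coarse bucket, while the fields needed for debugging go to the log. The function name, logger name, and example fields are hypothetical and are not part of Prebid Server.

```python
import logging
from collections import Counter

logger = logging.getLogger("pbs.dataquality")  # hypothetical logger name
alert_counters: Counter = Counter()            # stand-in for a metrics client

ALERT_BUCKETS = {"general", "request", "response", "modules"}


def report_data_quality_issue(bucket: str, message: str, **fields) -> str:
    """Count the issue in a coarse bucket and log the details.

    Unknown buckets fall back to alerts.general so that metric
    cardinality stays fixed no matter what callers pass in.
    """
    name = f"alerts.{bucket if bucket in ALERT_BUCKETS else 'general'}"
    alert_counters[name] += 1
    # Account, adapter, and similar identifiers go in the log, not the metric name.
    logger.warning("%s: %s %s", name, message, fields)
    return name


report_data_quality_issue("response", "required field missing", adapter="exampleAdapter")
report_data_quality_issue("floors", "vendor JSON invalid", account="1001")
```

The second call uses a bucket that is not in the list, so it is counted under alerts.general.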

Looking forward to community input.

SyntaxNode commented 2 years ago

> So really all we need is a general alert that lets operational staff know that it's time to go look in the logs.

I like the idea of a general health trend. Host companies should avoid trying to drive these down to 0, since that won't be possible, and should instead use them as an indicator of patterns. This would be a use case for control charts or AI-based anomaly detection (not provided by Prebid :) )
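
For what a control chart might look like on the host company's side, here is a simple Shewhart-style sketch: limits are set at the baseline mean plus or minus three sample standard deviations, and only points outside those limits are flagged. The function names and the sample counts are made up for illustration, and nothing like this ships with Prebid.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Lower and upper control limits from a baseline of per-interval alert counts."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation; needs at least 2 points
    return max(0.0, center - sigmas * spread), center + sigmas * spread


def out_of_control(baseline, observed, sigmas=3.0):
    """Indexes of observed counts that fall outside the control limits."""
    lower, upper = control_limits(baseline, sigmas)
    return [i for i, value in enumerate(observed) if value < lower or value > upper]


# Hypothetical alert counts per hour during a normal period.
baseline_counts = [10, 12, 11, 9, 10, 11, 12, 10]
# New observations; the third hour shows a spike.
flagged = out_of_control(baseline_counts, [11, 10, 40, 12])
```

The point is that a steady, non-zero alert rate is treated as normal, and only a departure from the usual pattern sends staff to the logs.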

> I'd propose that we start placing data-quality metrics in a small number of buckets:

I'd like to see a more specific idea of what you have in mind for general and request alerts. For example, we already have request errors by endpoint - how would this be different? Might it be more useful for slightly more detailed buckets to give a better idea as to the source of the error? We can add more so long as there is no account or adapter cardinality.

I also like the idea of giving guidance on how long to keep metrics, but that's purely up to the host company to configure. None of the metrics systems supported by PBS-X allow for a TTL.

bretg commented 2 years ago

> I'd like to see a more specific idea of what you have in mind for general and request alerts.

I was thinking that we wouldn't start out moving existing metrics so much as having a place to put new alert metrics. For example, several of the recent PRDs define edge cases for data validation. Last thing we need is a separate alert for "floor vendor's JSON doesn't contain a required field". Here are some recent mentions of metrics in PRDs:

price-floors fetch failure
errors in dynamic account configuration: syntax, unknown values, data types
events.requests.{err,ok}
adapter.ADAPTER.requests.badserverresponse
rejection of a bidresponse in the ORTB blocking module

It was pointed out in the last meeting that we already have places to put errors:

requests.badinput.{amp/web/app}.count
accounts.ACCOUNT.requests.rejected.count
adapter.ADAPTER.requests.unknown_error.count // not clear whether 'requests' here can also mean 'response'

So to flesh out the proposal more, I propose:

existing metrics fall into the operational and business categories noted above. Any new metrics that are needed for longer periods of time should go into one of the existing high level categories.
short term data quality metrics that can be cleaned up often should go into a new metric alert.general.

I would move some existing metrics into alert.general:

errors in dynamic account configuration. I don't need to know that someone screwed up a DB entry for longer than it takes to fix it.
others TBD

prebid / prebid-server

Metrics Discussion #2211