Prebid Cache Cross Data Center Lookup

SyntaxNode commented 3 years ago

This is a follow-up to https://github.com/prebid/prebid-server/issues/1562 to focus on the situation where a host has multiple Prebid Cache data centers which do not sync with each other and the end user is directed to a different data center for the PUT and GET requests.

Summary

Prebid Cache provides hosts with the ability to configure a variety of different backend storage systems. These storage systems may run in an isolated state or sync with each other. Due to the large amount of data retrieved shortly after being written and the low chance of a cross data center lookup, many hosts including Xandr and Magnite do not sync their data center caches. As @bretg mentioned, it would be impossible (or at least prohibitively expensive) to try to replicate caches of this size globally within milliseconds.

We have not seen evidence of widespread issues with this setup that's been in place for many years, but there are a number of community reports which indicate otherwise. I'd like to begin our investigation by measuring the rate of occurrence to determine if we need to build a solution.

Proposal

Include a new feature for Prebid Cache to determine if a GET request is for a PUT request handled by a different data center. I see two options:

Accept a new query parameter for the GET request which is set the hb_cache_host targeting key via macro resolution. I believe this would be the cleanest solution, but I recognize it requires action to be taken by the publishers. I'm hopeful publishers suspecting this is an issue would be willing to assist in collecting metrics.
Encode the data center into the already automatically generated cache id. Some Prebid Cache calls provide their own cache keys which obviously wouldn't work, but that use case is likely small enough that we can still collect enough metrics.

Thoughts?

bretg commented 3 years ago

Discussed in PBS committee

PBC does have a read miss metric but it doesn't distinguish between different reasons like timeout, bad UUID, or wrong datacenter. However, Magnite sees only about 1% read-miss rate, so this doesn't appear to be a major problem.

We don't particularly like any of the available measurement solutions, so at this time we're proposing to adopt a wait-and-see approach. If the community has data that shows a more concrete problem, please post it to this issue.

spormeon commented 3 years ago

if you hit the LB'er and go to datacentre 3, instead of 2, what metric is collected there? none? No pubs got anyway to test this, they cant hit the server ip behind the Lb'er, the only thing there going to know if % discrepancies between imps they thought they had V what is/ was "paid" recorded in back systems, there juts going to "take it on the chin" as loss. Oh well 10%, 20%, 5% difference, what can I do?

bretg commented 3 years ago

if you hit the LB'er and go to datacentre 3, instead of 2, what metric is collected there?

We would be seeing cache read misses on datacenter 3. We're not.

Are you actually seeing 20% discrepancy between Prebid line items delivered (bids won) and video impressions? If that's the case, then would you be willing to update your ad server creatives to add another parameter?

bretg commented 3 years ago

We still don't have evidence that this is a problem, but I'll move the ball forward by proposing a relatively small feature based on SyntaxNode's first proposal above:

Accept a new query parameter for the GET request which is set the hb_cache_host targeting key via macro resolution

1) support a new "ch"(cache host) parameter on the /cache endpoint http://HOST_DOMAIN/cache?uuid=%%PATTERN:hb_uuid%%&ch=%%PATTERN:hb_cache_host%%

2) the hb_cache_host is set by PBS to the actual direct host name of the cache server

            "hb_cache_host": "pg-prebid-server-aws-usw2.rubiconproject.com:443",

3) when PBC receives requests with the 'ch' parameter, it's validated and processed

 a) if the hostname portion is the localhost, then cool, end-of-line. Look up the uuid as normal.
 b) otherwise, verify that the named host is acceptable. We are not an open redirector. e.g. configure a regex in PBC that ensures that all `ch` values conform to *.hostdomain.com
 c) if the host is ok, proxy the request but remove the `ch` parameter. One hop only. No chains allowed. Add the other pieces of the URL as needed -- the "https" protocol, the URI path, and the uuid parameter.
     - when the response comes back, log a metric: pbc.proxy.success or pbc.proxy.failure
     - return the value to the client
 e) if the host did not match the regex, just ignore the `ch` parameter. Look up the uuid as normal.

patmmccann commented 3 years ago

Fwiw, a 1% read miss rate seems like a rather substantial problem to me.

bretg commented 3 years ago

read misses can come from late or late-and-duplicate requests as well as wrong datacenter.

anyhow, appreciate the kick here -- this had dropped off our radar, put it back in the stack of tickets to get done this summer.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

bretg commented 1 year ago

This was partly released with PBC-Java 1.13, but there's an outstanding bug where most requests get 'Did not observe any item or terminal signal' errors

prebid / prebid-server

Prebid Cache Cross Data Center Lookup #1620

Summary

Proposal