prebid / prebid-server

Open-source solution for running real-time advertising auctions in the cloud.
https://prebid.org/product-suite/prebid-server/
Apache License 2.0

Extend GeoLookup ability #3392

Open bretg opened 10 months ago

bretg commented 10 months ago

There are several use cases for having Prebid Server do geographic lookups:

  1. Privacy regulation targeting (GDPR, Activity Controls)
  2. Bidder geo-scoping
  3. Allow for modules to know user location: traffic shaping, timeout optimization, ...

Currently only PBS-Java does geo-lookups, and only for GDPR scope as described here. The lookup is only called when no other signals indicate GDPR scope and when the account wants PBS to enforce GDPR.

The problem is that geo-lookups have a cost in both latency and money, so there should be controls for the host company to manage the volume.

The proposal is to add account-level config that causes PBS to do geo-lookups early in the workflow in support of the above use cases.

  1. This should happen before the raw-auction-request module stage. The goal is that as soon as possible at the start of the workflow a module could shape traffic.
  2. If host company config geolocation.enabled is false, don't do lookup. Default is false.
  3. If the account ID is available, look up the account config; otherwise, use the host-level default account config. If the new config settings.geo-lookup is true (defaults to false) and the request's $.device.geo.country is not specified, then PBS should do the lookup and set device.geo.country to an ISO-3166-1-alpha-3 code and device.geo.region to an ISO-3166-2 code (the 2-letter state code if USA).
  4. Use the existing metrics:
    1. geolocation_requests
    2. geolocation_fail
    3. geolocation_request_time
  5. No change to GDPR processing, other than making sure that it checks for device.geo.country before doing any lookup it needs. To be clear, settings.geo-lookup does not change or disable the ability for PBS to determine GDPR scope per the flowchart. The GDPR lookup feature is disabled if the overall geolocation.enabled is false.
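
To make the gating in steps 2 and 3 concrete, here is a minimal sketch in Python. It is not PBS code: the function name `maybe_geo_lookup`, the `GeoInfo` type, the dict-shaped configs, and the injected `lookup` callable are all assumptions made for illustration. Only geolocation.enabled, settings.geo-lookup, and the device.geo fields come from the proposal.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GeoInfo:
    """Hypothetical lookup result (not a PBS type)."""
    country: str                  # ISO-3166-1 alpha-3, e.g. "USA"
    region: Optional[str] = None  # ISO-3166-2 region, e.g. "CA"


def maybe_geo_lookup(
    request: dict,
    host_geo_enabled: bool,
    account_cfg: Optional[dict],
    default_account_cfg: dict,
    lookup: Callable[[str], Optional[GeoInfo]],
) -> bool:
    """Run the early geo lookup if the proposal's conditions hold.

    Returns True only when a lookup ran and device.geo was filled in.
    """
    # Step 2: host-level kill switch (geolocation.enabled).
    if not host_geo_enabled:
        return False

    # Step 3: use the account config, else the host-level default account config.
    cfg = account_cfg if account_cfg is not None else default_account_cfg
    if not cfg.get("settings", {}).get("geo-lookup", False):
        return False

    # Skip the lookup when the request already carries a country.
    device = request.get("device") or {}
    if (device.get("geo") or {}).get("country"):
        return False

    ip = device.get("ip") or device.get("ipv6")
    if not ip:
        return False

    info = lookup(ip)
    if info is None:
        return False

    # An IP was found above, so request["device"] exists and is a dict.
    if not isinstance(device.get("geo"), dict):
        device["geo"] = {}
    device["geo"]["country"] = info.country
    if info.region:
        device["geo"]["region"] = info.region
    return True
```

The lookup is injected so the host company's geo provider stays out of the decision logic; the existing metrics from step 4 would wrap the `lookup(ip)` call.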

bsardo commented 10 months ago

@bretg in an effort to keep things as simple as possible, I'm wondering if we need the host config geolocation.enabled. If a host company does not want to enable geolocation they should be able to do so by omitting settings.geo-lookup from their account configs and just leaning on the account default settings.geo-lookup which defaults to false. I guess this could be of value though if you want the ability to globally toggle it, perhaps for testing purposes? Is there another scenario I'm missing where you envision this being of value?

From an optimization perspective, this host config isn't needed in PBS-Go since our discussion earlier today at the PMC led to a requirements change where the lookup happens before the raw auction stage instead of before the entrypoint stage. With this requirements change, PBS-Go should be able to take advantage of the existing account fetching logic instead of having to perform geo lookup specific parsing to extract the account ID and fetch the account object.

bretg commented 10 months ago

> wondering if we need the host config geolocation.enabled.

This is an existing config. The use case for keeping it would be as a master kill switch for geo lookup in case there's an issue with the geo lookup servers.

I'm willing to consider removing it in PBS 3.0, but it would be a breaking change to remove it now. I think it would be fine to not implement in PBS-Go if it doesn't exist there now.

> With this requirements change, PBS-Go should be able to take advantage of the existing account fetching logic

There are scenarios where the account ID isn't available until after reading the stored requests.

muuki88 commented 10 months ago

I just wanted to throw another idea in here. It's fair to assume that most hosting companies use a cloud provider or at least some sort of load balancing. Especially if you are operating globally, you probably load balance on the geolocation of the user. Here's a short list of commonly used technologies for load balancing around geo

In the end it boils down to two sorts of load balancing

  1. application (HTTP)
  2. network (TCP/IP)

Both variants can be used to make the geo information available directly in the request, without the need for geo location lookup.

Application (HTTP)

Example: https://cloud.google.com/load-balancing/docs/https/custom-headers?hl=en

If host companies use an application load balancer, it can add HTTP headers that contain the geo location. It's just a matter of reading those from the HTTP request.

From my minimal knowledge, application load balancers are more expensive, hence network load balancers are preferred if possible.

Network (TCP/IP)

If the request is re-routed to another IP address, there's no possibility to append information. However, it would be possible to statically tell the running instance which geo it is running in. This would at least provide the continent.

Proposal

We could extend the geo location lookup to a multi-step process, where every step can be enabled or disabled. For example:

geolocation:
  enabled: true
  # define the order in which the geo location should be determined
  lookup:
     - cloudfront-header # checks if there's a http header from cloudfront
     - maxmind           # check a geo database if available
     - static            # use a statically provided value
  # configure all the rest

This is a super rough sketch, just to convey the idea.
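
For illustration only, the ordered lookup could behave roughly like the following Python sketch. The step names mirror the config above; everything else is an assumption: `resolve_country` and the three factory functions are hypothetical, and the "maxmind" step is stood in for by a plain dict rather than any real MaxMind API. CloudFront-Viewer-Country is a header CloudFront can add, but it carries a 2-letter code, so a real implementation would still need to convert it to the alpha-3 code the issue asks for.

```python
from typing import Callable, Mapping, Optional, Sequence, Tuple

# A resolver takes (request headers, client IP) and returns a country or None.
Resolver = Callable[[Mapping[str, str], str], Optional[str]]


def from_header(header_name: str) -> Resolver:
    """Read the country from a header set by an application load balancer."""
    wanted = header_name.lower()

    def resolve(headers: Mapping[str, str], ip: str) -> Optional[str]:
        # HTTP header names are case-insensitive, so compare lowercased.
        for name, value in headers.items():
            if name.lower() == wanted and value.strip():
                return value.strip().upper()
        return None

    return resolve


def from_table(table: Mapping[str, str]) -> Resolver:
    """Stand-in for a geo database: a plain IP-to-country mapping."""
    return lambda headers, ip: table.get(ip)


def from_static(value: str) -> Resolver:
    """Always return the statically configured value for this instance."""
    return lambda headers, ip: value


def resolve_country(
    order: Sequence[str],
    resolvers: Mapping[str, Resolver],
    headers: Mapping[str, str],
    ip: str,
) -> Optional[Tuple[str, str]]:
    """Try each configured step in order; return (step name, country)."""
    for name in order:
        resolver = resolvers.get(name)
        if resolver is None:
            continue  # step listed in config but not available: skip it
        country = resolver(headers, ip)
        if country:
            return name, country
    return None
```

Returning the step name alongside the value makes it easy to add per-step metrics later, so a host can see how often the paid database lookup is actually reached.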

bretg commented 9 months ago

Interesting thought @muuki88, but I don't think this is going to be possible with DNS-based geo-balancers like Akamai's GTM... there's no 'edge' in that case for 'cloudlets' to work in or attach headers to.

Here's a counter-proposal:

Net-burst commented 8 months ago

I'd like to propose adding sampling in addition to the feature toggle, so that only a configurable percentage of requests for any given account has the geo lookup happen early. The idea is to give finer control when the host company has one dominant account that generates most of the traffic, although this is more of an operational feature.
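
A minimal sketch of what such a sampling gate might look like, in Python. The function name `should_sample` and the idea of a per-account percentage value are assumptions for illustration; the thread does not name a config key. The random source is injectable so the behaviour can be tested deterministically.

```python
import random
from typing import Callable


def should_sample(sample_pct: float, rng: Callable[[], float] = random.random) -> bool:
    """Return True for roughly sample_pct percent of calls.

    sample_pct is a hypothetical per-account setting in the range 0..100.
    rng must return a float in [0.0, 1.0), like random.random.
    """
    if sample_pct <= 0:
        return False  # sampling off: never do the early lookup
    if sample_pct >= 100:
        return True   # full rollout: always do the early lookup
    return rng() * 100.0 < sample_pct
```

The early lookup would then run only when both the feature toggle is on and `should_sample(...)` returns True for the request.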

bretg commented 7 months ago

Done with PBS-Java 2.13