google / openrtb-doubleclick

Utilities for DoubleClick Ad Exchange, including OpenRTB mapping, DoubleClick cryptography, metadata and validation
Apache License 2.0

It is impossible to ensure matching between the Geo object and the geo_criteria_id field #7

Closed aprotsenko closed 9 years ago

aprotsenko commented 10 years ago

The OpenRTB API Specification (2.x versions) uses the Geo object to convey a user's or device's location. The following Geo object fields are usually used for geo targeting:

DoubleClick uses the geo_criteria_id field of the BidRequest message. The list of locations is described in the geo-table.csv file. It includes country and region codes that match ISO codes. However, it doesn't include city codes. Also, city names don't match the UN/LOCODE city names in some cases. Moreover, in certain cases a city's parent region doesn't match UN/LOCODE either. All of the above makes it impossible to ensure matching between the Geo object and the geo_criteria_id field.

Some examples (UN/LOCODE vs geo-table.csv):

opinali commented 10 years ago

I have investigated this; here are some findings. The region names are available; for this specific example, you will find the following hierarchy of records:

1011969,"Moscow","Moscow,Moscow,Russia","20950,2643","","RU","City"
20950,"Moscow","Moscow,Russia","2643","RU-MOW","RU","Region"
2643,"Russia","Russia","","","RU","Country"

This seems correct: you have Moscow > Moscow (RU-MOW) > Russia (RU). I suppose the confusion comes from the fact that all records have the region code field, but this field is empty for cities; it's only populated for countries and regions (and other region-like divisions like territories, districts, etc.). If you want to target by RU-MOW, you need to build an index of all these records and follow the child->parent relationships until you find the region. Notice that according to DoubleClick's docs, the list of Parent Criteria IDs is not a reliable way to do that (it's deprecated and may have errors and ambiguities); what you need to do is parse the Canonical Name column. In this case the city's canonical name is "Moscow,Moscow,Russia", so you need to discard the prefix "Moscow," that matches the Name field, and use the trailing string "Moscow,Russia" to locate the parent record. (You cannot simply split the canonical name on commas because many names have internal commas.)
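The parent lookup described above can be sketched in a few lines of Python. This is only an illustration, not the DoubleClickMetadata implementation: the column order is assumed from the three records quoted above, the function names are made up for the example, and it assumes canonical names are unique within the file.

```python
import csv
import io

# Rows copied from the records quoted above. Assumed column order:
# criteria id, name, canonical name, parent ids, region code, country, type.
SAMPLE = '''1011969,"Moscow","Moscow,Moscow,Russia","20950,2643","","RU","City"
20950,"Moscow","Moscow,Russia","2643","RU-MOW","RU","Region"
2643,"Russia","Russia","","","RU","Country"
'''

def load_by_canonical_name(text):
    """Index geo-table records by their Canonical Name column."""
    index = {}
    for row in csv.reader(io.StringIO(text)):
        criteria_id, name, canonical, _parents, region, country, kind = row
        index[canonical] = {
            "id": int(criteria_id), "name": name, "canonical": canonical,
            "region": region, "country": country, "type": kind,
        }
    return index

def parent_of(record, index):
    """Drop the leading 'Name,' from the canonical name and look up the rest."""
    prefix = record["name"] + ","
    if not record["canonical"].startswith(prefix):
        return None  # top-level record, or name and canonical name disagree
    return index.get(record["canonical"][len(prefix):])

def region_code_for(record, index):
    """Walk child -> parent until a record with a region code is found."""
    current = record
    while current is not None:
        if current["region"]:
            return current["region"]
        current = parent_of(current, index)
    return None

index = load_by_canonical_name(SAMPLE)
print(region_code_for(index["Moscow,Moscow,Russia"], index))
```

Stripping the Name prefix instead of splitting on commas is what keeps names with internal commas intact; the Parent Criteria IDs column is read but deliberately ignored.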

If this looks tedious, the DoubleClickMetadata class in the doubleclick-core library already does all that work and makes lookups easy and instantaneous, so you're in luck at least if your bidder is written in Java (if not, the code is good practical documentation of how to handle all these DoubleClick datasets).

On the city name, you are correct that the official romanization and preferred name is "Moskva", but "Moscow" is also listed in the UN/LOCODE table as an alias; here's the record in their CSV distribution:

"=","RU","","Moscow = Moskva","Moscow = Moskva","",,"",,"","",""

These records with '=' in the first field are "reference entries"; they allow defining alternative names for the same location. I'm not familiar with this standard, but looking at other examples (e.g. Lisbon = Lisboa, Bucharest = Bucuresti, etc.) it seems this is used to provide the "preferred English name" for locations whose native/romanized name is not the one most used internationally. DoubleClick's dataset always uses these preferred English names, so I think the easiest way to handle this would be to have some code that loads the full UN/CEFACT data (or at least the aliases, which are not many, only 86 records!) and maps between the English name and the primary/native name.
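A minimal sketch of that alias mapping, in Python. It assumes the layout of the single reference entry quoted above ('=' in the first field, country code in the second, "English = Native" in the fourth); the function names are hypothetical and the real UN/LOCODE files may need extra handling (encoding, diacritics).

```python
import csv
import io

# The reference entry quoted above, as it appears in the CSV distribution.
SAMPLE = '"=","RU","","Moscow = Moskva","Moscow = Moskva","",,"",,"","",""\n'

def load_aliases(text):
    """Map (country, English name) -> primary name from '=' reference entries."""
    aliases = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 4 or row[0] != "=":
            continue  # not a reference entry
        english, sep, native = row[3].partition(" = ")
        if sep:
            aliases[(row[1], english.strip())] = native.strip()
    return aliases

def to_locode_name(country, name, aliases):
    """Translate a DoubleClick city name to its UN/LOCODE name, if aliased."""
    return aliases.get((country, name), name)

aliases = load_aliases(SAMPLE)
print(to_locode_name("RU", "Moscow", aliases))
```

Names without a reference entry pass through unchanged, so the same function can be applied to every city before comparing against the UN/LOCODE table.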

And for London we have this:

1006886,"London","London,England,United Kingdom","9041106,9047013,20339,2826","","GB","City"
9041106,"Greater London","Greater London,England,United Kingdom","20339,2826","","GB","County"
9047013,"London","London TV Region,England,United Kingdom","20339,2826","","GB","TV Region"
20339,"England","England,United Kingdom","2826","GB-ENG","GB","Province"
2826,"United Kingdom","United Kingdom","","","GB","Country"

This shows another reason not to use the Parent Criteria IDs: some records have multiple sibling parents, so it's not a simple tree. But if you follow only the Canonical Names, the secondary path through the "London TV Region" above disappears, and you have London > Greater London > England (GB-ENG) > United Kingdom (GB). Now, on the different region codes: GB-LON is indeed the code for London, while GB-ENG is England. But the latter is not found in the latest ISO files; my reference is https://en.wikipedia.org/wiki/ISO_3166-2:GB which points to http://www.iso.org/iso/iso_3166-2_newsletter_ii-3_2011-12-13.pdf. Actually, the ISO files do have a GB-ENG, but it's the Englefield Green city in Surrey, England. It seems there's something wrong or obsolete; perhaps both the DoubleClick dataset and the Wikipedia article are outdated (the referenced PDF is from 2011)? This is not something that I know particularly well, so it would need some further investigation...

eugen-yakovets commented 10 years ago

It is indeed not easy to map cities and regions for GB between OpenRTB and DoubleClick. One geo point can belong to several subdivisions according to ISO_3166-2 (https://en.wikipedia.org/wiki/ISO_3166-2:GB). For instance, the city of 'Winchester' is located in the county of 'Hampshire' and the country of 'England' (https://en.wikipedia.org/wiki/Winchester). The OpenRTB specification has only one field for region, and this leaves room for different interpretations and confusion.

In the DoubleClick data there are geo criteria for:

There are no subdivisions for Northern Ireland, Scotland and Wales. So if you need to get the ISO_3166-2 region from the DoubleClick data, the only good choice is to use countries/provinces as regions. These are pretty big regions, and it is not likely that other RTB exchanges will take the same approach. So in practice it appears that you can't do one-to-one matching for GB regions. This makes city matching somewhat tricky: there could be two cities with exactly the same name, and when you don't have regions you can mix them up.

aprotsenko commented 10 years ago

@opinali thank you for your investigation. @eugen-yakovets thank you for the comment.

It is all clear with Moscow, but there are still issues with some other Russian cities. One of them is Yoshkar-Ola, which is recorded as Joshkar-Ola in the UN/LOCODE list. There are a number of such inconsistencies. Yes, they are minor and we might be able to handle them. And the “GB Issue” can probably be solved. However, we are sure there are other inconsistencies that we haven't caught yet.

For now we have decided to separate DoubleClick geo targeting from the geo targeting of other exchanges in our product. At least this decision ensures the accuracy of geo targeting.

opinali commented 10 years ago

@eugen-yakovets @aprotsenko You are correct that the DoubleClick geotargeting data is not as complete as it could be... even in the US, where the map seems to be more "organized", you can find towns with the same name in the same state, but you won't find any of that in the geo-table.csv; we just put a single town in these cases. The usual solution for this is the zipcode, but then the problem is that you can have many zipcodes per city. You can use the zipcode prefix (first three digits), which gives you regions that are bigger than cities but not exactly the same as counties, so this may or may not be useful for targeting, and it would be US-specific anyway. For really good per-city targeting you need to load the full zipcode database into your bidder, and then it's simple to find cities precisely. You can find the data for all countries at GeoNames.org; would that be a good solution? (Unfortunately we are not planning to add this to the openrtb-doubleclick project, whose scope is limited to implementing the OpenRTB & DoubleClick protocols and datasets, not external things like GeoNames.)
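As a rough illustration of the zipcode approach, here is a Python sketch. The rows are placeholders rather than real data, the tab-separated layout (country code, postal code, place name as the first three columns) is an assumption about the postal code file you would load, and the function names are invented for the example.

```python
from collections import defaultdict

# Placeholder rows, NOT real data. Assumed layout: tab-separated, with
# country code, postal code and place name as the first three columns.
SAMPLE = (
    "XX\t12345\tExampleville\n"
    "XX\t12345\tSampleton\n"
    "XX\t12399\tExampleville\n"
)

def load_postal_index(text):
    """Map (country, postal code) -> set of place names."""
    index = defaultdict(set)
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        country, postal_code, place = fields[0], fields[1], fields[2]
        index[(country, postal_code)].add(place)
    return index

def cities_for(country, postal_code, index):
    """Return every place that shares this postal code (possibly several)."""
    return index.get((country, postal_code), set())

def zip_prefix(postal_code, digits=3):
    """Coarser US-style grouping: the first three digits of the zipcode."""
    return postal_code[:digits]

index = load_postal_index(SAMPLE)
print(sorted(cities_for("XX", "12345", index)))
```

The index maps each code to a set of places rather than a single city, since one postal code can cover more than one place and one city usually has many codes.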

aprotsenko commented 10 years ago

Using postal codes is probably a solution. I'll ask our account manager whether the 'postal-code' field of BidRequest is usually filled in for Russia and other countries.

Thank you, @opinali.

eugen-yakovets commented 10 years ago

Thanks for the suggestions @aprotsenko @opinali. Yes, using postal codes for city targeting is an interesting solution.

This adds an additional level of indirection, and I can imagine that such a solution has its own issues, for instance with postal codes that belong to two or more cities or even several states.

Looks like precise targeting is going to be tough anyway. I'll check the GeoNames.org data and come back if I find another solution.

Any chance DoubleClick will send ISO-3166 geo in bid requests?

opinali commented 9 years ago

@eugen-yakovets (reviewing this after some time, I noticed your question...) No plans for geo changes like the one you suggest, I'm afraid. We're adding improvements like Hyperlocals, but the existing fields are not likely to change.