privacycg / proposals

New proposals in the Privacy Community Group
https://privacycg.github.io

Suggested and User-Specified Hierarchical Interests (SUSHI) #27

Open RussStringham opened 2 years ago

RussStringham commented 2 years ago

This proposal builds on ideas from other proposals including PAURAQUE and Ad Topic Hints. It is centered around a hierarchical set of topics of interest that can be used in selecting relevant ads. This hierarchy might look something like the IAB's Content Taxonomy. It will work best in supporting interest-based advertising in a privacy preserving manner when combined with a proposal such as PARAKEET, which can support serving interest-based ads without allowing the site to tie those interests to the user. It might also be possible to adapt it for use by one of the TURTLEDOVE variants.

Suggested Interests

For a new user, SUSHI works similarly to PARAKEET and TURTLEDOVE in that advertisers (or their DSPs) can suggest topics that might be of interest to the user, and the browser remembers these suggestions. An important difference though is that the suggestions must be from a standardized hierarchy of ad topics. In most cases, the hierarchy will not be detailed enough to identify specific products such as a particular designer athletic shoe, but will instead be limited to the type of shoe, such as basketball shoe. This reduces the creepiness factor of having the same pair of shoes follow you around on the internet. Obviously if a particular advertiser only sells a single shoe of that style, then the shoes may still follow you.

Publishers will use the same interest hierarchy to suggest contextual topics that might be of interest to the user. However, in addition to using the topics for serving ads on the current page, the browser will remember the suggested topics to build an internal profile of the user's interests.

Over time, topics that are of particular interest to the user are likely to be suggested by multiple different websites. This will be more common at higher levels in the hierarchy. For example, different sites might suggest

In this case sports has been suggested four times, football three times and college football twice ("pros" isn't counted as occurring twice, because its two occurrences are in different subtrees of the hierarchy). With enough suggestions from enough different sites, the browser can start to identify recurring topics that are likely to be of the most interest to the user.
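
The counting rule can be sketched in a few lines. This is only an illustration: the four suggestion paths below are hypothetical, chosen so that the totals match the counts stated above, and `count_prefixes` is an assumed helper name. Each node is keyed by its full path from the root, which is why the two "pros" nodes stay separate.

```python
from collections import Counter

# Hypothetical suggestions, one hierarchy path per suggesting site.
suggestions = [
    ("sports", "football", "college"),
    ("sports", "football", "college"),
    ("sports", "football", "pros"),
    ("sports", "basketball", "pros"),
]


def count_prefixes(paths):
    """Count every node once per suggestion, keyed by its full path from the root."""
    counts = Counter()
    for path in paths:
        for depth in range(1, len(path) + 1):
            counts[path[:depth]] += 1
    return counts


counts = count_prefixes(suggestions)
```

Keying on the full path rather than the node name means `("sports", "football", "pros")` and `("sports", "basketball", "pros")` each receive a count of one.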

In some instances, interest can be surmised from recurring topics suggested by a small number of publisher sites that the user visits frequently. However, this should not be implemented for advertiser suggestions. If only a single advertiser is suggesting a topic, they don't want their competitors to be able to benefit from the suggestion, enabling the competitor to serve ads that might steal away the customer. Thus, advertiser suggested interests should remain exclusive to the advertiser until the same topic is suggested by a sufficient number of different advertisers and/or publishers.

Ranking

Browsers should keep a list of suggested topics for at least 30 days (unless cleared by the user) and use these to determine topics of interest to the user. Suggestions should only be considered for sites with which the user has interacted (so that sites cannot game the system by redirecting through a number of sites and making the same suggestion on each of them). A primary weighting factor should be the number of different sites that suggest a topic. The computation should also give more weight to recent suggestions. Frequent suggestions from a small number of sites that a user visits often can receive extra weight as well.

For example, assume that a user has visited 100 unique websites in the last 30 days, but only 10 in the last week. Of those 10, the user has visited 3 on at least 15 different days over the last month (every other day on average). Suggestions from these 3 should carry the most weight, followed by the 7 that were also visited recently. The remaining 90 should contribute less.
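
One way to express the three tiers in this example is sketched below. The thresholds (a 7-day recent window and 15 visit days) come from the example; the function name `site_tier` and the numeric values in `TIER_WEIGHT` are assumptions made only for illustration, since the proposal does not fix them.

```python
def site_tier(days_visited, today=30, recent_window=7, frequent_days=15):
    """Classify a site from the set of day indices (1..today) on which it was visited."""
    recent = any(day > today - recent_window for day in days_visited)
    if recent and len(days_visited) >= frequent_days:
        return "frequent"  # visited recently and on at least 15 different days
    if recent:
        return "recent"    # visited in the last week only
    return "older"         # visited in the last 30 days but not the last week


# Assumed multipliers; only their ordering is taken from the text.
TIER_WEIGHT = {"frequent": 3.0, "recent": 2.0, "older": 1.0}
```

A site visited every other day, for instance `site_tier(set(range(1, 31, 2)))`, falls into the top tier.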

Suggestions older than 30 days should be forgotten, except for one use case. Many interests are cyclical/seasonal, such as sports leagues or holidays. A user may show very high interest in a particular sport or team until the season ends, after which little interest is shown until the start of the next season. If the browser keeps track of previous, strong interests, it may accelerate restoration of the topic's ranking when the user starts to show interest again at the start of the next season, rather than requiring it to start over from zero.

Each advertiser and/or publisher page should only be able to contribute a limited number of suggestions. If one page suggests five topics, while another only suggests one, the single topic might get five times as much weight as any of the five. Alternatively, we may want to allow the site suggesting five topics to rank its suggestions, perhaps assigning a weight of 50% to the primary topic, 20% to a secondary topic and 10% to each of the remaining topics.
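
Both options can be sketched as a split of one unit of influence per page. The 50/20/10 shares are the ones mentioned above; the name `page_weights`, the cap of five topics, and the renormalization when a page ranks fewer than five topics are assumptions.

```python
RANKED_SHARES = [0.5, 0.2, 0.1, 0.1, 0.1]  # shares from the text, for up to five topics


def page_weights(topics, ranked=False):
    """Split one page's single unit of influence across its suggested topics."""
    topics = list(topics)[:len(RANKED_SHARES)]  # assumed cap of five suggestions
    if not topics:
        return {}
    if ranked:
        shares = RANKED_SHARES[:len(topics)]
    else:
        shares = [1.0] * len(topics)  # equal split
    total = sum(shares)  # renormalize so each page contributes exactly 1.0
    return {topic: share / total for topic, share in zip(topics, shares)}
```

With the equal split, a page suggesting one topic gives it a weight of 1.0, while a page suggesting five gives each 0.2, which is the five-to-one ratio described above.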

User-Specified Interests

The browser should display a special advertising icon near the URL/bookmarks area that the user can click on at any time that advertising is present to learn more about the ads on the current page and to see their advertising preferences. When clicked, this should show:

The user should be able to click on any topic and see the sites that have suggested that topic. They should be able to examine the topic hierarchy and manually specify any topic that they are currently interested in. They should also be able to block any topic from ever being identified as of interest to them.

Topics that are selected by the user should receive a high weighting factor. If the user has specified multiple topics of interest, those that have also received numerous suggestions could have higher weights than those with fewer suggestions. This might be accomplished by giving a boost to the weights computed for each user-specified interest. If the user-specified interest is not a leaf node in the hierarchy, then an incremental weight should also be added recursively to all its children.

Weights should be set to zero for topics that are not of interest to the user, as well as for child nodes of that topic.
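
A minimal sketch of how the boost and the block might propagate down the hierarchy follows. The tree representation (a dict from each topic to its children), the function name, and the boost size of 0.5 are assumptions; applying blocks last, so that a block wins over a boost, is also a choice made here rather than something the proposal specifies.

```python
def apply_user_preferences(weights, children, interested, blocked, boost=0.5):
    """Return new weights after applying user-specified interests and blocks."""
    result = dict(weights)

    def subtree(topic):
        # Yield the topic and all of its descendants.
        stack = [topic]
        while stack:
            node = stack.pop()
            yield node
            stack.extend(children.get(node, []))

    for topic in interested:
        for node in subtree(topic):
            result[node] = result.get(node, 0.0) + boost  # incremental weight
    for topic in blocked:
        for node in subtree(topic):
            result[node] = 0.0  # blocked topics and their children are zeroed
    return result
```

Topics that already carry a high computed weight end up above user-specified topics with few suggestions, because the boost is added to the computed weight rather than replacing it.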

A more advanced UI might allow the user to specify a duration for increased or blocked interest. For example, the user could indicate that they are always interested in the topic or that their interest is immediate and should expire within a few weeks. Similarly, after completing a big purchase, they might want to flag that they are not interested in that topic for the next 30 days (at which point all suggestions that resulted from their research/comparison shopping into this topic will have expired).

Ad Serving

When a website requests an ad using SUSHI, the browser will call PARAKEET with some subset of the suggested interests. The subset should always include all of the contextual interests suggested for the current page (PARAKEET may filter some of these out if the combination is uniquely identifying). The browser will also randomly select a few topics that have a sufficient number of suggestions, with those that have higher computed weights having a higher probability of being selected. Advertiser-specific suggestions should include the advertiser and its ad networks, so that PARAKEET can provide those suggestions only to the appropriate ad network if available.
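
The random selection step might look like the sketch below. It draws without replacement, with probability proportional to weight, from topics that meet a minimum suggestion count. The name `pick_topics` and the defaults `k=3` and `min_suggestions=5` are assumptions; the proposal only says "a few topics" and "a sufficient number of suggestions".

```python
import random


def pick_topics(weights, suggestion_counts, k=3, min_suggestions=5, rng=None):
    """Pick up to k distinct topics, favoring higher weights."""
    if rng is None:
        rng = random.Random()
    eligible = {
        topic: weight
        for topic, weight in weights.items()
        if weight > 0 and suggestion_counts.get(topic, 0) >= min_suggestions
    }
    chosen = []
    while eligible and len(chosen) < k:
        topics = list(eligible)
        pick = rng.choices(topics, weights=[eligible[t] for t in topics], k=1)[0]
        chosen.append(pick)
        del eligible[pick]  # draw without replacement
    return chosen
```

The contextual interests for the current page would be added to the result separately, since they are always included.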

The browser should include three flags for each suggested interest:

PARAKEET may decide to use these flags only to help identify attributes that should be filtered out because the combination is too unique, or it may also share them with the ad network(s). Availability of these flags might influence bids based on how the interest for the user was identified. For example, if a topic is suggested by the current context (first flag) and it's also a topic that has been suggested a lot previously for this user (third flag), then an ad related to this topic might be worth more. PARAKEET would not receive the actual weights assigned by the browser to the topics.

Because only a randomized subset of interests is shared and changes over time based on the suggestions from sites most recently visited, the set of interests cannot easily be used to uniquely identify the user. PARAKEET can further restrict unique combinations (especially of those suggested by the current page), but as the combinations will change with each call, it may need to do less filtering than it would using its current design.

The browser might exclude any interest that is blocked by the user from ever being shared, or if the topic was suggested by the current page as a contextual topic, it may allow it to still be shared (with only the first flag set), as not doing so might be identifying.

Clicks and Conversions

When an ad is served, it should include a list of topics that contributed to the ad being selected. If the user clicks on the ad, that should indicate a higher level of interest in those topics. Topics associated with the ad that were passed by the browser to PARAKEET should perhaps receive more of a boost than topics returned with the ad that weren't in this initial list. Note, however, that the enhanced suggestions should only apply to that particular advertiser, unless the topic has previously been (or in the future is) suggested by a sufficient number of unique domains.

This feature could be integrated with the conversion reporting APIs, such as Google's proposed aggregate reporting API. These aggregate reports could include not only the ads viewed from this advertiser, but also the interests that inspired those ads. When the advertiser signals the conversion to the browser, the advertiser could provide a set of interests related to item(s) purchased. For each interest, the advertiser should be able to state whether interest is likely to continue or not in that topic. For example, after purchasing a big item, the user is unlikely to purchase another for a long time and that topic's weight should be decremented so that advertiser does not continue to bid on it. If interest is expected to continue, the topic should receive a boost, making it more likely to be shared in the future (again only for this advertiser, unless it is a common interest).

After either clicks or conversions, the advertiser should be able to tell the browser to remove advertiser-specific attributes of a specified interest topic, so that they don't pay for continuing ads that they no longer feel are likely to be productive.

Hierarchy Detail

Interest Reporting across Hierarchy Levels

When the browser reports to PARAKEET that a user is interested in a particular topic that happens to be a leaf node in the hierarchy, that implies that the user is also interested in all the topics of each higher-level node up to the root. For example, someone who is interested in the 49ers is also interested in football and in sports. In fact, using the ranking algorithm described above, those more general topics are going to be at least as likely, and generally more likely, to be suggested than the 49ers topic itself. However, while higher levels in the hierarchy are more likely, advertisers will pay more for lower levels of the hierarchy, as they allow for more focused ads.

To assist in favoring reporting interests lower in the hierarchy, when sufficient interest in those lower levels has been suggested, we might take a couple of different approaches. One approach would be to simply reduce the weight of each suggestion for nodes higher up the hierarchy. For example, a suggestion for a leaf node gets a weight of 10, while its parent gets a 7 and its grandparent gets a 4. However, this has the drawback that if the user is really interested in the higher level topic, and doesn't spend a lot of time in any combination of the lower-level topics, it may take longer for the general interest to manifest itself.
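
The first approach can be sketched as a per-suggestion credit that shrinks with distance from the suggested node. The values 10, 7 and 4 are the ones from the example; the linear step of 3, the floor at zero, and the name `discounted_credit` are assumptions.

```python
def discounted_credit(path, leaf_weight=10, step=3):
    """Credit each node on a root-to-leaf path, reducing it for higher levels."""
    credits = {}
    for distance, node in enumerate(reversed(path)):
        # distance 0 is the suggested node, 1 its parent, 2 its grandparent, ...
        credits[node] = max(leaf_weight - step * distance, 0)
    return credits
```

For the path sports, football, 49ers this gives the 49ers a credit of 10, football 7 and sports 4.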

Another option would be to start the topic selection process by randomly selecting from only the top nodes in the hierarchy according to their weights. For the selected node, choose one of its child nodes with a probability of each node's weight relative to that of its siblings. Repeat down to leaf nodes. Include only the final node in the ad request to PARAKEET. For example, if there are five child nodes, which have weights of 10, 7, 3, 0 and 0, then the first node would be selected 50% of the time, the second 35% of the time and the third 15% of the time, with the last two never being selected, because the user has not shown any interest in them thus far. We could further modify this so that some percentage of the time we don't select any child node and instead report the parent. The probability of choosing the parent should increase as the combined weight of the children decreases, as this means there has thus far been little interest in the lower levels. This node selection process would be repeated 5-10 times so that the call to PARAKEET could include up to 5-10 topics (the actual number sent to PARAKEET could be fewer if the algorithm selects the same topic multiple times, which is likely for topics for which the user has shown a lot of interest).
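
A rough sketch of this top-down selection follows. The tree and weight dictionaries, the function names, and in particular the `stop_bias` formula for reporting the parent are assumptions; the proposal only requires that the chance of stopping at the parent grows as the children's combined weight shrinks.

```python
import random


def child_probabilities(child_weights):
    """Selection probability of each child relative to its siblings."""
    total = sum(child_weights.values())
    if total == 0:
        return {}
    return {c: w / total for c, w in child_weights.items() if w > 0}


def descend(children, weights, top_nodes, rng, stop_bias=0.0):
    """Walk from a weighted-random top node down to the node to report."""
    def pick(nodes):
        live = [n for n in nodes if weights.get(n, 0) > 0]
        if not live:
            return None
        return rng.choices(live, weights=[weights[n] for n in live], k=1)[0]

    node = pick(top_nodes)
    while node is not None:
        kids = children.get(node, [])
        total = sum(weights.get(kid, 0) for kid in kids)
        if total == 0:
            return node  # leaf, or no interest shown below this node
        # Assumed rule: stop at the parent more often when children are weak.
        if stop_bias > 0 and rng.random() < stop_bias / (stop_bias + total):
            return node
        node = pick(kids)
    return node
```

Calling `child_probabilities({"a": 10, "b": 7, "c": 3, "d": 0, "e": 0})` reproduces the 50%, 35% and 15% split from the example. Repeating `descend` 5-10 times and collecting the results in a set gives the deduplicated topic list for the request.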

Hierarchy Levels

The current IAB hierarchy is not detailed enough to support this proposal. It probably requires at least one more level of detail in many branches of the tree. For example, the automobile section might add another level for the make of the car, while the sport sections might add a level for specific teams and the shoe sections could specify different styles of shoes (dress, casual, sports, with an additional level below these).

I also envision a method where ad networks could support a limited number of custom values one level deeper than the public hierarchy. There would be limits on the number of unique nodes that an ad network could provide, and these nodes would not be available to other ad networks, but the higher levels of the hierarchy at least would still get a boost when these lower levels are set. If there is sufficient interest in pursuing this, I can provide more details.

Privacy and Usability Properties

Explainability

When a user wants to know why they saw a particular ad, the browser can show them the interest(s) that resulted in that ad. For each interest, they can see the sites that suggested the user might be interested in that topic and the number/percentage of times that topic was suggested by each site.

User Profile

Within the limits of PARAKEET and Fenced Frames, even if an advertiser, ad network and/or publisher collude, they should not be able to tie ad requests back to specific users (except when the user clicks on an ad). Thus, these parties should not be able to use the interests to build a profile of the user. Because the interests change for each ad request and over time, fingerprinting should not be feasible.

Example Algorithm

The complete algorithm might look something like:

for each day of the last 30 days
    for each site visited on this day
        for each suggested topic
            compute the sum of weights for the topic (including all of its child topics) for each page (of this site)
            divide result by the square root of the number of pages visited on this site
            multiply weight by sqrt(30 minus days from current day) / sqrt(30)
for each suggested topic
    compute the sum of all of the above weights
normalize each topic's value to a value between 0 and 1
for each clicked ad
    for each topic on this clicked ad
        move topic's value 10% closer to 1
for each topic
    if user has flagged topic as interesting (or any of its ancestors)
        move topic's value 50% closer to 1
    else if user has flagged topic as blocked (or any of its ancestors)
        clear topic's value
    else if topic's value is smaller than some delta
        set topic's value to delta, so that it has a non-zero chance of sometimes being selected.
for each non-zero topic
    count unique publishers recommending topic
    count unique advertisers recommending topic
    if unique advertisers is greater than 0 and less than 5 and unique publishers is less than 5 and user has not flagged topic as interesting
        restrict topic to only these advertisers and publishers
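
The pseudocode above could be translated roughly as follows. This is a sketch under several assumptions, not a reference implementation: topics are path tuples such as `("sports", "football")`, each visit record holds `days_ago` and a list of per-page `{topic: weight}` dicts, normalization divides by the largest total, and the names `score_topics`, `restricted_topics` and `is_under` are made up for the example.

```python
import math
from collections import defaultdict

WINDOW = 30  # days of history, as in the pseudocode


def is_under(topic, ancestor):
    """True if topic equals ancestor or sits below it; topics are path tuples."""
    return topic[:len(ancestor)] == ancestor


def score_topics(visits, clicked=(), interesting=(), blocked=(), delta=0.01):
    totals = defaultdict(float)
    for visit in visits:  # one record per (day, site) pair
        age, pages = visit["days_ago"], visit["pages"]
        if not 0 <= age < WINDOW:
            continue
        recency = math.sqrt(WINDOW - age) / math.sqrt(WINDOW)
        for topic in {t for page in pages for t in page}:
            # Sum the topic and everything below it across this site's pages.
            subtree = sum(w for page in pages for t, w in page.items()
                          if is_under(t, topic))
            totals[topic] += subtree / math.sqrt(len(pages)) * recency

    top = max(totals.values(), default=0.0)  # assumed normalization: divide by max
    scores = {t: (v / top if top > 0 else 0.0) for t, v in totals.items()}

    for topic in clicked:  # one entry per topic on each clicked ad
        if topic in scores:
            scores[topic] += 0.10 * (1.0 - scores[topic])

    for topic, value in scores.items():
        if any(is_under(topic, i) for i in interesting):
            scores[topic] = value + 0.50 * (1.0 - value)
        elif any(is_under(topic, b) for b in blocked):
            scores[topic] = 0.0
        elif value < delta:
            scores[topic] = delta  # keep a small chance of being selected
    return scores


def restricted_topics(scores, sources, interesting=(), threshold=5):
    """Map each restricted topic to the parties still allowed to use it.

    sources maps a topic to a (publishers, advertisers) pair of sets.
    """
    out = {}
    for topic, value in scores.items():
        publishers, advertisers = sources.get(topic, (set(), set()))
        if (value > 0 and 0 < len(advertisers) < threshold
                and len(publishers) < threshold and topic not in interesting):
            out[topic] = publishers | advertisers
    return out
```

As in the pseudocode, a topic flagged as interesting is boosted even if one of its ancestors is blocked, because the interest check comes first.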

erik-anderson commented 2 years ago

@RussStringham this looks like a topic probably better suited for discussion in the Private Advertising Technology CG. If there's still interest in taking this forward, would you like to propose it over there now that that group has been created?