dmarti opened this issue 1 year ago
I believe several points raised during the TAG design review still stand.[1][2]
…American Football, etc.) in user preferences. I propose the following:
This issue seems primarily focused on a wish for some kind of "ML explainability" — that is, for a person to be able to understand what decisions are being made by ML models — and then, building on that, for a person to be able to influence the signals that act as inputs to those ML models.
Those both seem like fascinating policy proposals. But neither one of them particularly has anything to do with the Topics API. Suppose an ML model has 1000 input signals today, and use of the Topics API means the model would have 1001. I don't see how anything specific to Topics would be helpful here.
The broader goals would surely be areas for regulation or policy-making about ML in general, which would then naturally apply to all 1001 inputs, Topics included.
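To make the 1000-vs-1001 framing concrete, here is a minimal sketch, using synthetic data and scikit-learn, of the kind of ablation test that would answer "does this one added signal matter?" The `topic` column is a hypothetical stand-in, not real Topics output:

```python
# Minimal ablation sketch: does adding one signal to 1000 existing
# inputs measurably change a model? All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000
X_base = rng.normal(size=(n, 1000))           # 1000 existing input signals
y = (X_base[:, :5].sum(axis=1) + rng.normal(size=n) > 0).astype(int)
topic = y * 0.5 + rng.normal(size=n)          # hypothetical 1001st signal,
X_full = np.column_stack([X_base, topic])     # weakly correlated with the label

for X in (X_base, X_full):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{X.shape[1]} inputs: AUC = {auc:.3f}")
```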
Separately from all that, Don writes:
The absence of a common topic may be a stronger signal than its presence for inferring some data points.
The Topics selection algorithm, especially the combination of random selection and per-caller filtering, makes this extraordinarily unlikely.
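For readers who haven't seen those two mechanisms, here is a rough, non-normative simulation of them, based on my reading of the explainer (taxonomy size and epoch handling are simplified):

```python
# Rough simulation (a sketch, not normative) of the two mechanisms named
# above: ~5% random-topic noise and per-caller filtering.
import random

TAXONOMY = list(range(350))   # stand-in for the topic taxonomy

def topic_for_epoch(top5, rng):
    # With ~5% probability, a uniformly random topic is substituted;
    # otherwise one of the user's top 5 topics for that epoch is chosen.
    if rng.random() < 0.05:
        return rng.choice(TAXONOMY)
    return rng.choice(top5)

def topics_for_caller(per_epoch_topics, observed_by_caller):
    # Per-caller filtering: a caller only receives a topic it has
    # already observed for this user in a recent epoch.
    return [t for t in per_epoch_topics if t in observed_by_caller]

rng = random.Random(0)
top5 = [3, 17, 42, 99, 123]
epochs = [topic_for_epoch(top5, rng) for _ in range(3)]
print(topics_for_caller(epochs, observed_by_caller={17, 42}))
```

Note how many distinct causes an absent topic can have: it was never in the user's top five, it was displaced by the random draw, or this particular caller never observed it. That is what makes absence such a noisy signal.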
I'm having trouble following this.
Topics API provides enough additional information about membership in a group of people that is not legally protected to motivate a legitimate advertiser to choose to use it (the number of topics is much smaller than the number of targetable audiences, so advertisers are likely using inferred group membership for targeting, not simply the raw topic).
I don't understand how the math would work out here. Is Topics API one insignificant input among 1001, or is it meaningful? Please reopen.
I certainly hope that Topics is a useful signal to an ML model. If it's not, then we haven't produced something useful for the web.
But the question "Is this input useful?" is completely different from the issue you raised above: "Can the party who starts using Topics in their ML model provide the user with an explanation of the way in which that signal affects each model output?" Absent a broader context of ML explainability, the gulf between these two questions is enormous, and bridging it is absolutely not in scope for this work.
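To illustrate the size of that gulf: for a plain linear model, the per-output explanation being asked for is at least well defined, since each input contributes `coefficient × value` in log-odds, as in this synthetic sketch. For the non-linear models actually used in ad systems, even this simple decomposition does not exist.

```python
# Sketch of what a per-output "explanation" would require for the easiest
# possible case, a linear model. Synthetic data; my illustration, not
# anything the Topics API itself provides.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 6))
true_w = np.array([1.5, 0.0, 0.0, -1.0, 0.0, 0.2])
y = (X @ true_w + rng.normal(size=2000) > 0).astype(int)
model = LogisticRegression().fit(X, y)

x = X[0]                              # one prediction to "explain"
contrib = model.coef_[0] * x          # per-signal contribution, log-odds
for i, c in enumerate(contrib):
    print(f"signal {i}: {c:+.2f}")
```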
As far as scope goes, I'm looking at https://github.com/patcg-individual-drafts/topics#meeting-the-privacy-goals:
- Users should be able to understand the API, recognize what is being said about them, know when it’s in use, and be able to enable or disable it.
For purposes of some (most?) jurisdictions, what's being said about you includes inferred data points (see the Colorado rulemaking and resulting regs for a good example). And in order to get consent to call Topics API in consent-based jurisdictions, the caller is going to need to be able to disclose how the information obtained will be processed.
I know that this is a complicated project with a lot of unknowns, but it seems like getting that existing item 4 handled well enough for a caller to implement Topics API in a compliant way would be in scope for a generally usable version. Might it be possible to expand the Privacy Sandbox Market Testing Grant to include researchers with access to first-party data and expertise in housing and employment discrimination?
For purposes of some (most?) jurisdictions, what's being said about you includes inferred data points
I'm not a lawyer or legislator, I'm a browser engineer. From that point of view, it seems like you are trying to conflate "what party A says" with "what party B infers based on what party A says". These two seem fundamentally different, to me, because one is a reasonable question to ask of party A and the other of party B.
In the most common uses of Topics API, Google is both A and B. Google A (browser code) encodes the user's group membership as a set of topics and passes this encoded form of the information to Google B (server side), which decodes it.
In this type of situation, A and B together could hypothetically provide a service that implements full-stack illegal discrimination. If the company is organized in a sufficiently complex way, it could operationalize a two-room version of the Chinese Room thought experiment: the browser side sits in one room, following rules that assign and pass topics without any browser developer ever understanding which groups are encoded by which topics, while the advertising side sits in a separate room, decoding topics so as to maximize results for advertisers, some of whom discriminate in illegal ways.
Neither A nor B intends to implement illegal discrimination, but by cleverly partitioning the two rooms the company as a whole could offer illegal discrimination to the advertisers who want it without exposing records of it to users or developers.
(edited to express as a hypothetical situation)
User story summary: as a person looking for a job or for rental housing, I want to receive ads for all jobs and housing units for which I am a qualified employee or tenant, even though I am a member of a group of people against which some employers or landlords discriminate.
A user who is concerned about leaking sensitive information can already choose not to share specific topics. However, topics are being fed into a machine learning (ML) system, which then infers a variety of data points about the user. The absence of a common topic may be a stronger signal than its presence for inferring some data points. (Some examples might include a religious group that avoids certain foods or beverages, or a person with a disability that limits their ability to do certain activities.)
A user can already control the set of topics being passed. In addition, the user needs enough information to be able to meaningfully decide what topics to share, including any inferred data points that are applied to them based on presence or absence of topics. (Simply blocking Topics API entirely is probably inadequate, since ML systems will "learn" to avoid users with Topics API turned off when placing some or all employment or housing ads.)
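A synthetic sketch of the absence-as-signal point (my construction, not data from any real system): define a group by the rarity of one otherwise common topic, and a classifier recovers membership through a large negative weight on that topic.

```python
# Synthetic demonstration that a model can infer group membership from
# the *absence* of a common topic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, n_topics = 5000, 20
group = rng.random(n) < 0.2                    # hypothetical sensitive group
present = rng.random((n, n_topics)) < 0.5      # topic presence/absence
present[group, 0] = rng.random(group.sum()) < 0.05  # group rarely has topic 0

model = LogisticRegression().fit(present.astype(int), group.astype(int))
print("weight on topic 0:", round(model.coef_[0][0], 2))
# Strongly negative: NOT having topic 0 raises the predicted
# probability of group membership.
```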
Open to ideas for possible implementations of how the caller could pass this information back to the user.
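One hypothetical shape for such an implementation, with all names below invented for illustration: the caller publishes a machine-readable mapping from the topics it receives to the inferences it will draw from them, and surfaces that mapping to the user alongside the consent prompt.

```python
# Hypothetical caller-side transparency sketch. INFERENCES and disclose()
# are invented names, not part of any real API.
INFERENCES = {
    "News & Politics": ["likely interest in current events"],
    "Fitness": ["likely interest in athletic apparel"],
}

def disclose(received_topics: list[str]) -> list[str]:
    """Return the human-readable inferences this caller would apply,
    so they can be shown to the user before any processing happens."""
    notes = []
    for topic in received_topics:
        notes.extend(INFERENCES.get(topic, []))
    return notes

print(disclose(["Fitness", "Travel"]))
```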