patcg-individual-drafts / topics

The Topics API
https://patcg-individual-drafts.github.io/topics/

How can the caller make inferred user data available to the user, to inform meaningful topics selection? #221

Open · dmarti opened this issue 1 year ago

dmarti commented 1 year ago

User story summary: as a person looking for a job or for rental housing, I want to receive ads for all jobs and housing units for which I am a qualified employee or tenant, even though I am a member of a group of people against which some employers or landlords discriminate.

A user who is concerned about leaking sensitive information can already choose not to share specific topics. However, topics are being fed into a machine learning (ML) system, which then infers a variety of data points about the user. The absence of a common topic may be a stronger signal than its presence for inferring some data points. (Some examples might include a religious group that avoids certain foods or beverages, or a person with a disability that limits their ability to do certain activities.)
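To make the point about absence concrete, here is a minimal Bayes-rule sketch. All of the numbers are invented; nothing here comes from the API or any real model.

```ts
// Invented numbers for illustration: how the *absence* of a topic can
// shift an inferred probability under Bayes' rule.
const pGroup = 0.10;              // prior: share of users in the group
const pTopicGivenGroup = 0.02;    // group members rarely generate this topic
const pTopicGivenNotGroup = 0.40; // everyone else generates it often

const pAbsentGivenGroup = 1 - pTopicGivenGroup;       // 0.98
const pAbsentGivenNotGroup = 1 - pTopicGivenNotGroup; // 0.60

// P(group | topic absent) via Bayes' rule
const pAbsent =
  pAbsentGivenGroup * pGroup + pAbsentGivenNotGroup * (1 - pGroup);
const pGroupGivenAbsent = (pAbsentGivenGroup * pGroup) / pAbsent;

console.log(pGroupGivenAbsent.toFixed(3)); // "0.154" -- up from the 0.10 prior
```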

A user can already control the set of topics being passed. In addition, the user needs enough information to be able to meaningfully decide what topics to share, including any inferred data points that are applied to them based on presence or absence of topics. (Simply blocking Topics API entirely is probably inadequate, since ML systems will "learn" to avoid users with Topics API turned off when placing some or all employment or housing ads.)

Open to ideas for possible implementations for how to pass this information back to the user from the caller.

amicus-veritatis commented 1 year ago

I believe several points raised during the TAG design review still stand.[1][2]

Proposal

I propose the following:

  1. Users should be able to enable or disable specific individual topics in a user-friendly manner.
    • Related: #78
  2. Ideally, users should be able to set such topics through a browser extension, a JSON-based configuration, or any other community-focused solution (a sketch of what that might look like follows this list).
    • While it was noted that per-page topic manipulation is not ideal from a data-integrity perspective, permitting a browser extension, where applicable, to set such an allowlist or blocklist, and letting users share those lists, could be a viable solution.
  3. Technically speaking, I doubt it is viable to just "pass" the inferred data back to the user. Encouraging the caller to publish a transparency report is the best a specification can do, imo. Regulatory approaches might be more suitable here, but I believe they are beyond the scope of this discussion.
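As a sketch of what point 2 could look like in practice (the shape and field names below are invented; nothing like this exists in the spec today):

```ts
// Hypothetical per-user topic policy that a browser setting or extension
// could consume. The shape and names are invented for illustration.
interface TopicPolicy {
  mode: "allowlist" | "blocklist";
  topicIds: number[]; // numeric IDs from the Topics taxonomy
}

const examplePolicy: TopicPolicy = {
  mode: "blocklist",
  // Illustrative IDs only; a real policy would reference the published taxonomy.
  topicIds: [57, 289],
};

// A conforming browser would drop blocked topics before they are ever
// observable by any caller.
function filterTopics(candidates: number[], policy: TopicPolicy): number[] {
  const ids = new Set(policy.topicIds);
  return policy.mode === "blocklist"
    ? candidates.filter((t) => !ids.has(t))
    : candidates.filter((t) => ids.has(t));
}
```
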
michaelkleber commented 1 year ago

This issue seems primarily focused on a wish for some kind of "ML explainability" — that is, for a person to be able to understand what decisions are being made by ML models — and then, building on that, for a person to be able to influence the signals that act as inputs to those ML models.

Those both seem like fascinating policy proposals. But neither one of them particularly has anything to do with the Topics API. Suppose an ML model has 1000 input signals today, and use of the Topics API means the model would have 1001. I don't see how anything specific to Topics would be helpful here.

The broader goals would surely be areas for regulation or policy-making about ML in general, which would then naturally apply to all 1001 inputs, Topics included.

Separately from all that, Don writes

The absence of a common topic may be a stronger signal than its presence for inferring some data points.

The Topics selection algorithm, especially the combination of random selection and per-caller filtering, makes this extraordinarily unlikely.
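
A simplified model of why (this is a sketch, not the normative algorithm; the 5% random-topic rate is from the explainer, everything else is illustrative):

```ts
// Simplified sketch of per-epoch topic selection -- not the spec's
// normative algorithm. The 5% random-topic rate comes from the explainer;
// the taxonomy size reflects the ~350 topics of the initial taxonomy.
const TAXONOMY_SIZE = 349;
const RANDOM_TOPIC_RATE = 0.05;

function topicForEpoch(
  top5: number[],                // the user's top 5 topics this epoch
  observedByCaller: Set<number>, // topics this caller has itself observed
): number | null {
  const topic =
    Math.random() < RANDOM_TOPIC_RATE
      ? 1 + Math.floor(Math.random() * TAXONOMY_SIZE)  // uniform over taxonomy
      : top5[Math.floor(Math.random() * top5.length)]; // one of the top 5
  // Per-caller filtering: a topic this caller never observed is withheld.
  // So a "missing" topic can mean filtered, an unlucky draw, or simply
  // not top-5 this epoch -- not that the user avoids the topic.
  return observedByCaller.has(topic) ? topic : null;
}
```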

dmarti commented 1 year ago

I'm having trouble following this.

I don't understand how the math works out here. Is the Topics API one insignificant input among 1001, or is it a meaningful signal? Please reopen.

michaelkleber commented 1 year ago

I certainly hope that Topics is a useful signal to an ML model. If it's not, then we haven't produced something useful for the web.

But the question "Is this input useful?" is completely different from the issue you raised above, "Can the party who starts using Topics in their ML model provide the user with an explanation of the way in which that signal affects each model output?" Absent a broader context of ML explainability, the gulf between these two questions is enormous, and absolutely not in scope for this work.

dmarti commented 1 year ago

As far as scope goes, I'm looking at https://github.com/patcg-individual-drafts/topics#meeting-the-privacy-goals:

  4. Users should be able to understand the API, recognize what is being said about them, know when it’s in use, and be able to enable or disable it.

For purposes of some (most?) jurisdictions, what's being said about you includes inferred data points (see the Colorado rulemaking and the resulting regulations for a good example). And in order to get consent to call the Topics API in consent-based jurisdictions, the caller is going to need to be able to disclose how the information obtained will be processed.

I know that this is a complicated project with a lot of unknowns, but it seems like getting that existing item 4 handled well enough for a caller to implement the Topics API in a compliant way would be in scope for a generally usable version. It might be possible to expand the Privacy Sandbox Market Testing Grant program to include researchers with access to first-party data and expertise in housing and employment discrimination?

michaelkleber commented 1 year ago

For purposes of some (most?) jurisdictions, what's being said about you includes inferred data points

I'm not a lawyer or legislator, I'm a browser engineer. From that point of view, it seems like you are trying to conflate "what party A says" with "what party B infers based on what party A says". These two seem fundamentally different, to me, because one is a reasonable question to ask of party A and the other of party B.

dmarti commented 1 year ago

In the most common uses of the Topics API, Google is both A and B. Google A (browser code) encodes the user's group membership as a set of topics and passes this encoded form of the information to Google B (server-side), which decodes it.
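
For reference, the caller-side half of that flow looks roughly like the sketch below. document.browsingTopics() is the real entry point; the reporting endpoint and payload shape are invented for illustration.

```ts
// Caller-side sketch. document.browsingTopics() is the real API entry
// point; the reporting endpoint and payload are invented here.
declare global {
  interface Document {
    browsingTopics?: () => Promise<Array<{ topic: number }>>;
  }
}

async function reportTopics(): Promise<void> {
  if (!document.browsingTopics) return; // API not available in this browser
  const topics = await document.browsingTopics(); // e.g. [{ topic: 57 }, ...]
  await fetch("https://adtech.example/topics", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(topics.map((t) => t.topic)),
  });
}

export {}; // module scope, so `declare global` applies
```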

In this type of situation, A and B together could hypothetically provide a service that implements full-stack illegal discrimination. If the company is organized in a sufficiently complex way, it could operationalize a two-room version of the Chinese Room thought experiment. The browser side would be inside one room, following rules that result in assigning and passing topics without any browser developer ever understanding which groups are encoded by which topics, while the advertising side would be in a separate room, decoding topics in such a way as to maximize results for advertisers, some of whom discriminate illegally.

Neither A nor B intends to implement illegal discrimination, but by cleverly partitioning the two rooms the company as a whole could offer illegal discrimination to the advertisers who want it without exposing records of it to users or developers.

(edited to express as a hypothetical situation)