[Spec] Add a maximum taxonomy size to the spec

chrisvls commented 1 year ago

The coarseness of the taxonomy is critical to the value proposition of the Topics API. Every explainer, summary, discussion, etc. that I can find describes it as such.

As written, the spec allows implementing browser to implement a taxonomy of 2,000,000,000 entries, consisting of the 2,000,000 most popular websites and 1,000 sub-sites within each. I believe that this would be well outside the intent of the project.
As this is perhaps the key expression of the contract and compromise between advertising needs and user needs, it should be recorded in the spec.
Without this limit defined in the spec, it is not reasonable to expect users to trust and understand the privacy implications of the feature.
Without this limit defined in the spec, one can expect that it will drift upward over time. There are many considerations that will exert pressure to add more entries to the taxonomies.
Also, there are interesting questions about international and regional requirements. Without scarcity, these questions are probably not getting the vetting they need.
The spec requires exact implementations of much less important behaviors – if the random topics are returned 7 percent or 10 percent or 4 percent of the time, it would not defeat the intent of the project. But this critical requirement is omitted.

I’d assert that:

500 is probably too small
50,000 is probably too large

Also:

most explainers currently describe the taxonomy as under 2,000
if the community thinks that 10,000 or 50,000 is coarse-grained, then the explainers are not setting expectations correctly, including in the TAG discussions

So I’d propose:

In section 2, define topic id as an integer between 1 and 2,000.

Or, alternatively:

In section 2, define topic id as an integer between 1 and 5,000.

jkarlin commented 1 year ago

My opinion here is that the specification should provide the algorithms and mechanisms for another user agent to be able to implement the API, but that the parameters themselves are implementer dependent. A browser may choose its own taxonomy. It could have a taxonomy of size 1 billion, but if it only ever returns 5 of those 1 billion, then there is no privacy harm. I think there should be non-normative text describing privacy concerns, to aid other user agents in their designs, but ultimately what parameters are chosen is up to them.

martinthomson commented 1 year ago

I might agree that the classifier could be independent, but there are requirements on its use that need to be specified and agreed. For instance, I don't agree that a taxonomy can be arbitrarily large, because that is an essential parameter of the differential privacy protections in the design. Maybe you believe that it is within the rights of a browser maker to decide that privacy protection is not something their users will enjoy, but I won't agree to that and nor do I think that that is an acceptable position. If this were to ever have a hope of being standardized, some minimum threshold for privacy would be a necessary component.

Also, implementing different taxonomies could have a significant effect on how the system is used. So to avoid second-order effects on people who might end up using alternative taxonomies, it would be much better to have a single agreed and specified taxonomy. For instance, if one browser never revealed a highly lucrative topic (Finance-related, say), some sites might (automatically or inadvertently, thanks ML) discriminate against the users of that browser. In addition, given how there are certain biases inherent to taxonomy selection, if I were shipping this, I'd want some sort of oversight process with multiple stakeholders to ensure that revealing a particular value does not have consequences beyond was was intended.

chrisvls commented 1 year ago

So, I think a really important thing is that the spec match the explainer and the other assurances offered. If you think that increasing the taxonomy to a billion causes no privacy harm, please change the explainers and marketing so that you are ok living with the user expectations that you yourself are setting.

chrisvls commented 1 year ago

I didn't realize the W3C TAG discussion was active again. I have summarized some of my observations about the difference between the spec and the discussion there.

chrisvls commented 1 year ago

My comments on the TAG discussion summarize some of the reasons why I am skeptical of the claim that a taxonomy of a billion items causes no privacy harms. I think there are two more that I will put here as they are more general:

If the taxonomy is all the things promised (coarse-grained, cleansed of sensitive topics, human-curated and human-reviewable), then it is low stakes for the user if other protections fail. If the browser maker accidentally exposes the data through a breach or leak or bad implementation, if the sites figure out a way to collude, etc. There is no stronger security protection against a system leaking data than not putting the data in the system.
It is easy to explain to users why they shouldn't worry about a coarse-grained taxonomy. I can explain it in a sentence. Explaining the chain of observable-caller, epoch rotation, topic randomization, etc. would be quite difficult. Explaining it in a sentence is probably impossible.

dmarti commented 1 year ago

@chrisvls It's still hard for me to understand what kind of limits on the size of taxonomy make sense, since it's possible for a caller to pass info about the user encoded as sets containing multiple topics. You might not consider an individual topic as sensitive, but a classifier on the caller's server can use it along with other topics to identify you as a member of a protected group. https://github.com/patcg-individual-drafts/topics/issues/221

jkarlin commented 1 year ago

So, I think a really important thing is that the spec match the explainer and the other assurances offered.

Ah, gotcha. I don't think the spec is going to perfectly match the explainer in that I think we'd provide some room for implementer flexibility where possible. That said, Chrome's parameters do match the explainer, and if it would be helpful we could publish Chrome's parameters in a separate document. And note that the Topics API does return the version name of the implementation it's using, so we've tried to make it easy for you to choose what to do based on the specific implementation version if you are so inclined.

When developing the API we ran studies to understand the reidentification rates of users based on the entire set of parameters. There is quite a bit of research behind the values we've chosen. So in that sense, I can see wanting to put in some reasonable constraints around those parameters. I do want to be careful not to overconstrain though. As I said before, the taxonomy could be quite large and still meet the reidentification standards by adjusting the other privacy parameters (e.g., increase the noise).

In order to not enter reidentification analysis into the spec itself, perhaps we could settle on something fairly general and simple in practice instead. Something like, "Browsers SHOULD choose the noise probability and taxonomy size so that at least 50 people per million will report any given topic in the taxonomy on a particular site and epoch. (This corresponds to 5% noise with a taxonomy of at most 1000 topics, approximately 2-3x the size of Chrome's v1 and v2 taxonomies.)"

If you think that increasing the taxonomy to a billion causes no privacy harm, please change the explainers and marketing so that you are ok living with the user expectations that you yourself are setting.

I think there was poor communication here on my part. I was just trying to illustrate that an implementation could increase the taxonomy size and make up for it in some other way (never returning a billion minus five of the topics). I'm sure you can come up with a much more realistic example. My purpose was to show that we need to be careful not to overconstrain the specification, to allow room for innovative good ideas.

chrisvls commented 1 year ago

I’m glad we’re agreed that some additional limit is needed in the spec. The open questions are, I think, should it be normative and what form should the limit(s) take?

Am I demanding some kind of “exact match” between the spec and the explainer?

No. The spec doesn’t contain a limit. That’s an omission, not an approximation. As your own research shows, the data leakage with “integer” as the only limit is not an inexact match, it’s a lot more leakage.

Should the limits on the taxonomy be non-normative or normative?

Clearly normative, because it is central to the users’ and community’s understanding and judgment of the safety and desirability of enabling the feature. Look at your discussion with the TAG, the marketing, et al. – all rely heavily on the fact that the array of Topics contains less data and less sensitive data than third-party cookies. If you advertise “the rent is low,” then you’re going to have to put a limit on the rent in the lease. The spec is the contract, not the Chrome explainers.

Should the spec use the “50 in one million" re-identification approach and relax the other requirements?

The spec currently requires set values for epoch length, noise, number of topics reported, number of topics in the top topics, etc. An alternative would be to allow implementers to choose different values or different methods of reaching the 50 in one million standard.

This would not be a good approach. First, these other methods aren’t covered by the research supporting the proposed method. Second, this more flexible standard hasn’t been discussed and would need a lot of work to prevent mischief. You could see embedded browsers gaming the system in various ways – running with very high error rates for a period of time to fill the cohorts, then switching to a low error rate and fine-grained taxonomy, etc. Thinking through a new standard is non-trivial.

So, if we’re not going to take all of these other numbers out of the spec, then the taxonomy limit doesn’t limit innovation. It’s just algebraically derivable from the 50 in a million number.

Is the “50 in one million" re-identification standard functionally equivalent to a limit on the size of the taxonomy?

No, because there are many other privacy concerns beyond re-identification. Increasing the noise rate doesn’t ameliorate the following concerns of a big taxonomy:

Sensitivity of payload. The sensitivity of the taxonomy content may be a higher user concern than the risk of re-identification. If the taxonomy is large, it is hard for the user to understand, review, trust that it doesn’t contain sensitive data.
Transparency. It is very easy to explain the feature’s safety if the taxonomy is limited. Explaining the 1 in 20,000 re-identification risk is not practical in a dialog.
Simplicity. If the user understands that the contents of the taxonomy are non-sensitive, then the user doesn’t need to worry much about the details of how the data is shared, etc.
Security. If the contents of the taxonomy are non-sensitive, then the security risk of a breach of the browser’s security (or a centralized server controlled by the browser maker) is low.

So I think that leaves us with a simple solution: a normative limit in the spec on the size of the taxonomy. This seems to flow unavoidably from how the feature has been marketed, studied, and explained. But it is also superior to the alternatives.

chrisvls commented 1 year ago

@dmarti You're right, picking a limit is tricky. That's why this is a very important part of the spec.

In the content management and enterprise application space, I have seen many features proposed where the solution would rely on a limited taxonomy. But, the features fail because it is not possible to craft a taxonomy that all user groups can agree on (the data users want it to be simple, the reporting users want it to be too big or too complex). Or the taxonomy can be agreed, but it becomes almost immediately out-of-date. Or a limited taxonomy can be agreed in principle but the requirements of different languages, locales, business units can't converge.

By looking at whether a taxonomy limit can be agreed on now, we can determine if the whole feature will really work in practice. If we think the taxonomy limit is an unsolvable problem, then the feature will probably founder on one or more of these shores.

chrisvls commented 11 months ago

Interestingly, the Chrome UX does not, in fact, let the user review the taxonomy in its permissions UX, at least in non-GDPR areas. Oddly, it gives the user control to suppress sharing a topic, but perhaps only after Chrome has already shared that topic at least once.

Setting that choice aside, @jkarlin it is still Google's intention to limit the size of the taxonomy for Chrome? If so, any response on why it might as well limit it in the spec, given the current form of the spec and the arguments for the feature (see comments above)? If not, are there plans to update the TAG explainer, proposal, spec, etc?

patcg-individual-drafts / topics

[Spec] Add a maximum taxonomy size to the spec #229