patcg-individual-drafts / topics

The Topics API
https://patcg-individual-drafts.github.io/topics/

Hostnames may be private #185

Closed chrisvls closed 1 year ago

chrisvls commented 1 year ago
  1. Intranet hostnames may include private information, like project names, division names, application names.
  2. Intranet hostnames may include information that is difficult for a third party to accurately assign to topics.
  3. SaaS applications that use subdomains may include significant information in the hostname. If, for example, it were known that an M&A deal room application had a hostname of "siliconvalleybank.dealroomapp.com", it could move the stock market.
  4. For some SaaS applications, the intent or consent to share topic information may be at the instance or even user level. Some instances of a wiki might not be sensitive. Others may be.
  5. Many site terms of service would prohibit disclosure of the hostname or sufficient information to assign a topic.

How does the Topics API envision managing these aspects?

chrisvls commented 1 year ago

Similar to 3 above, it is also common for SaaS contracts to prohibit disclosure that company X is a customer of App Y except as authorized. As a result, disclosing "companyX.SaaSApplicationY.com" would violate pretty standard clauses governing this kind of confidentiality.

jkarlin commented 1 year ago

Thanks for reaching out! Responses inline:

Intranet hostnames may include private information, like project names, division names, application names. Intranet hostnames may include information that is difficult for a third-party to accurately assign to topics.

We have a few layers of protection here. First, we don't classify hostnames that resolve to private IP address space (IANA reserved address ranges), and many intranets exist in such reserved ranges. Second, I don't expect these intranet sites to serve ads and call the browsingTopics API, so they won't be included in the user's top topic calculation. Third, the taxonomy is rather coarse-grained. Fourth, we introduce noised topics, so one doesn't know which sites the user actually visited. And finally, the user may have visited any one (or several) of a number of sites about said topic. Which site a user visited is at best a probabilistic inference.
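To illustrate the first layer, here is a minimal sketch of a reserved-range check using Python's standard `ipaddress` module. It is not the browser's actual implementation: the function name `is_reserved_address` and the exact set of ranges treated as reserved are assumptions for illustration, and DNS resolution of the hostname is deliberately left out.

```python
import ipaddress


def is_reserved_address(ip: str) -> bool:
    """Return True if `ip` falls in a private or otherwise special-purpose range.

    Hypothetical helper: which ranges the browser treats as reserved is an
    assumption here, not something taken from the Topics API spec.
    """
    addr = ipaddress.ip_address(ip)
    return (
        addr.is_private        # e.g. 10.0.0.0/8, 192.168.0.0/16, fc00::/7
        or addr.is_loopback    # 127.0.0.0/8, ::1
        or addr.is_link_local  # 169.254.0.0/16, fe80::/10
        or addr.is_reserved    # IETF-reserved blocks such as 240.0.0.0/4
    )
```

Under this sketch, a hostname resolving to `10.0.0.5` would be skipped, while one resolving to a public address such as `8.8.8.8` would not. An internal app hosted on public cloud infrastructure would therefore not be filtered by this layer alone.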

SaaS applications that use subdomains may include significant information in the hostname. If, for example, it were known that an M&A deal room application had a hostname of "siliconvalleybank.dealroomapp.com", it could move the stock market.

Similar to above. The topics provided by the taxonomy are very coarse-grained. The Topics API currently classifies siliconvalleybank.dealroomapp.com as "149. Finance". That doesn't make it clear whether the topic came from the eTLD+1 or from the subdomain, and certainly not which particular bank it was.

For some SaaS applications, the intent or consent to share topic information may be at the instance or even user level. Some instances of a wiki might not be sensitive. Others may be. Many site terms of service would prohibit disclosure of the hostname or sufficient information to assign a topic.

Those apps/sites can disable topics (e.g., via the Permissions Policy API) on instances where consent is not given or disclosure is prohibited. When calculating the next set of topics for the user, the API only considers hostnames from pages where all of the following hold: the API is called, the permissions policy grants the call, the IP address is not reserved, the user is not in incognito mode, the user hasn't disabled the permission, etc.
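The conditions above can be sketched as a single predicate. This is an illustrative model only: `PageVisit`, its field names, and the two functions are hypothetical and are not taken from the browser's code. The header value shown is the opt-out I understand a site would send to disable topics for its pages.

```python
from dataclasses import dataclass

# Response header a site could send to opt its pages out of topics.
OPT_OUT_HEADER = ("Permissions-Policy", "browsing-topics=()")


@dataclass(frozen=True)
class PageVisit:
    """Hypothetical record of one page visit, for illustration only."""
    hostname: str
    api_called: bool            # the page called the browsingTopics API
    policy_allows_topics: bool  # permissions policy grants the call
    ip_is_reserved: bool        # hostname resolved to a reserved range
    incognito: bool
    user_disabled_topics: bool


def counts_toward_topics(v: PageVisit) -> bool:
    """A visit is considered only if every listed condition holds."""
    return (
        v.api_called
        and v.policy_allows_topics
        and not v.ip_is_reserved
        and not v.incognito
        and not v.user_disabled_topics
    )


def eligible_hostnames(visits: list[PageVisit]) -> list[str]:
    """Hostnames that would feed the next topics calculation, sorted."""
    return sorted({v.hostname for v in visits if counts_toward_topics(v)})
```

In this model, a sensitive wiki instance that sends the opt-out header has `policy_allows_topics=False`, so its hostname never enters the calculation. The comment's trailing "etc." means this predicate is not exhaustive.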

chrisvls commented 1 year ago

Thank you for your thoughtful response. I had misunderstood a very basic aspect -- that the classification occurs within the browser not by a service running elsewhere. (I think I saw the "public" and "by a partner" and jumped to conclusions.)

I will add, though, that there seems to be an argument that the default permissions policy should be to deny extraction and sharing of that data during the experimentation phase.

This leads more broadly to a comment that I'll make elsewhere on the spec (when I'm more confident I haven't missed basic points, like the above ;) )... the safety of this API relies very heavily on the coarseness of the topic taxonomy and relies somewhat on the coarseness of the topics calculation input data.

But the spec makes no promise that the taxonomy will remain coarse. Indeed, it highlights that accepting the spec entails accepting changes to the taxonomy. And the spec explicitly states that the input data could include all text in the document.

chrisvls commented 1 year ago

Oh, I would also add that lots of internal apps and intranets are hosted on public cloud infrastructure. We may not be in a zero-trust world, but we are definitely in a world where lots of what-used-to-be internal-only apps are accessible from anywhere and not IP-blocked.