Inquiry Regarding Use of Topics API Model for HTTP Archive

patcg-individual-drafts / topics

The Topics API

https://patcg-individual-drafts.github.io/topics/

Inquiry Regarding Use of Topics API Model for HTTP Archive #305

Open nrllh opened 5 months ago

nrllh commented 5 months ago

Hello,

I am Nurullah from HTTP Archive, and we are planning to use the Topics API model to categorize webpages for the 2024 Web Almanac project.

Our goal is to use the Topics API model to determine the categories of the CrUX origins in HTTP Archive. We intend to classify the origins in a manner similar to the approach discussed here. The results of this classification will be stored and made publicly available in BigQuery, primarily for use by the Web Almanac analysts.

Before proceeding, we want to ensure that this use case does not violate any terms of use or raise other concerns regarding the Topics API. Could you provide guidance or confirm whether there are any potential issues with utilizing the Topics API in this manner?

Appreciate your support on this matter.

leeronisrael commented 5 months ago

Hi Nurullah - I'm looking into this and will get back to you soon.

leeronisrael commented 4 months ago

As you know, the Topics API classification model is shipped alongside the Chrome browser in order to facilitate the on-device generation of topics. All the code to use the model is within the Chromium source tree, which is subject to the Chromium open source license. There is no technical barrier to any party utilizing the model for purposes beyond the Topics API. In production, Chrome uses an override list in order to improve performance; this list does not exist in the Chromium source tree.

nrllh commented 4 months ago

Thank you, Leeron! That sounds cool. I will share our data with you as well once we are done.

nrllh commented 2 months ago

Thank you once again, @leeronisrael. We processed all the URLs in our dataset and made the results open source. Check the documentation: https://har.fyi/reference/functions/get_host_categories/

There have been some discussions on the accuracy of the model, but I couldn't find any related stats. Do you have any statistics on this that you can provide?