I am re-opening here a bug that I initially reported through Chrome Platform Status: it remained unanswered there, and as far as I can tell the issue is still present.
Problem Description
Chrome ships with an override list: a mapping, manually annotated by Google, from domains to their corresponding topics. When the Topics API classifies a domain, it first checks that list; only if the domain is not found there is it classified by the ML model.
The override list shipped in Chrome with the 4th version of the Topics model contains 47 128 individual domain-to-topics mappings. However, 625 of them appear to be incorrectly formatted: they contain the "/" character, which is invalid according to the pre-processing rules applied to the input, and their components are flipped around that character. As a result, when the original domain is pre-processed for classification, no match is found in the override list, so the ML model is used instead and its output differs from the manual annotations. An overview and some more details are available in this blog post.
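The lookup order described above can be sketched as follows. This is only an illustration, not Chromium's code: `classify` and `stub_model` are hypothetical names, the dictionary stands in for the override list, and the stub simply returns the topic that Chrome reports for this domain in the examples further down in this report.

```python
from typing import Callable, Dict, List


def classify(
    preprocessed_domain: str,
    override_list: Dict[str, List[int]],
    model: Callable[[str], List[int]],
) -> List[int]:
    """Return the override topics if the key is present, else ask the model."""
    if preprocessed_domain in override_list:
        return override_list[preprocessed_domain]
    return model(preprocessed_domain)


# Entry as it appears in the override list (flipped around "/").
override_list = {"free fr/subscribe": [217, 365, 129, 218]}


def stub_model(_: str) -> List[int]:
    # Hypothetical stand-in for model.tflite.
    return [217]


# The correctly pre-processed domain is not a key of the override list,
# so the lookup misses and the model result is returned instead.
print(classify("subscribe free fr", override_list, stub_model))
```

With a correctly formatted key (`"subscribe free fr"`) in the dictionary, the same call would return the four manually annotated topics.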
Steps to reproduce
The steps below reproduce the issue and show the misclassifications caused by the incorrectly formatted entries in the override list of the Topics API.
Get the override_list.pb.gz shipped with model v4 (the current version of the model.tflite for the Topics API shipped in Chrome).
Visit chrome://topics-internals and, under the Classifier tab, get the path where the model is stored; the override_list.pb.gz file is in that same folder.
For archival purposes, the file can also be found here
Then check that the override list contains domains with the invalid "/" character and that these domains are flipped around that character. To do so, decompress the protobuf file, get the corresponding .proto, and decode it.
Here is how I do it; you will need these 2 scripts, the convert_pb_override.sh bash script and the convert_pb_override.py Python script:
    import argparse

    import page_topics_override_list_pb2  # module generated from the .proto

    # Create argument parser
    parser = argparse.ArgumentParser(
        prog="python3 convert_pb_override.py",
        description="Convert .pb override list to .tsv",
    )
    parser.add_argument("input_file", help="input file")
    args = parser.parse_args()

    # Load override list (ParseFromString expects the decompressed .pb bytes)
    override_list = page_topics_override_list_pb2.PageTopicsOverrideList()
    with open(args.input_file, "rb") as f:
        override_list.ParseFromString(f.read())

    # Print one row per entry: the domain, a tab, then comma-separated topic ids
    print("domain\ttopics")
    for entry in override_list.entries:
        line = "{}".format(entry.domain)
        first_topic = True
        for topic_id in entry.topics.topic_ids:
            if first_topic:
                line += "\t{}".format(topic_id)
                first_topic = False
            else:
                line += ",{}".format(topic_id)
        print(line)
And then run:
    # Decode override list to .tsv format
    ./convert_pb_override.sh override_list.pb.gz override_list.tsv
    # Extract domains (and corresponding topics) with invalid characters:
    grep ".*[^[:alpha:][:space:][:digit:]^,].*" override_list.tsv
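For readers who prefer to do the filtering in Python, here is a minimal sketch of the same check as the grep command. It assumes the .tsv layout printed by the script above (a `domain\ttopics` header, then one row per entry) and approximates the POSIX classes with ASCII ranges; `find_invalid_rows` is a hypothetical helper name, and the `example com` row is made up for illustration.

```python
import re

# Same idea as the grep above: anything other than letters, digits,
# whitespace, "^" or "," counts as an invalid character.
INVALID = re.compile(r"[^A-Za-z0-9\s^,]")


def find_invalid_rows(tsv_lines):
    """Return (domain, topics) pairs for rows containing an invalid character."""
    rows = []
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line or line == "domain\ttopics":
            continue  # skip blank lines and the header
        if INVALID.search(line):
            domain, _, topics = line.partition("\t")
            rows.append((domain, topics))
    return rows


sample = [
    "domain\ttopics\n",
    "free fr/subscribe\t217,365,129,218\n",
    "example com\t1,2\n",  # hypothetical well-formed row
]
print(find_invalid_rows(sample))
```

Calling `len(find_invalid_rows(open("override_list.tsv")))` on the decoded file gives the number of affected rows.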
The domain entries in the override list are supposed to be pre-processed the same way as the input passed to the model.tflite of the Topics API.
This means: take the FQDN, remove any "www." prefix if present, and then replace each of the characters "-", "_", ".", "+" with a space (https://source.chromium.org/chromium/chromium/src/+/main:components/browsing_topics/annotator_impl.cc;l=269).
Some examples:
candy-crush-soda-saga.web.app -> candy crush soda saga web app, and not: web app/candy crush soda saga
subscribe.free.fr -> subscribe free fr, and not: free fr/subscribe
uk.instructure.com -> uk instructure com, and not: instructure com/uk
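The pre-processing rule can be restated as a short Python sketch. It is a re-statement of the description above, not the Chromium implementation (which may differ in details not covered here), and `preprocess_domain` is a hypothetical name.

```python
def preprocess_domain(domain: str) -> str:
    """Apply the pre-processing described above to a fully qualified domain name."""
    # Remove a leading "www." prefix if present.
    if domain.startswith("www."):
        domain = domain[len("www."):]
    # Replace each of "-", "_", "." and "+" with a single space.
    return domain.translate(str.maketrans("-_.+", "    "))


for d in ["candy-crush-soda-saga.web.app", "subscribe.free.fr", "uk.instructure.com"]:
    print(d, "->", preprocess_domain(d))
```

None of the resulting strings contains a "/", which is why an override list entry containing that character can never be matched.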
As a result, when Chrome's Topics API classifies these domains, the correctly pre-processed domain has no match in the override list.
They are therefore classified by the ML model, which does not output the intended classification (Chrome's classification can be obtained from chrome://topics-internals):
candy-crush-soda-saga.web.app
  ML model: 183. Computer & video games - 186. Casual games - 1. Arts & entertainment
  Override list (entry "web app/candy crush soda saga"): 186. Casual games - 215. Internet & telecom
subscribe.free.fr
  ML model: 217. Internet service providers (ISPs)
  Override list (entry "free fr/subscribe"): 217. Internet service providers (ISPs) - 365. Movie & TV streaming - 129. Consumer electronics - 218. Phone service providers
uk.instructure.com
  ML model: 229. Colleges & universities - 227. Education
  Override list (entry "instructure com/uk"): 230. Distance learning - 234. Standardized & admissions tests - 140. Software - 227. Education - 229. Colleges & universities
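To make the "flipped around the invalid character" observation concrete, here is a small sketch that swaps the two halves of an entry around "/". It assumes that a malformed entry has the shape `<suffix>/<prefix>` with exactly one "/", which holds for the three examples in this report but which I have not checked against all 625 entries; `unflip` is a hypothetical helper, not something Chrome provides.

```python
def unflip(entry: str) -> str:
    """Swap the two halves of an override list entry around its "/" separator."""
    if entry.count("/") != 1:
        return entry  # leave well-formed or unexpected entries untouched
    suffix, prefix = entry.split("/")
    return f"{prefix} {suffix}"


for entry in ["web app/candy crush soda saga", "free fr/subscribe", "instructure com/uk"]:
    print(repr(entry), "->", repr(unflip(entry)))
```

For these three entries the result is exactly the string obtained by pre-processing the original domain, so a lookup against the un-flipped keys would match.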