patcg-individual-drafts / topics

The Topics API

How to fix inconsistent categorization on a news website #321

Closed by jj-OF 2 months ago

jj-OF commented 2 months ago

Hi everyone, I hope I'm not asking a question that has already been asked 100 times.

We observe a strange classification for our website: it is only classified as local news (245), but not as national news (243), international news (249) or sports (299).

| 245. Local news
| 245. Local news, 243. News
| 245. Local news, 249. International news, 243. News
| 245. Local news, 243. News
| 245. Local news, 243. News
| 245. Local news, 243. News
| 245. Local news, 243. News
| 243. News
| 243. News
| 245. Local news, 243. News
| 243. News
| 247. Politics, 239. Law and government, 245. Local news, 249. International news, 243. News
| 243. News
| 363. TV and video, 243. News
| 243. News
| 243. News
| 243. News
| 243. News
| 243. News
| 243. News
| 299. Sports, 243. News

The site is, however, one of the three main news sites in France (ACPM figures) and publishes a significant share of its content (at least 20 to 25%) in the international news, national news and sports sections.

Compared with how our peers are categorized, our classification seems abnormal.

Several of these peers, who mainly produce local and regional content, have at least the News (243) categorization.

From what I can see, ours is the only French press title that has only the Local News categorization, even though it is the regional daily press title that produces, by far, the most national, international and sports content.

Is this linked, as I read in some posts, to our domain name?

Is there a way to correct this erroneous classification?

Thanks in advance.

jkarlin commented 2 months ago

Hi, thanks for the report!

First, I'd like to point out that Chrome's determination of a site's topics is meant only as input to its Topics algorithm, which determines a user's interests for advertising purposes. It is not developed for other, more general classification purposes.

We're interested in the overall accuracy of our classification model and try to improve its precision and recall where possible, but at the global level rather than at the level of individual site classifications. This is because misclassification, when it does happen, does not harm the individual site that has been misclassified; rather, it reduces the quality of the Topics signal when selecting an advertisement on other sites. When selecting ads on the misclassified site, the real topics of the site are already known to that site and can be used as input to advertising queries.

For the reasons above, we do not modify our model for individual mistakes when they occur.

dmarti commented 2 months ago

Yes, the Topics API is similar to FLoC but with a cohort ID in base 629, not base 10. The individual topic names don't matter, just as knowing the individual digits of your FLoC cohort ID would not have helped you know what your cohort ID means. What matters for ad selection (and possibly other decisions) is the complete topics set (as seen by a caller) that is reported for a user. Misclassifications on the browser side end up feeding into ML operated by the caller in consistent ways.

As a user, you don't really know what your topic set cohort is telling the caller's ML about you (#221), but as the caller you can create and process inferences about cohorts, knowing that the same browser version will misclassify the same sites in the same way.
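
To make the "complete topics set as an opaque signal" idea concrete, here is a minimal sketch of how a caller might collapse the topics reported for a user into one order-independent key before feeding it to its own models. The `ObservedTopic` shape is a simplified assumption (only a topic ID and a model version are modeled), `topicSetKey` and `distinct` are hypothetical helper names, and the model version `"4"` is made up for illustration. It is not taken from any caller's actual implementation.

```typescript
// Simplified, assumed shape of one entry a caller sees; only the fields
// used below are modeled.
interface ObservedTopic {
  topic: number;        // taxonomy ID, for example 243 or 245
  modelVersion: string; // classifier version reported with the topic
}

// Return the distinct values of a list, preserving first-seen order.
function distinct<T>(values: T[]): T[] {
  return values.filter((value, index) => values.indexOf(value) === index);
}

// Hypothetical helper: build an order-independent key for the whole set.
// The topic names are never consulted; only the combination of IDs and
// the model version that produced them matters.
function topicSetKey(topics: ObservedTopic[]): string {
  const ids = distinct(topics.map((t) => t.topic)).sort((a, b) => a - b);
  const versions = distinct(topics.map((t) => t.modelVersion)).sort();
  return versions.join("+") + ":" + ids.join("-");
}

// The same set reported in a different order yields the same key.
const keyA = topicSetKey([
  { topic: 245, modelVersion: "4" }, // illustrative version string
  { topic: 243, modelVersion: "4" },
]);
const keyB = topicSetKey([
  { topic: 243, modelVersion: "4" },
  { topic: 245, modelVersion: "4" },
]);

console.log(keyA, keyA === keyB);
```

Including the model version in the key reflects the point above: a consistent misclassification is only consistent within one browser model version, so sets produced by different versions should not be assumed to mean the same thing.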