themains / dmoz_csv

Convert DMOZ content.rdf.u8.gz into a CSV file

domain level classification #1

Closed: soodoku closed 7 years ago

soodoku commented 8 years ago

When producing domain-level categories, ignore URLs of the form http://domain/path/ and keep only those of the form http://domain

For instance,

http://www.standaard.be/Artikel/Detail.aspx?artikelId=DMF02092008_138

yields Category: World / Nederlands / Computers / Software / Internet / Browsers / Google Chrome, which is the category of the article, not of the domain.

To get the category of the domain, look up the entry for http://www.standaard.be/ itself, which gives the right category: Category: World / Nederlands / Nieuws en Media / Dag- en Nieuwsbladen / België

We already do this sensible thing for subdomains, but not for domains.
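
A rough sketch of the filtering described above, in Python. It is not the repository's actual code: the function names `is_domain_root` and `domain_level_categories` and the (url, category) row shape are assumptions made for illustration. It keeps a category only when the URL has no path, query, or fragment beyond the root.

```python
from urllib.parse import urlsplit


def is_domain_root(url):
    """Return True when the URL points at the site root (hypothetical helper)."""
    parts = urlsplit(url)
    # A root URL has a host, an empty or "/" path, and no query or fragment.
    return (
        bool(parts.netloc)
        and parts.path in ("", "/")
        and not parts.query
        and not parts.fragment
    )


def domain_level_categories(rows):
    """Map each domain to the category of its root URL.

    `rows` is assumed to be an iterable of (url, category) pairs.
    URLs with a path, such as article pages, are skipped.
    """
    categories = {}
    for url, category in rows:
        if is_domain_root(url):
            # setdefault keeps the first category seen for a domain.
            categories.setdefault(urlsplit(url).netloc.lower(), category)
    return categories


rows = [
    (
        "http://www.standaard.be/Artikel/Detail.aspx?artikelId=DMF02092008_138",
        "World / Nederlands / Computers / Software / Internet / Browsers / Google Chrome",
    ),
    (
        "http://www.standaard.be/",
        "World / Nederlands / Nieuws en Media / Dag- en Nieuwsbladen / België",
    ),
]
print(domain_level_categories(rows))
```

With the two example URLs from this issue, only the root entry survives, so www.standaard.be is mapped to the news category rather than the article's category.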