mzhilyaev / sim-sites-download


Ignore bogus titles #1

Open mzhilyaev opened 10 years ago

mzhilyaev commented 10 years ago

These titles should be ignored, and the DMOZ titles should be used instead:

71 "title": "Home",
59 "title": "Access Denied",
36 "title": "406 Not Acceptable",
14 "title": "404 Not Found",
13 "title": "Google",
9 "title": "Suspicious Activity Detected",
7 "title": "Homepage",
7 "title": "Apache HTTP Server Test Page powered by CentOS",
6 "title": "Not Found",
6 "title": "Home Page",
5 "title": "Search",
5 "title": "Just a moment...",
5 "title": "Experience faster downloading and immediate viewing with an awesome Free Download Manager.",
4 "title": " Error ",
4 "title": "Welcome to Facebook - Log In, Sign Up or Learn More",
4 "title": "Request Rejected",
4 "title": "Guia del Consumidor - Madre Soltera Gana Trabajando Desde Su Casa En Sus Horas Libres $7,438 Dólares Al Mes",
4 "title": "Error",
4 "title": "Error 406 - Not Acceptable",
4 "title": "Contact Us",
4 "title": "403 - Forbidden: Access is denied.",
3 "title": "Sign In",
3 "title": "Read Manga Online",
3 "title": "ProBoards",
3 "title": "News",
3 "title": "MCSV | MailChimp",
3 "title": "Hotels.com - Cheap Hotel Deals: Hotels, Motels, Extended Stay",
3 "title": "Eventbrite - Discover Great Events or Create Your Own & Sell Tickets",
3 "title": "Error 404 (Not Found)!!1",
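A minimal sketch of the fallback described above, in Python. The names `BOGUS_TITLES`, `normalize`, and `pick_title` are hypothetical and not taken from the repository, and the blacklist holds only a few entries copied from the list. It assumes matching should ignore case and surrounding whitespace, so that " Error " and "Error" are treated alike.

```python
# Hypothetical blacklist: a handful of entries from the list above,
# stored lowercased with whitespace collapsed.
BOGUS_TITLES = {
    "home",
    "access denied",
    "406 not acceptable",
    "404 not found",
    "not found",
    "error",
    "just a moment...",
    "(no title)",
}


def normalize(title):
    """Collapse runs of whitespace and lowercase the title."""
    return " ".join(title.split()).lower()


def pick_title(crawled_title, dmoz_title):
    """Return the crawled title unless it is empty or blacklisted,
    in which case fall back to the DMOZ title."""
    norm = normalize(crawled_title or "")
    if not norm or norm in BOGUS_TITLES:
        return dmoz_title
    return crawled_title
```

An exact-match set keeps legitimate titles that merely contain a blacklisted word, such as "Home Depot", from being discarded.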

mzhilyaev commented 10 years ago

this one too:

"title": "Redirect"

mzhilyaev commented 10 years ago

"title": "An error has occurred"
"title": "Bad Request"
"title": "Landing"
2 "title": "404 Page Not Found",
2 "title": "404 - Not Found",
2 "title": "404 - File or directory not found.",
2 "title": "401 - δÊÚȨ: ÓÉÓÚƾ¾ÝÎÞЧ£¬·ÃÎʱ»¾Ü¾ø¡£",
2 "title": "503 Service Temporarily Unavailable",
3 "title": "(no title)",

Mardak commented 10 years ago

What's bogus about Hotels.com or Eventbrite? Plenty of sites have a description in their title, so I'm not sure if it's more of a question of correctly splitting the site's title from description.
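As a rough illustration of the splitting idea, here is a Python sketch. The function name `split_title` and the separator set (" - ", " | ", and ": ") are assumptions, as is the guess that the site name comes before the description. It is a heuristic and would also split error titles such as "404 - Not Found".

```python
import re

# Assumed separators between a site name and its description:
# a dash or pipe surrounded by whitespace, or a colon followed by whitespace.
_SEPARATOR = re.compile(r"\s+[-|]\s+|:\s+")


def split_title(title):
    """Split a page title into (site_name, description) at the first separator.

    The description is an empty string when no separator is found.
    """
    parts = _SEPARATOR.split(title.strip(), maxsplit=1)
    name = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    return name, description
```

With this heuristic, the Hotels.com title from the list splits into "Hotels.com" and the remaining tagline, and the Eventbrite title yields "Eventbrite" as the site name.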

mzhilyaev commented 10 years ago

Yep, my mistake. I removed those titles from the blacklist.

Mardak commented 10 years ago

abc.go.com has a title of 'Hey Facebook! Come watch ABC Home Schedule And Shows Pages'.

Where does that come from? Alexa seems to have "ABCNews.com", and the page itself is "ABC News".