mozilla / page-metadata-service

DEPRECATED - A RESTful service that returns the metadata about a given URL.
Mozilla Public License 2.0

audit top 20-50 content sites for lead image, title, description on metadata service server #84

Closed edwindotcom closed 2 years ago

edwindotcom commented 7 years ago

As we move to showing more highlights, we should make sure that top sites (Wikipedia, etc.) have proper metadata on the Fathom metadata service.

pdehaan commented 7 years ago

Related: https://github.com/pdehaan/get-alexa-top-sites :poop::fire:

Not sure how super helpful that is, since it's just the homepage of the top 1m domains, versus scraping an inner page of those sites. But may be a decent-enough starting point.

pdehaan commented 7 years ago

OK, batching this isn't as easy as I expected. After prepending "https://" to each of the top sites and grabbing the first 20 (which is our per-request proxy limit), I'm just getting HTTP/504 GATEWAY_TIMEOUT errors; even if I drop it down to 10 requests.

$ time http https://metadata.dev.mozaws.net/v1/metadata urls:=@topsites/top-20.json -j -v --follow --timeout=300

POST /v1/metadata HTTP/1.1
Accept: application/json
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 233
Content-Type: application/json; charset=utf-8
Host: metadata.dev.mozaws.net
User-Agent: HTTPie/0.9.1

{
    "urls": [
        "https://google.com",
        "https://youtube.com",
        "https://facebook.com",
        "https://baidu.com",
        "https://yahoo.com",
        "https://amazon.com",
        "https://wikipedia.org",
        "https://qq.com",
        "https://twitter.com",
        "https://google.co.in"
    ]
}

HTTP/1.1 504 GATEWAY_TIMEOUT
Connection: keep-alive
Content-Length: 0

http https://metadata.dev.mozaws.net/v1/metadata urls:=@topsites/top-20.json  0.16s user 0.07s system 0% cpu 1:00.04 total

Looks like the proxy times the request out after 60s, which doesn't seem to be enough time to process 10 incoming requests in a batch.
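
One possible way to stay under the 60s proxy timeout is to send several smaller batches instead of one large one. The following is a minimal sketch, not something the service itself provides: `BATCH_SIZE`, `chunk`, and `fetchMetadata` are hypothetical names, the error strings written into `url_errors` are an assumed convention, and the endpoint is the dev host from the transcript above (which may no longer exist, since the service is deprecated). It relies on the global `fetch()` in Node 18+.

```javascript
// Sketch only: split a URL list into small batches so each POST has a
// better chance of finishing inside the proxy's 60s window.
// ENDPOINT comes from the transcript above; BATCH_SIZE is an assumed value.
const ENDPOINT = 'https://metadata.dev.mozaws.net/v1/metadata';
const BATCH_SIZE = 5;

// Splits `list` into consecutive slices of at most `size` items.
function chunk(list, size) {
  const batches = [];
  for (let i = 0; i < list.length; i += size) {
    batches.push(list.slice(i, i + size));
  }
  return batches;
}

// Sends the batches one after another and merges the per-URL results.
async function fetchMetadata(urls, size = BATCH_SIZE) {
  const merged = { urls: {}, url_errors: {} };
  for (const batch of chunk(urls, size)) {
    const res = await fetch(ENDPOINT, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ urls: batch })
    });
    if (!res.ok) {
      // For example a 504 from the proxy: note the failure and keep going.
      // Storing a plain string here is this sketch's own convention.
      batch.forEach((url) => { merged.url_errors[url] = `HTTP ${res.status}`; });
      continue;
    }
    const data = await res.json();
    Object.assign(merged.urls, data.urls);
    Object.assign(merged.url_errors, data.url_errors);
  }
  return merged;
}
```

Sending the batches sequentially keeps the load on the server low; whether a smaller batch actually finishes in time still depends on how slow the individual sites are.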

pdehaan commented 7 years ago

OK, some success...

Click-o to expand-o:

``` sh
$ time http https://metadata.dev.mozaws.net/v1/metadata urls:=@topsites/top-20.json -j -v --follow --timeout=300

POST /v1/metadata HTTP/1.1
Accept: application/json
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 434
Content-Type: application/json; charset=utf-8
Host: metadata.dev.mozaws.net
User-Agent: HTTPie/0.9.1

{
    "urls": [
        "http://google.com",
        "http://youtube.com",
        "http://facebook.com",
        "http://baidu.com",
        "http://yahoo.com",
        "http://amazon.com",
        "http://wikipedia.org",
        "http://qq.com",
        "http://twitter.com",
        "http://google.co.in",
        "http://live.com",
        "http://taobao.com",
        "http://google.co.jp",
        "http://bing.com",
        "http://sina.com.cn",
        "http://instagram.com",
        "http://linkedin.com",
        "http://weibo.com",
        "http://yahoo.co.jp",
        "http://msn.com"
    ]
}

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 7955
Content-Type: application/json; charset=utf-8
Date: Mon, 29 Aug 2016 20:49:41 GMT
ETag: W/"1f13-tByuNopIgKBKMEFO/byeYA"
Server: nginx/1.9.9
Strict-Transport-Security: max-age=31536000

{
    "request_error": "",
    "url_errors": {},
    "urls": {
        "http://amazon.com": {
            "description": "Online shopping from the earth's biggest selection of books, magazines, music, DVDs, videos, electronics, computers, software, apparel & accessories, shoes, jewelry, tools & hardware, housewares, furniture, sporting goods, beauty & personal care, broadband & dsl, gourmet food & just about anything else.",
            "favicon_url": "http://amazon.com/favicon.ico",
            "images": [
                {
                    "entropy": 1,
                    "height": 500,
                    "url": "http://g-ec2.images-amazon.com/images/G/01/social/api-share/amazon_logo_500500._V323939215_.png",
                    "width": 500
                }
            ],
            "original_url": "http://amazon.com",
            "title": "Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more",
            "url": "http://amazon.com"
        },
        "http://baidu.com": {
            "favicon_url": "http://baidu.com/favicon.ico",
            "images": [],
            "original_url": "http://baidu.com",
            "url": "http://baidu.com"
        },
        "http://bing.com": {
            "description": "Bing helps you turn information into action, making it faster and easier to go from searching to doing.",
            "favicon_url": "http://bing.com/fd/s/a/hp/bing.svg",
            "images": [],
            "original_url": "http://bing.com",
            "title": "Bing",
            "url": "http://bing.com"
        },
        "http://facebook.com": {
            "description": "Create an account or log into Facebook. Connect with friends, family and other people you know. Share photos and videos, send messages and get updates.",
            "favicon_url": "https://static.xx.fbcdn.net/rsrc.php/yV/r/hzMapiNYYpW.ico",
            "images": [
                {
                    "entropy": 1,
                    "height": 500,
                    "url": "https://www.facebook.com/images/fb_icon_325x325.png",
                    "width": 500
                }
            ],
            "original_url": "http://facebook.com",
            "title": "Facebook - Log In or Sign Up",
            "url": "http://facebook.com"
        },
        "http://google.co.in": {
            "favicon_url": "http://google.co.in/images/branding/product/ico/googleg_lodp.ico",
            "images": [],
            "original_url": "http://google.co.in",
            "title": "Google",
            "url": "http://google.co.in"
        },
        "http://google.co.jp": {
            "description": "世界中のあらゆる情報を検索するためのツールを提供しています。さまざまな検索機能を活用して、お探しの情報を見つけてください。",
            "favicon_url": "http://google.co.jp/images/branding/product/ico/googleg_lodp.ico",
            "images": [],
            "original_url": "http://google.co.jp",
            "title": "Google",
            "url": "http://google.co.jp"
        },
        "http://google.com": {
            "description": "Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.",
            "favicon_url": "http://google.com/images/branding/product/ico/googleg_lodp.ico",
            "images": [],
            "original_url": "http://google.com",
            "title": "Google",
            "url": "http://google.com"
        },
        "http://instagram.com": {
            "favicon_url": "http://instagram.com/h1/images/ico/apple-touch-icon-76x76-precomposed.png/932e4d9af891.png",
            "images": [],
            "original_url": "http://instagram.com",
            "title": "Instagram",
            "url": "http://instagram.com"
        },
        "http://linkedin.com": {
            "description": "400 million+ members | Manage your professional identity. Build and engage with your professional network. Access knowledge, insights and opportunities.",
            "favicon_url": "https://static.licdn.com/scds/common/u/img/icon/apple-touch-icon.png",
            "images": [],
            "original_url": "http://linkedin.com",
            "title": "World’s Largest Professional Network | LinkedIn",
            "url": "http://linkedin.com"
        },
        "http://live.com": {
            "description": "Outlook.com is a free, personal email service from Microsoft. Keep your inbox clutter-free with powerful organizational tools, and collaborate easily with OneDrive and Office Online integration.",
            "favicon_url": "https://auth.gfx.ms/16.000.26513.01/favicon.ico?v=2",
            "images": [],
            "original_url": "http://live.com",
            "title": "Sign In",
            "url": "http://live.com"
        },
        "http://msn.com": {
            "description": "The new MSN, Your customizable collection of the best in news, sports, entertainment, money, weather, travel, health, and lifestyle, combined with Outlook, Facebook, Twitter, Skype, and more.",
            "favicon_url": "http://msn.com/favicon.ico",
            "images": [],
            "original_url": "http://msn.com",
            "title": "MSN.com - Hotmail, Outlook, Skype, Bing, Latest News, Photos & Videos",
            "url": "http://msn.com"
        },
        "http://qq.com": {
            "description": "腾讯网(www.QQ.com)是中国浏览量最大的中文门户网站,是腾讯公司推出的集新闻信息、互动社区、娱乐产品和基础服务为一体的大型综合门户网站。腾讯网服务于全球华人用户,致力成为最具传播力和互动性,权威、主流、时尚的互联网媒体平台。通过强大的实时新闻和全面深入的信息资讯服务,为中国数以亿计的互联网用户提供富有创意的网上新生活。",
            "favicon_url": "http://mat1.gtimg.com/www/icon/favicon2.ico",
            "images": [],
            "original_url": "http://qq.com",
            "title": "腾讯首页",
            "url": "http://qq.com"
        },
        "http://sina.com.cn": {
            "description": "新浪网为全球用户24小时提供全面及时的中文资讯,内容覆盖国内外突发新闻事件、体坛赛事、娱乐时尚、产业资讯、实用信息等,设有新闻、体育、娱乐、财经、科技、房产、汽车等30多个内容频道,同时开设博客、视频、论坛等自由互动交流空间。",
            "favicon_url": "http://i3.sinaimg.cn/home/2013/0331/U586P30DT20130331093840.png",
            "images": [],
            "original_url": "http://sina.com.cn",
            "title": "新浪首页",
            "url": "http://sina.com.cn"
        },
        "http://taobao.com": {
            "description": "淘寶海外全球站提供流行服飾、鞋包配件、3C、居家、母嬰、運動等上億件商品,服務於美國、香港、台灣、馬來西亞等幾十個國家和地區,讓您享受網路超低價,同時提供擔保交易(先收貨後付款)、先行賠付、假一賠三、七天無理由退換貨、數碼免費維修等安全交易保障服務,安心享受購物樂趣!",
            "favicon_url": "http://taobao.com/favicon.ico",
            "images": [
                {
                    "entropy": 1,
                    "height": 500,
                    "url": "https://img.alicdn.com/tps/TB1Bh1KLFXXXXb4XpXXXXXXXXXX-470-246.jpg",
                    "width": 500
                }
            ],
            "original_url": "http://taobao.com",
            "title": "淘寶海外全球站 - 購物首選,淘你喜歡!",
            "url": "http://taobao.com"
        },
        "http://twitter.com": {
            "description": "From breaking news and entertainment to sports and politics, get the full story with all the live commentary.",
            "favicon_url": "http://twitter.com/favicons/favicon.ico",
            "images": [],
            "original_url": "http://twitter.com",
            "title": "Twitter - see what's happening",
            "url": "http://twitter.com"
        },
        "http://weibo.com": {
            "favicon_url": "http://weibo.com/favicon.ico",
            "images": [],
            "original_url": "http://weibo.com",
            "title": "Sina Visitor System",
            "url": "http://weibo.com"
        },
        "http://wikipedia.org": {
            "favicon_url": "http://wikipedia.org/static/apple-touch/wikipedia.png",
            "images": [],
            "original_url": "http://wikipedia.org",
            "title": "Wikipedia",
            "url": "http://wikipedia.org"
        },
        "http://yahoo.co.jp": {
            "description": "日本最大級のポータルサイト。検索、オークション、ニュース、メール、コミュニティ、ショッピング、など80以上のサービスを展開。あなたの生活をより豊かにする「ライフ・エンジン」を目指していきます。",
            "favicon_url": "http://yahoo.co.jp/favicon.ico",
            "images": [],
            "original_url": "http://yahoo.co.jp",
            "title": "Yahoo! JAPAN",
            "url": "http://yahoo.co.jp"
        },
        "http://yahoo.com": {
            "description": "News, email and search are just the beginning. Discover more every day. Find your yodel.",
            "favicon_url": "https://s.yimg.com/os/mit/media/p/common/images/favicon_new-7483e38.svg",
            "images": [
                {
                    "entropy": 1,
                    "height": 500,
                    "url": "https://s.yimg.com/dh/ap/default/130909/y_200_a.png",
                    "width": 500
                }
            ],
            "original_url": "http://yahoo.com",
            "title": "Yahoo",
            "url": "http://yahoo.com"
        },
        "http://youtube.com": {
            "description": "Enjoy the videos and music you love, upload original content, and share it all with friends, family, and the world on YouTube.",
            "favicon_url": "http://youtube.com/yts/img/favicon_32-vfl8NGn4k.png",
            "images": [
                {
                    "entropy": 1,
                    "height": 500,
                    "url": "http://s.ytimg.com/yts/img/yt_1200-vfl4C3T0K.png",
                    "width": 500
                }
            ],
            "original_url": "http://youtube.com",
            "title": "YouTube",
            "url": "http://youtube.com"
        }
    }
}

http https://metadata.dev.mozaws.net/v1/metadata urls:=@topsites/top-20.json  0.18s user 0.11s system 4% cpu 6.739 total
```

Another sketchy approach would be to add default Top Sites banner images for any Tippy-Top-Sites that are missing them. That way we'd at least have a fallback image for Google, Twitter, Wikipedia, etc.
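
A minimal sketch of that fallback idea, assuming a hand-maintained table keyed by hostname. Everything here is hypothetical: `FALLBACK_IMAGES`, `bareHost`, and `withFallbackImage` are illustrative names, and the `example.com` image URLs are placeholders rather than real assets.

```javascript
// Hypothetical fallback table keyed by hostname (without a leading "www.").
// The image URLs are placeholders, not real assets.
const FALLBACK_IMAGES = {
  'google.com': 'https://example.com/fallbacks/google.png',
  'twitter.com': 'https://example.com/fallbacks/twitter.png',
  'wikipedia.org': 'https://example.com/fallbacks/wikipedia.png'
};

// Returns the hostname of `url` with any leading "www." removed.
function bareHost(url) {
  return new URL(url).hostname.replace(/^www\./, '');
}

// Returns `site` untouched when it already has an image or when no fallback
// is known; otherwise returns a copy with the fallback as its only image.
function withFallbackImage(site, fallbacks = FALLBACK_IMAGES) {
  if (Array.isArray(site.images) && site.images.length > 0) return site;
  const fallback = fallbacks[bareHost(site.url)];
  if (!fallback) return site;
  return Object.assign({}, site, { images: [{ url: fallback }] });
}
```

The copy-on-change approach leaves the service's own response untouched, so it stays obvious which images came from the page and which were filled in afterwards.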

pdehaan commented 7 years ago

For those of you [like me] who can't read a giant block of JSON, here's a pleasant report from Joi validation:

`$ node lint-metadata.js`

## http://amazon.com :+1:

## http://baidu.com :-1:
- "description" is required
- "images" must contain at least 1 items
- "title" is required

## http://bing.com :-1:
- "images" must contain at least 1 items

## http://facebook.com :+1:

## http://google.co.in :-1:
- "description" is required
- "images" must contain at least 1 items

## http://google.co.jp :-1:
- "images" must contain at least 1 items

## http://google.com :-1:
- "images" must contain at least 1 items

## http://instagram.com :-1:
- "description" is required
- "images" must contain at least 1 items

## http://linkedin.com :-1:
- "images" must contain at least 1 items

## http://live.com :-1:
- "images" must contain at least 1 items

## http://msn.com :-1:
- "images" must contain at least 1 items

## http://qq.com :-1:
- "images" must contain at least 1 items

## http://sina.com.cn :-1:
- "images" must contain at least 1 items

## http://taobao.com :+1:

## http://twitter.com :-1:
- "images" must contain at least 1 items

## http://weibo.com :-1:
- "description" is required
- "images" must contain at least 1 items

## http://wikipedia.org :-1:
- "description" is required
- "images" must contain at least 1 items

## http://yahoo.co.jp :-1:
- "images" must contain at least 1 items

## http://yahoo.com :+1:

## http://youtube.com :+1:

`lint-metadata.js`:

``` js
const Joi = require('joi');
const metadata = require('./metadata.json');

// Every site entry must carry all six fields, with at least one image.
const schema = Joi.object().keys({
  description: Joi.string().required(),
  favicon_url: Joi.string().uri().required(),
  images: Joi.array().min(1).required(),
  original_url: Joi.string().uri().required(),
  title: Joi.string().required(),
  url: Joi.string().uri().required()
});

Object.keys(metadata.urls).forEach((url) => {
  const site = metadata.urls[url];
  // abortEarly: false collects every failure instead of stopping at the first.
  Joi.validate(site, schema, { abortEarly: false, allowUnknown: false }, (err, res) => {
    // Print a heading per URL, then one bullet per validation message.
    console.log(`\n## ${url} %s`, err ? ':-1:' : ':+1:');
    if (err) {
      err.details.forEach(({message}) => {
        console.log('- %s', message);
      });
      return;
    }
  });
});
```

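
For anyone who wants to run the same kind of check without installing Joi, here is a dependency-free sketch. It is an approximation, not Joi's behavior: `lintSite` and `REQUIRED_STRINGS` are hypothetical names, the messages only imitate the wording of the report above, and it skips the URI and unknown-key checks that the Joi schema performs.

```javascript
// Dependency-free sketch of the presence checks from lint-metadata.js.
// The messages imitate the Joi report above; this is not Joi's own code.
const REQUIRED_STRINGS = ['description', 'favicon_url', 'original_url', 'title', 'url'];

// Returns a sorted list of problems for one site entry (empty when it passes).
function lintSite(site) {
  const problems = [];
  for (const key of REQUIRED_STRINGS) {
    if (typeof site[key] !== 'string' || site[key] === '') {
      problems.push(`"${key}" is required`);
    }
  }
  if (!Array.isArray(site.images) || site.images.length < 1) {
    problems.push('"images" must contain at least 1 items');
  }
  return problems.sort();
}

// Usage with the baidu.com entry from the response above.
const baidu = {
  favicon_url: 'http://baidu.com/favicon.ico',
  images: [],
  original_url: 'http://baidu.com',
  url: 'http://baidu.com'
};
lintSite(baidu).forEach((message) => console.log('- %s', message));
```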
pdehaan commented 7 years ago

I actually got a better data dump of "random" URLs from @nchapman which gives much better results (mainly because they're inner article pages and not just homepages).

**Input:** Here's a sub list of the top 10 sites Nick scraped; I then chose the top 2 results for each site, which is why my dataset only has 16 URLs instead of 20.

``` json
[
    "https://www.theguardian.com/us-news/ng-interactive/2015/jun/01/the-counted-police-killings-us-database",
    "https://www.theguardian.com/sport/2016/aug/21/lynsey-sharp-caster-semenya-rio-2016-olympics",
    "http://www.smithsonianmag.com/history/for-40-years-this-russian-family-was-cut-off-from-all-human-contact-unaware-of-world-war-ii-7354256/?no-ist",
    "http://drudgereport.com/",
    "http://qz.com/252456/what-it-feels-like-to-be-the-last-generation-to-remember-life-before-the-internet/",
    "http://qz.com/767812/millennial-whoop/",
    "http://www.littlethings.com/dad-builds-wheelchair/",
    "http://www.littlethings.com/restored-queen-anne-house/",
    "http://www.newyorker.com/humor/daily-shouts/everything-i-am-afraid-might-happen-if-i-ask-new-acquaintances-to-get-coffee",
    "http://www.newyorker.com/culture/culture-desk/watching-canadas-biggest-rock-band-say-a-dramatic-goodbye?mbid=rss",
    "http://www.npr.org/sections/13.7/2016/08/22/490847797/why-do-we-judge-parents-for-putting-kids-at-perceived-but-unreal-risk",
    "http://www.npr.org/sections/thetorch/2016/08/21/490818961/u-s-women-are-the-biggest-winners-in-rio-olympics?utm_medium=RSS&utm_campaign=news",
    "http://tranquilmonkey.com/hunter-s-thompsons-extraordinary-letter-on-finding-your-purpose/",
    "http://www.businessinsider.com/most-meaningful-jobs-in-america-2015-7",
    "http://www.businessinsider.com/10-facts-that-prove-the-absurdity-of-pablo-escobars-wealth-2015-9",
    "https://smiledirectclub.com/?utm_source=digg&utm_medium=referral&utm_campaign=homepage-web"
]
```

...
And our returned metadata:

``` json
{
    "http://drudgereport.com/": {
        "favicon_url": "http://drudgereport.com/favicon.ico",
        "images": [],
        "original_url": "http://drudgereport.com/",
        "title": "DRUDGE REPORT 2016®",
        "url": "http://drudgereport.com/"
    },
    "http://qz.com/252456/what-it-feels-like-to-be-the-last-generation-to-remember-life-before-the-internet/": {
        "description": "Technology has a lot to answer for: killing old businesses, destroying the middle class, Buzzfeed. Technology in the form of the internet is especially villainous, having been accused of everything from making us dumber (paywall) to aiding dictatorships. But Michael Harris, riffing on the observations of Melvin Kranzberg, argues that \"technology is neither good nor evil. The most...",
        "favicon_url": "http://app.qz.com/img/icons/touch_144.png",
        "images": [
            {
                "entropy": 1,
                "height": 500,
                "url": "https://qzprod.files.wordpress.com/2014/08/michael-harris.jpg?w=1600",
                "width": 500
            }
        ],
        "original_url": "http://qz.com/252456/what-it-feels-like-to-be-the-last-generation-to-remember-life-before-the-internet/",
        "title": "What it feels like to be the last generation to remember life before the internet",
        "url": "http://qz.com/252456/what-it-feels-like-to-be-the-last-generation-to-remember-life-before-the-internet/"
    },
    "http://qz.com/767812/millennial-whoop/": {
        "description": "Once you hear it, you can't un-hear it.",
        "favicon_url": "http://app.qz.com/img/icons/touch_144.png",
        "images": [
            {
                "entropy": 1,
                "height": 500,
                "url": "https://qzprod.files.wordpress.com/2016/08/katy-perry.jpg?w=1600",
                "width": 500
            }
        ],
        "original_url": "http://qz.com/767812/millennial-whoop/",
        "title": "Listen: This same annoying whooping sound is in every popular song",
        "url": "http://qz.com/767812/millennial-whoop/"
    },
    "http://tranquilmonkey.com/hunter-s-thompsons-extraordinary-letter-on-finding-your-purpose/": {
        "description": "In April of 1958, a 22 year-old Hunter S. Thompson wrote a letter on the meaning of life when asked by a friend for advice. What makes his response all the more profound is the fact that at the time, the world had no idea that he would become one of the most important writers of the 20th …",
        "favicon_url": "http://tranquilmonkey.com/wp-content/uploads/2016/05/SPACETM-MIKRO.png",
        "images": [
            {
                "entropy": 1,
                "height": 500,
                "url": "http://tranquilmonkey.com/wp-content/uploads/2016/07/hunter__thompson.jpg",
                "width": 500
            }
        ],
        "original_url": "http://tranquilmonkey.com/hunter-s-thompsons-extraordinary-letter-on-finding-your-purpose/",
        "title": "On Finding Your Purpose: An Extraordinary Letter by Hunter S. Thompson",
        "url": "http://tranquilmonkey.com/hunter-s-thompsons-extraordinary-letter-on-finding-your-purpose/"
    },
    "http://www.businessinsider.com/10-facts-that-prove-the-absurdity-of-pablo-escobars-wealth-2015-9": {
        "description": "The \"King of Cocaine\" was the son of a poor...",
        "favicon_url": "http://static3.businessinsider.com/assets/images/us/favicons/apple-touch-icon-57x57.png?v=BI-US-2016-03-31",
        "images": [
            {
                "entropy": 1,
                "height": 500,
                "url": "http://static1.businessinsider.com/image/5600373d9dd7cc1d008bbdd9-1190-625/10-facts-reveal-the-absurdity-of-pablo-escobars-wealth.jpg",
                "width": 500
            }
        ],
        "original_url": "http://www.businessinsider.com/10-facts-that-prove-the-absurdity-of-pablo-escobars-wealth-2015-9",
        "title": "10 facts reveal the absurdity of Pablo Escobar's wealth",
        "url": "http://www.businessinsider.com/10-facts-that-prove-the-absurdity-of-pablo-escobars-wealth-2015-9"
    },
    "http://www.businessinsider.com/most-meaningful-jobs-in-america-2015-7": {
        "description": "These jobs make the world a better place.",
        "favicon_url": "http://static3.businessinsider.com/assets/images/us/favicons/apple-touch-icon-57x57.png?v=BI-US-2016-03-31",
        "images": [
            {
                "entropy": 1,
                "height": 500,
                "url": "http://static4.businessinsider.com/image/55afd0d2371d2223518b8179-1190-625/the-13-most-meaningful-jobs-in-america.jpg",
                "width": 500
            }
        ],
        "original_url": "http://www.businessinsider.com/most-meaningful-jobs-in-america-2015-7",
        "title": "The 13 most meaningful jobs in America",
        "url": "http://www.businessinsider.com/most-meaningful-jobs-in-america-2015-7"
    },
    "http://www.littlethings.com/dad-builds-wheelchair/": {
        "description": "When Evelyn Moore was just 4 months old, she was diagnosed with neuroblastoma. The tiny toddler has gone through chemotherapy eight times. Evelyn has been in remission for the last three months, but the toddler is paralyzed from the arms down due to a spinal tumor. After seeing how expensive custom-made wheelchairs are, Evelyn’s dad,...",
        "favicon_url": "http://cdn7.littlethings.com/app/themes/littlethings/img/icons/093015/touch.png",
        "images": [
            {
                "entropy": 1,
                "height": 500,
                "url": "http://cdn8.littlethings.com/app/uploads/2016/08/100-Wheelchair-A.jpg",
                "width": 500
            }
        ],
        "original_url": "http://www.littlethings.com/dad-builds-wheelchair/",
        "title": "Worried Dad Builds His Sick Daughter An Awesome Custom Wheelchair For $100",
        "url": "http://www.littlethings.com/dad-builds-wheelchair/"
    },
    "http://www.littlethings.com/restored-queen-anne-house/": {
        "description": "For a long time, a house in York, PA, stood falling apart, its shingles and siding crumbling, and its once-beautiful details being slowly hidden under layers of dust and rubble. It didn’t always look like the sagging, washed-out shadow of its former self, though. Built in 1887, this house was designed in the Queen Anne style,...",
        "favicon_url": "http://cdn7.littlethings.com/app/themes/littlethings/img/icons/093015/touch.png",
        "images": [
            {
                "entropy": 1,
                "height": 500,
                "url": "http://cdn8.littlethings.com/app/uploads/2016/08/anne-16.jpg",
                "width": 500
            }
        ],
        "original_url": "http://www.littlethings.com/restored-queen-anne-house/",
        "title": "‘Uninhabitable’ 1887 House Is Lovingly Restored To Its Former Glory",
        "url": "http://www.littlethings.com/restored-queen-anne-house/"
    },
    "http://www.newyorker.com/culture/culture-desk/watching-canadas-biggest-rock-band-say-a-dramatic-goodbye?mbid=rss": {
        "description": "The Tragically Hip reach the end of a tour in which the band and its fans bid farewell to the band’s lead singer.",
        "favicon_url": "http://www.newyorker.com/wp-content/assets/dist/img/icon/apple-touch-icon.png",
        "images": [
            {
                "entropy": 1,
                "height": 500,
                "url": "http://www.newyorker.com/wp-content/uploads/2016/08/Marche-WatchingCanadasBiggestRockBandSayaDramaticGoodbye-1200x630-1471646584.jpg",
                "width": 500
            }
        ],
        "original_url": "http://www.newyorker.com/culture/culture-desk/watching-canadas-biggest-rock-band-say-a-dramatic-goodbye?mbid=rss",
        "title": "Watching Canada’s Biggest Rock Band Say a Dramatic Goodbye - The New Yorker",
        "url": "http://www.newyorker.com/culture/culture-desk/watching-canadas-biggest-rock-band-say-a-dramatic-goodbye?mbid=rss"
    },
    "http://www.newyorker.com/humor/daily-shouts/everything-i-am-afraid-might-happen-if-i-ask-new-acquaintances-to-get-coffee": {
        "description": "No. 1: they will say no.",
        "favicon_url": "http://www.newyorker.com/wp-content/assets/dist/img/icon/apple-touch-icon.png",
        "images": [
            {
                "entropy": 1,
                "height": 500,
                "url": "http://www.newyorker.com/wp-content/uploads/2015/07/Cantor-Coffee-Shouts-1200-630-20144041.jpg",
                "width": 500
            }
        ],
        "original_url": "http://www.newyorker.com/humor/daily-shouts/everything-i-am-afraid-might-happen-if-i-ask-new-acquaintances-to-get-coffee",
        "title": "Everything I Am Afraid Might Happen If I Ask New Acquaintances to Get Coffee - The New Yorker",
        "url": "http://www.newyorker.com/humor/daily-shouts/everything-i-am-afraid-might-happen-if-i-ask-new-acquaintances-to-get-coffee"
    },
    "http://www.npr.org/sections/13.7/2016/08/22/490847797/why-do-we-judge-parents-for-putting-kids-at-perceived-but-unreal-risk": {
        "description": "Tania Lombrozo looks at research published Monday showing people's factual judgment of how much danger a child is in while a parent is away varies according to the extent of their moral outrage.",
        "favicon_url": "http://www.npr.org/favicon.ico",
        "images": [
            {
                "entropy": 1,
                "height": 500,
                "url": "https://media.npr.org/assets/img/2016/08/21/gettyimages-524408471_wide-8b5d40954475d07e32a3210a9747f099110a0cf7.jpg?s=1400",
                "width": 500
            }
        ],
        "original_url": "http://www.npr.org/sections/13.7/2016/08/22/490847797/why-do-we-judge-parents-for-putting-kids-at-perceived-but-unreal-risk",
        "title": "Why Do We Judge Parents For Putting Kids At Perceived — But Unreal — Risk?",
        "url": "http://www.npr.org/sections/13.7/2016/08/22/490847797/why-do-we-judge-parents-for-putting-kids-at-perceived-but-unreal-risk"
    },
    "http://www.npr.org/sections/thetorch/2016/08/21/490818961/u-s-women-are-the-biggest-winners-in-rio-olympics?utm_medium=RSS&utm_campaign=news": {
        "description": "As in London four years ago, American women are taking home more medals than their male counterparts. It's a trend that's likely to continue.",
        "favicon_url": "http://www.npr.org/favicon.ico",
        "images": [
            {
                "entropy": 1,
                "height": 500,
                "url": "https://media.npr.org/assets/img/2016/08/21/women-s-basketball-8-21-16_wide-2ab28e9756b75596410103c55fa115390245d17e.jpg?s=1400",
                "width": 500
            }
        ],
        "original_url": "http://www.npr.org/sections/thetorch/2016/08/21/490818961/u-s-women-are-the-biggest-winners-in-rio-olympics?utm_medium=RSS&utm_campaign=news",
        "title": "U.S. Women Are The Biggest Winners At The Rio Olympics",
        "url": "http://www.npr.org/sections/thetorch/2016/08/21/490818961/u-s-women-are-the-biggest-winners-in-rio-olympics?utm_medium=RSS&utm_campaign=news"
    },
    "http://www.smithsonianmag.com/history/for-40-years-this-russian-family-was-cut-off-from-all-human-contact-unaware-of-world-war-ii-7354256/?no-ist": {
        "description": "In 1978, Soviet geologists prospecting in the wilds of Siberia discovered a family of six, lost in the taiga",
        "favicon_url": "http://static.media.smithsonianmag.com/img/icons/Smithsonian-com-Icon2-60.png",
        "images": [
            {
                "entropy": 1,
                "height": 500,
                "url": "http://thumbs.media.smithsonianmag.com//filer/paleo-40years-russia-631x300.jpg__1072x720_q85_crop.jpg",
                "width": 500
            }
        ],
        "original_url": "http://www.smithsonianmag.com/history/for-40-years-this-russian-family-was-cut-off-from-all-human-contact-unaware-of-world-war-ii-7354256/?no-ist",
        "title": "For 40 Years, This Russian Family Was Cut Off From All Human Contact, Unaware of World War II",
        "url": "http://www.smithsonianmag.com/history/for-40-years-this-russian-family-was-cut-off-from-all-human-contact-unaware-of-world-war-ii-7354256/?no-ist"
    },
    "https://smiledirectclub.com/?utm_source=digg&utm_medium=referral&utm_campaign=homepage-web": {
        "description": "Get straighter teeth from home with SmileDirectClub Invisible Aligners for up to 70% less than other treatment options!",
        "favicon_url": "https://smiledirectclub.com/static/favicon.ico",
        "images": [
            {
                "entropy": 1,
                "height": 500,
                "url": "https://smiledirectclub.com/static/lib/images/SCC_SmilingAligners05.jpg",
                "width": 500
            }
        ],
        "original_url": "https://smiledirectclub.com/?utm_source=digg&utm_medium=referral&utm_campaign=homepage-web",
        "title": "SmileDirectClub: Invisible Aligners Done at Home",
        "url": "https://smiledirectclub.com/?utm_source=digg&utm_medium=referral&utm_campaign=homepage-web"
    },
    "https://www.theguardian.com/sport/2016/aug/21/lynsey-sharp-caster-semenya-rio-2016-olympics": {
        "description": "A tearful Lynsey Sharp said the decision to overturn rules on testosterone suppression made competing against the women’s Olympic 800m champion, Caster Semenya, difficult",
        "favicon_url": "https://assets.guim.co.uk/images/favicons/451963ac2e23633472bf48e2856d3f04/152x152.png",
        "images": [
            {
                "entropy": 1,
                "height": 500,
                "url": "https://i.guim.co.uk/img/media/bff5607963ec533189b61948c97f8ae7177bdf84/0_0_2144_1286/2144.jpg?w=1200&h=630&q=55&auto=format&usm=12&fit=crop&bm=normal&ba=bottom%2Cleft&blend64=aHR0cHM6Ly91cGxvYWRzLmd1aW0uY28udWsvMjAxNi8wNS8yNS9vdmVybGF5LWxvZ28tMTIwMC05MF9vcHQucG5n&s=4bcb563466d0f979e10017cc2210365f",
                "width": 500
            }
        ],
        "original_url": "https://www.theguardian.com/sport/2016/aug/21/lynsey-sharp-caster-semenya-rio-2016-olympics",
        "title": "Tearful Lynsey Sharp says rule change makes racing Caster Semenya difficult",
        "url": "https://www.theguardian.com/sport/2016/aug/21/lynsey-sharp-caster-semenya-rio-2016-olympics"
    },
    "https://www.theguardian.com/us-news/ng-interactive/2015/jun/01/the-counted-police-killings-us-database": {
        "description": "The Guardian has been counting the people killed by US law enforcement agencies since 2015. Read their stories and contribute to our ongoing, crowdsourced project",
        "favicon_url": "https://assets.guim.co.uk/images/favicons/451963ac2e23633472bf48e2856d3f04/152x152.png",
        "images": [
            {
                "entropy": 1,
                "height": 500,
                "url": "https://i.guim.co.uk/img/media/41a44039259cad6a2aff2f332da5ab7f85afc6c7/0_0_2560_1535/2560.jpg?w=1200&h=630&q=55&auto=format&usm=12&fit=crop&bm=normal&ba=bottom%2Cleft&blend64=aHR0cHM6Ly91cGxvYWRzLmd1aW0uY28udWsvMjAxNi8wNS8yNS9vdmVybGF5LWxvZ28tMTIwMC05MF9vcHQucG5n&s=d7803d58d4ba734a4206403f831bb225",
                "width": 500
            }
        ],
        "original_url": "https://www.theguardian.com/us-news/ng-interactive/2015/jun/01/the-counted-police-killings-us-database",
        "title": "The Counted: people killed by police in the United States – interactive",
        "url": "https://www.theguardian.com/us-news/ng-interactive/2015/jun/01/the-counted-police-killings-us-database"
    }
}
```

The only seemingly "bad" result was the homepage of drudgereport.com, which lacked a description and images metadata.

**Output:**

:-1: `http://drudgereport.com/`
- "description" is required
- "images" must contain at least 1 items

:+1: `http://qz.com/252456/what-it-feels-like-to-be-the-last-generation-to-remember-life-before-the-internet/`

:+1: `http://qz.com/767812/millennial-whoop/`

:+1: `http://tranquilmonkey.com/hunter-s-thompsons-extraordinary-letter-on-finding-your-purpose/`

:+1: `http://www.businessinsider.com/10-facts-that-prove-the-absurdity-of-pablo-escobars-wealth-2015-9`

:+1: `http://www.businessinsider.com/most-meaningful-jobs-in-america-2015-7`

:+1: `http://www.littlethings.com/dad-builds-wheelchair/`

:+1: `http://www.littlethings.com/restored-queen-anne-house/`

:+1: `http://www.newyorker.com/culture/culture-desk/watching-canadas-biggest-rock-band-say-a-dramatic-goodbye?mbid=rss`

:+1: `http://www.newyorker.com/humor/daily-shouts/everything-i-am-afraid-might-happen-if-i-ask-new-acquaintances-to-get-coffee`

:+1: `http://www.npr.org/sections/13.7/2016/08/22/490847797/why-do-we-judge-parents-for-putting-kids-at-perceived-but-unreal-risk`

:+1: `http://www.npr.org/sections/thetorch/2016/08/21/490818961/u-s-women-are-the-biggest-winners-in-rio-olympics?utm_medium=RSS&utm_campaign=news`

:+1: `http://www.smithsonianmag.com/history/for-40-years-this-russian-family-was-cut-off-from-all-human-contact-unaware-of-world-war-ii-7354256/?no-ist`

:+1: `https://smiledirectclub.com/?utm_source=digg&utm_medium=referral&utm_campaign=homepage-web`

:+1: `https://www.theguardian.com/sport/2016/aug/21/lynsey-sharp-caster-semenya-rio-2016-olympics`

:+1: `https://www.theguardian.com/us-news/ng-interactive/2015/jun/01/the-counted-police-killings-us-database`

pdehaan commented 7 years ago

Actually, since our Tippy-Top-Sites is optimized for top-level domains, it behaves pretty poorly, as expected:

http://www.baidu.com/ (with "www.")

$ http https://page-metadata-service.stage.mozaws.net/v1/metadata urls:='["http://www.baidu.com/"]' -j -v

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 241
Content-Type: application/json; charset=utf-8
Date: Tue, 30 Aug 2016 00:03:02 GMT
ETag: W/"f1-pbJN0Ov1duzEA824OTKFTQ"

{
    "request_error": "",
    "url_errors": {},
    "urls": {
        "http://www.baidu.com/": {
            "favicon_url": "http://www.baidu.com/img/baidu.svg",
            "images": [],
            "original_url": "http://www.baidu.com/",
            "title": "百度一下,你就知道",
            "url": "http://www.baidu.com/"
        }
    }
}

http://baidu.com/ (no "www.")

$ http https://page-metadata-service.stage.mozaws.net/v1/metadata urls:='["http://baidu.com/"]' -j

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 185
Content-Type: application/json; charset=utf-8
Date: Tue, 30 Aug 2016 00:03:11 GMT
ETag: W/"b9-POPHexWtNuza1L/cTgep7Q"

{
    "request_error": "",
    "url_errors": {},
    "urls": {
        "http://baidu.com/": {
            "favicon_url": "http://baidu.com/favicon.ico",
            "images": [],
            "original_url": "http://baidu.com/",
            "url": "http://baidu.com/"
        }
    }
}
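
Since the two responses above differ only by the "www." prefix, one option is to query both host variants and keep whichever result is richer. This is a rough sketch under that assumption: `hostVariants` and `richer` are hypothetical helpers, the scoring (title, description, at least one image) is arbitrary, and the naive "www." handling will not suit every site.

```javascript
// Returns [bare, www] variants of a URL, e.g. for http://www.baidu.com/
// it yields http://baidu.com/ and http://www.baidu.com/.
function hostVariants(input) {
  const bare = new URL(input);
  bare.hostname = bare.hostname.replace(/^www\./, '');
  const www = new URL(bare.href);
  www.hostname = `www.${bare.hostname}`;
  return [bare.href, www.href];
}

// Picks whichever metadata entry carries more of the fields we care about.
// Ties go to the first argument.
function richer(a, b) {
  const score = (site) =>
    ['title', 'description'].filter((key) => site && site[key]).length +
    (site && Array.isArray(site.images) && site.images.length > 0 ? 1 : 0);
  return score(b) > score(a) ? b : a;
}
```

With the two baidu.com responses above, `richer` would keep the "www." entry because it is the only one with a title.
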
Tippy-Top-Sites vs Fathom:

```
http://allrecipes.com/
  - "images" must contain at least 1 items
http://baidu.com/
  - "description" is required
  - "images" must contain at least 1 items
  - "title" is required
http://www.adobe.com/
  - "images" must contain at least 1 items
http://www.adp.com/
  - "images" must contain at least 1 items
http://www.ancestry.com/
  - "images" must contain at least 1 items
http://www.bbc.com/
  - "images" must contain at least 1 items
https://www.aa.com/
  - "images" must contain at least 1 items
https://www.americanexpress.com
  - "images" must contain at least 1 items
https://www.att.com/
  - "images" must contain at least 1 items
https://www.bankofamerica.com
  - "description" is required
  - "images" must contain at least 1 items
http://drudgereport.com/
  - "description" is required
  - "images" must contain at least 1 items
http://www.deviantart.com/
  - "images" must contain at least 1 items
http://www.ebay.com
  - "images" must contain at least 1 items
http://www.fedex.com/
  - "images" must contain at least 1 items
https://www.discover.com/
  - "images" must contain at least 1 items
https://www.discovercard.com/
  - "images" must contain at least 1 items
https://www.etsy.com/
  - "images" must contain at least 1 items
https://www.expedia.com/
  - "images" must contain at least 1 items
https://www.fitbit.com/
  - "images" must contain at least 1 items
http://images.google.com/
  - "images" must contain at least 1 items
http://www.gap.com/
  - "images" must contain at least 1 items
http://www.google.com/
  - "images" must contain at least 1 items
http://www.homedepot.com/
  - "images" must contain at least 1 items
http://www.huffingtonpost.com/
  - "images" must contain at least 1 items
http://www.ikea.com/
  - "description" is required
  - "images" must contain at least 1 items
https://www.groupon.com/
  - "images" must contain at least 1 items
http://mashable.com/stories/
  - "images" must contain at least 1 items
http://www.intuit.com/
  - "images" must contain at least 1 items
http://www.jcpenney.com/
  - "images" must contain at least 1 items
http://www.kohls.com
  - "images" must contain at least 1 items
http://www.lowes.com/
  - "images" must contain at least 1 items
http://www.microsoft.com/
  - "description" is required
  - "images" must contain at least 1 items
http://www.msn.com/
  - "images" must contain at least 1 items
https://login.microsoftonline.com/
  - "description" is required
  - "images" must contain at least 1 items
https://mail.live.com
  - "images" must contain at least 1 items
https://www.irs.gov/
  - "description" is required
  - "images" must contain at least 1 items
https://www.linkedin.com/
  - "images" must contain at least 1 items
https://www.netflix.com/
  - "images" must contain at least 1 items
http://buzzlie.com/
  - "description" is required
  - "images" must contain at least 1 items
http://ca.gov/
  - "description" is required
  - "images" must contain at least 1 items
http://conservativetribune.com/
  - "images" must contain at least 1 items
http://craigslist.org/
  - "images" must contain at least 1 items
http://www.bing.com/
  - "images" must contain at least 1 items
http://www.blackboard.com/
  - "images" must contain at least 1 items
http://www.cbsnews.com/
  - "images" must contain at least 1 items
http://www.cnn.com
  - "images" must contain at least 1 items
http://www.comcast.net/
  - "description" is required
  - "images" must contain at least 1 items
http://www.costco.com/
  - "images" must contain at least 1 items
https://www.chase.com
  - "images" must contain at least 1 items
```
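For anyone wanting to reproduce this kind of audit, here is a minimal sketch of the three checks that the messages above imply. The function name `auditMetadata` is hypothetical, and the rules are inferred from the error text only; the schema actually used to produce the list may differ.

```javascript
// Hypothetical helper: the rules below are inferred from the audit
// messages in this thread, not copied from the service's real schema.
function auditMetadata(meta) {
  const errors = [];

  // A description must be a non-empty string.
  if (typeof meta.description !== "string" || meta.description.trim() === "") {
    errors.push('"description" is required');
  }

  // At least one image must have been extracted.
  if (!Array.isArray(meta.images) || meta.images.length < 1) {
    errors.push('"images" must contain at least 1 items');
  }

  // A title must be a non-empty string.
  if (typeof meta.title !== "string" || meta.title.trim() === "") {
    errors.push('"title" is required');
  }

  return errors;
}

// Example: an entry with a title but no description and no images.
console.log(auditMetadata({ title: "Some page", images: [] }));
```

Returning a list of messages rather than throwing makes it easy to print one report per URL, as in the listing above.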

I think if we can't find a suitable image via OpenGraph or Twitter meta tags, or other page-metadata-parser rules, we may want to consider taking a screenshot and using that as a preview. I'm not sure whether that risks leaking personal information between users, though. Fathom and the metadata server can't do any logins, but a specially crafted URL could still surface content that was meant for one user.

pdehaan commented 7 years ago

A crude little module that lets you point at a URL, tries to extract any RSS feed URL, then scrapes the RSS items so we can pass them to the metadata server: https://github.com/pdehaan/fetch-site-rss

`$ node index http://www.latimes.com` ``` json { "http://www.latimes.com": { "description": "The LA Times is a leading source of breaking news, entertainment, sports, politics, and more for Southern California and the world.", "favicon_url": "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png", "images": [ { "entropy": 1, "height": 500, "url": "http://www.trbimg.com/img-56fd643a/turbine/la-l-a-times-logo-20160331/600", "width": 500 } ], "original_url": "http://www.latimes.com", "title": "Los Angeles Times - California, national and world news - Los Angeles Times", "url": "http://www.latimes.com" }, "http://www.latimes.com/la-et-jc-oprah-love-warrior-20160906-snap-story.html": { "description": "Oprah Winfrey on Tuesday unveiled the latest pick for her book club: \"Love Warrior,\" a memoir by Glennon Doyle Melton, the author, motivational speaker and founder of the online community Momastery.", "favicon_url": "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png", "images": [ { "entropy": 1, "height": 500, "url": "http://www.trbimg.com/img-57cf1127/turbine/la-et-jc-oprah-love-warrior-20160906-snap", "width": 500 } ], "original_url": "http://www.latimes.com/la-et-jc-oprah-love-warrior-20160906-snap-story.html", "title": "Oprah ramps up 2016 book club with second pick in two months: 'Love Warrior' by Glennon Doyle Melton", "url": "http://www.latimes.com/la-et-jc-oprah-love-warrior-20160906-snap-story.html" }, "http://www.latimes.com/la-framed-live-chat-20160906-story.html": { "favicon_url": "https://static.xx.fbcdn.net/rsrc.php/yV/r/hzMapiNYYpW.ico", "images": [], "original_url": "http://www.latimes.com/la-framed-live-chat-20160906-story.html", "title": "Live chat with “Framed” series reporter", "url": "http://www.latimes.com/la-framed-live-chat-20160906-story.html" }, "http://www.latimes.com/la-ig-tom-ford-q-and-a-20160906-snap-story.html": { 
"description": "The day before he’s set to kick off New York Fashion Week by sending his fall/winter 2016 collections down the runway and right into retail, fashion designer and filmmaker Tom Ford took a break from model and VIP fittings at his Madison Avenue boutique to talk about the logistics of actually pulling off a “see now/buy now” collection, how he juggles his two, high-profile careers and why he’s still looking for a place to live in Los Angeles – even though everyone thinks he bought a $50-million Beverly Hills mansion out from under Jay Z and Beyoncé.", "favicon_url": "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png", "images": [ { "entropy": 1, "height": 500, "url": "http://www.trbimg.com/img-57cf0d35/turbine/la-ig-tom-ford-q-and-a-20160906-snap", "width": 500 } ], "original_url": "http://www.latimes.com/la-ig-tom-ford-q-and-a-20160906-snap-story.html", "title": "On the eve of New York Fashion Week, Tom Ford talks about his move to L.A. 
and the future of his brand", "url": "http://www.latimes.com/la-ig-tom-ford-q-and-a-20160906-snap-story.html" }, "http://www.latimes.com/la-me-ln-metrolink-crash-20160906-snap-story.html": { "description": "A  Metrolink  train hit a vehicle in Sun Valley on Tuesday morning near San Fernando Boulevard, officials said.", "favicon_url": "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png", "images": [ { "entropy": 1, "height": 500, "url": "http://www.trbimg.com/img-57cf1c2c/turbine/la-me-ln-metrolink-crash-20160906-snap", "width": 500 } ], "original_url": "http://www.latimes.com/la-me-ln-metrolink-crash-20160906-snap-story.html", "title": "Metrolink train crashes into truck in Sun Valley", "url": "http://www.latimes.com/la-me-ln-metrolink-crash-20160906-snap-story.html" }, "http://www.latimes.com/la-me-ln-rave-regulation-20160906-snap-story.html": { "description": "The Labor Day weekend saw another major rave in Southern California. 
While no deaths were reported, more than 400 people were arrested at and five people were sent to hospitals from Nocturnal Wonderland, which drew more than 67,000 people at the San Manuel Amphitheater.", "favicon_url": "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png", "images": [ { "entropy": 1, "height": 500, "url": "http://www.trbimg.com/img-57cef2ce/turbine/la-me-ln-rave-regulation-20160906-snap", "width": 500 } ], "original_url": "http://www.latimes.com/la-me-ln-rave-regulation-20160906-snap-story.html", "title": "New laws regulating raves don't always apply, sometimes with tragic results", "url": "http://www.latimes.com/la-me-ln-rave-regulation-20160906-snap-story.html" }, "http://www.latimes.com/la-me-san-bernardino-conference-20160906-snap-story.html": { "description": "First responders from various cities attended a conference Tuesday aimed at better serving victims of mass-casualty incidents such as the San Bernardino terror attack.", "favicon_url": "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png", "images": [ { "entropy": 1, "height": 500, "url": "http://www.trbimg.com/img-57cf0ea3/turbine/la-me-san-bernardino-conference-20160906-snap", "width": 500 } ], "original_url": "http://www.latimes.com/la-me-san-bernardino-conference-20160906-snap-story.html", "title": "First responders use lessons from San Bernardino terrorist attack to better help other victims", "url": "http://www.latimes.com/la-me-san-bernardino-conference-20160906-snap-story.html" }, "http://www.latimes.com/la-na-trailguide-updates-09062016-htmlstory.html": { "description": "Campaign 2016 updates: Clinton campaign puts focus on national security -- and Trump's temperament Sept. 6, 2016, 12:37 p.m. Hillary Clinton heads to Tampa, Fla., for a rally Tuesday. Donald Trump will stop in Virginia before going to his rally in North Carolina. 
Major national security address...", "favicon_url": "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png", "images": [ { "entropy": 1, "height": 500, "url": "http://www.trbimg.com/img-57ceae18/turbine/la-na-trailguide-updates-09062016", "width": 500 } ], "original_url": "http://www.latimes.com/la-na-trailguide-updates-09062016-htmlstory.html", "title": "Campaign 2016 updates: Clinton campaign puts focus on national security -- and Trump's temperament", "url": "http://www.latimes.com/la-na-trailguide-updates-09062016-htmlstory.html" }, "http://www.latimes.com/la-na-trailguide-updates-1473184137-htmlstory.html": { "description": "Update on 'Campaign 2016 updates: Clinton campaign puts focus on national security -- and Trump's temperament'", "favicon_url": "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png", "images": [ { "entropy": 1, "height": 500, "url": "http://www.trbimg.com/img-56fd643a/turbine/la-l-a-times-logo-20160331/600", "width": 500 } ], "original_url": "http://www.latimes.com/la-na-trailguide-updates-1473184137-htmlstory.html", "title": "Watch: Donald Trump talks national security in Virginia Beach", "url": "http://www.latimes.com/la-na-trailguide-updates-1473184137-htmlstory.html" }, "http://www.latimes.com/la-na-trailguide-updates-1473186715-htmlstory.html": { "description": "Update on 'Campaign 2016 updates: Clinton campaign puts focus on national security -- and Trump's temperament'", "favicon_url": "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png", "images": [ { "entropy": 1, "height": 500, "url": "http://www.trbimg.com/img-56fd643a/turbine/la-l-a-times-logo-20160331/600", "width": 500 } ], "original_url": "http://www.latimes.com/la-na-trailguide-updates-1473186715-htmlstory.html", "title": "Update on: Campaign 2016 updates: Clinton campaign puts focus on national 
security -- and Trump's temperament", "url": "http://www.latimes.com/la-na-trailguide-updates-1473186715-htmlstory.html" }, "http://www.latimes.com/la-na-trailguide-updates-donald-trump-holds-town-hall-with-no-1473189370-htmlstory.html": { "description": "Update on 'Campaign 2016 updates: Clinton campaign puts focus on national security -- and Trump's temperament'", "favicon_url": "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png", "images": [ { "entropy": 1, "height": 500, "url": "http://www.trbimg.com/img-57cf1c56/turbine/la-na-trailguide-updates-donald-trump-holds-town-hall-with-no-1473189370", "width": 500 } ], "original_url": "http://www.latimes.com/la-na-trailguide-updates-donald-trump-holds-town-hall-with-no-1473189370-htmlstory.html", "title": "Donald Trump holds 'town hall' with no audience questions", "url": "http://www.latimes.com/la-na-trailguide-updates-donald-trump-holds-town-hall-with-no-1473189370-htmlstory.html" }, "http://www.latimes.com/la-na-trailguide-updates-pence-headed-to-california-to-speak-1473186298-htmlstory.html": { "description": "Update on 'Campaign 2016 updates: Clinton campaign puts focus on national security -- and Trump's temperament'", "favicon_url": "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png", "images": [ { "entropy": 1, "height": 500, "url": "http://www.trbimg.com/img-57cf0bb3/turbine/la-na-trailguide-updates-pence-headed-to-california-to-speak-1473186298", "width": 500 } ], "original_url": "http://www.latimes.com/la-na-trailguide-updates-pence-headed-to-california-to-speak-1473186298-htmlstory.html", "title": "Pence headed to California to speak -- and raise money", "url": "http://www.latimes.com/la-na-trailguide-updates-pence-headed-to-california-to-speak-1473186298-htmlstory.html" }, "http://www.latimes.com/la-na-trailguide-updates-watch-hillary-clinton-rallies-1473184539-htmlstory.html": { 
"description": "Update on 'Campaign 2016 updates: Clinton campaign puts focus on national security -- and Trump's temperament'", "favicon_url": "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png", "images": [ { "entropy": 1, "height": 500, "url": "http://www.trbimg.com/img-56fd643a/turbine/la-l-a-times-logo-20160331/600", "width": 500 } ], "original_url": "http://www.latimes.com/la-na-trailguide-updates-watch-hillary-clinton-rallies-1473184539-htmlstory.html", "title": "Watch: Hillary Clinton rallies supporters in Tampa, Fla.", "url": "http://www.latimes.com/la-na-trailguide-updates-watch-hillary-clinton-rallies-1473184539-htmlstory.html" }, "http://www.latimes.com/la-ol-le-school-20160906-snap-story.html": { "description": "To the editor: Our children have been attending DOC schools for the past six years, and each year our family has to adapt to make it work; I've made arrangements with my employer to work alternate schedules, we've carpooled, and now our children are old enough to safely navigate the public bus or the city shuttle.", "favicon_url": "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png", "images": [ { "entropy": 1, "height": 500, "url": "http://www.trbimg.com/img-57ced363/turbine/la-ol-le-school-20160906-snap", "width": 500 } ], "original_url": "http://www.latimes.com/la-ol-le-school-20160906-snap-story.html", "title": "School-choice statute is in the crossfire", "url": "http://www.latimes.com/la-ol-le-school-20160906-snap-story.html" }, "http://www.latimes.com/la-sp-angels-mailbag-20160906-snap-htmlstory.html": { "description": "Hello, hello, Angels fans. This mailbag is entering your consciousness one day later than normal because of Monday’s holiday. 
The Angels played well in the last week, but suffered a terrible shock Sunday in Seattle, when starter Matt Shoemaker was struck in the head by a line drive and rushed to...", "favicon_url": "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png", "images": [ { "entropy": 1, "height": 500, "url": "http://www.trbimg.com/img-57cf19f1/turbine/la-sp-angels-mailbag-20160906-snap", "width": 500 } ], "original_url": "http://www.latimes.com/la-sp-angels-mailbag-20160906-snap-htmlstory.html", "title": "Angels mailbag: To root for the team to win, or lose?", "url": "http://www.latimes.com/la-sp-angels-mailbag-20160906-snap-htmlstory.html" }, "http://www.latimes.com/la-sp-miesha-tate-hike--20160906-snap-htmlstory.html": { "description": "Miesha Tate spent Labor Day hiking in the mountains of Nevada — and helping an injured little girl. The former UFC women’s bantamweight champion wrote on Facebook that while hiking along Mary Jane Falls on Mt. 
Charleston she encountered a 6-year-old who had broken her arm along the way and whose...", "favicon_url": "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png", "images": [ { "entropy": 1, "height": 500, "url": "http://www.trbimg.com/img-57cf055b/turbine/la-sp-miesha-tate-hike--20160906-snap", "width": 500 } ], "original_url": "http://www.latimes.com/la-sp-miesha-tate-hike--20160906-snap-htmlstory.html", "title": "UFC's Miesha Tate describes carrying an injured child down a mountain to safety", "url": "http://www.latimes.com/la-sp-miesha-tate-hike--20160906-snap-htmlstory.html" }, "http://www.latimes.com/la-sp-vi-boys-basketball-chino-hills-bishop-montgomery-fairfax-set-for-rolling-hills-state-preview-classic-20160906-story.html": { "description": "Former Fairfax Coach Harvey Kitani, now the head basketball coach at Rolling Hills Prep, has taken over running his State Preview Classic tournament and has put together another impressive group of teams.", "favicon_url": "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png", "images": [ { "entropy": 1, "height": 500, "url": "http://www.trbimg.com/img-56fd643a/turbine/la-l-a-times-logo-20160331/600", "width": 500 } ], "original_url": "http://www.latimes.com/la-sp-vi-boys-basketball-chino-hills-bishop-montgomery-fairfax-set-for-rolling-hills-state-preview-classic-20160906-story.html", "title": "Boys' basketball: Chino Hills, Bishop Montgomery, Fairfax set for Rolling Hills State Preview Classic", "url": "http://www.latimes.com/la-sp-vi-boys-basketball-chino-hills-bishop-montgomery-fairfax-set-for-rolling-hills-state-preview-classic-20160906-story.html" }, "http://www.latimes.com/la-sp-vi-football-narbonne-palos-verdes-to-be-televised-by-time-warner-cable-20160906-story.html": { "description": "Friday's battle of the unbeatens between Narbonne (2-0) and Palos Verdes (2-0) will be televised live by Time 
Warner Cable on Channel 84 and also on the web at twccommunity.com .", "favicon_url": "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png", "images": [ { "entropy": 1, "height": 500, "url": "http://www.trbimg.com/img-56fd643a/turbine/la-l-a-times-logo-20160331/600", "width": 500 } ], "original_url": "http://www.latimes.com/la-sp-vi-football-narbonne-palos-verdes-to-be-televised-by-time-warner-cable-20160906-story.html", "title": "Football: Narbonne-Palos Verdes to be televised by Time Warner Cable", "url": "http://www.latimes.com/la-sp-vi-football-narbonne-palos-verdes-to-be-televised-by-time-warner-cable-20160906-story.html" }, "http://www.latimes.com/la-sp-vi-football-so-far-cathedral-s-two-quarterback-system-is-working-just-fine-20160906-story.html": { "description": "Cathedral Coach Kevin Pearson is sticking with his plan to play senior Andrew Tovar and freshman Bryce Young together at quarterack, and so far it's working.", "favicon_url": "http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png", "images": [ { "entropy": 1, "height": 500, "url": "http://www.trbimg.com/img-56fd643a/turbine/la-l-a-times-logo-20160331/600", "width": 500 } ], "original_url": "http://www.latimes.com/la-sp-vi-football-so-far-cathedral-s-two-quarterback-system-is-working-just-fine-20160906-story.html", "title": "Football: So far, Cathedral's two-quarterback system is working just fine", "url": "http://www.latimes.com/la-sp-vi-football-so-far-cathedral-s-two-quarterback-system-is-working-just-fine-20160906-story.html" }, "http://www.latimes.com/la-sp-vi-girls-volleyball-chaminade-to-face-granada-hills-on-wednesday-20160906-story.html": { "description": "There's a good nonleague girls' volleyball match set for Wednesday, with City Section Division I title favorite Granada Hills hosting Chaminade.", "favicon_url": 
"http://www.trbas.com/jive/prod/common/images/lanews-apple-touch-icon.1q2w3_9ffdb679907f116af126c65ff1edb27a.png", "images": [ { "entropy": 1, "height": 500, "url": "http://www.trbimg.com/img-56fd643a/turbine/la-l-a-times-logo-20160331/600", "width": 500 } ], "original_url": "http://www.latimes.com/la-sp-vi-girls-volleyball-chaminade-to-face-granada-hills-on-wednesday-20160906-story.html", "title": "Girls' volleyball: Chaminade to face Granada Hills on Wednesday", "url": "http://www.latimes.com/la-sp-vi-girls-volleyball-chaminade-to-face-granada-hills-on-wednesday-20160906-story.html" } } ```

But this should make it a lot easier to scrape a bunch of the tippy-top-sites or Alexa Top 100 sites where we don't necessarily care about the homepage and want to test random inner content pages.
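Once the inner-page URLs are collected, they still need to be split up before being sent to the metadata server. A rough sketch, assuming the 20-URL per-request limit and the `{"urls": [...]}` request body mentioned earlier in this thread; `toBatches` is a hypothetical name, and it only builds the request bodies without sending anything:

```javascript
// Per-request limit mentioned earlier in this thread (assumption: still 20).
const BATCH_SIZE = 20;

// Hypothetical helper: split a list of URLs into request bodies shaped
// like the {"urls": [...]} payload that /v1/metadata was sent above.
function toBatches(urls, size = BATCH_SIZE) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < urls.length; i += size) {
    batches.push({ urls: urls.slice(i, i + size) });
  }
  return batches;
}

// Example: 45 made-up URLs become three request bodies (20, 20, 5).
const sample = Array.from({ length: 45 }, (_, i) => `http://example.com/story-${i}.html`);
console.log(toBatches(sample).map((body) => body.urls.length));
```

Given the 504s seen with batches of 10, a smaller `size` may be needed in practice.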

pdehaan commented 7 years ago

... or maybe not for tippy-top-sites. I'm seeing errors on about 144 of the 173 total URLs, mostly because it can't find an RSS feed:

`$ node index2 > debug.json` ``` Unable to find RSS feed for http://www.kohls.com Unable to find RSS feed for https://www.instructure.com/ Unable to find RSS feed for http://diply.com/ Unable to find RSS feed for http://www.google.com/ Unable to find RSS feed for http://images.google.com/ Unable to find RSS feed for https://www.ups.com/ Unable to find RSS feed for https://www.usaa.com/ Unable to find RSS feed for http://faithtap.com/ Unable to find RSS feed for http://gfycat.com/ Unable to find RSS feed for https://www.instagram.com/ Unable to find RSS feed for https://login.microsoftonline.com/ Unable to find RSS feed for https://www.taboola.com/ Unable to find RSS feed for https://www.twitch.tv/ Unable to find RSS feed for http://www.mapquest.com/ Unable to find RSS feed for http://www.apple.com/ Unable to find RSS feed for https://www.discover.com/ Unable to find RSS feed for https://soundcloud.com/ Unable to find RSS feed for http://www.ancestry.com/ Unable to find RSS feed for http://www.ask.com Unable to find RSS feed for https://www.discovercard.com/ Unable to find RSS feed for https://www.surveymonkey.com/ Unable to find RSS feed for https://www.americanexpress.com Unable to find RSS feed for https://www.usbank.com/ Unable to find RSS feed for http://www.verizon.com/ Unable to find RSS feed for http://www.ikea.com/ Unable to find RSS feed for https://www.linkedin.com/ Unable to find RSS feed for https://www.irs.gov/ Unable to find RSS feed for https://www.usps.com/ Invalid URI "//en.blog.wordpress.com/feed/" Unable to find RSS feed for https://www.wunderground.com/ Unable to find RSS feed for http://drudgereport.com/ Unable to find RSS feed for http://www.fedex.com/ Invalid URI "rss" Unable to find RSS feed for http://www.t-mobile.com/ Unable to find RSS feed for https://www.bankofamerica.com Unable to find RSS feed for https://www.box.com/ Unable to find RSS feed for https://github.com/ Unable to find RSS feed for https://www.tumblr.com/ Unable to find RSS 
feed for http://www.swagbucks.com/ Unable to find RSS feed for https://www.eventbrite.com/ Unable to find RSS feed for http://www.ign.com/ Unable to find RSS feed for http://www.lowes.com/ Unable to find RSS feed for http://www.realtor.com/ Unable to find RSS feed for http://www.wittyfeed.com Unable to find RSS feed for https://www.capitalone.com/ Unable to find RSS feed for https://www.office.com/ Unable to find RSS feed for http://stackexchange.com/ Unable to find RSS feed for https://www.paypal.com/home Unable to find RSS feed for http://www.trulia.com/ Unable to find RSS feed for http://www.target.com Unable to find RSS feed for https://www.dropbox.com/ Unable to find RSS feed for http://www.hulu.com/ Unable to find RSS feed for http://www.homedepot.com/ Unable to find RSS feed for https://www.airbnb.com/ Unable to find RSS feed for http://www.msn.com/ Unable to find RSS feed for https://www.spotify.com/ Invalid URI "//feeds.feedburner.com/foxnews/latest" Unable to find RSS feed for http://www.foodnetwork.com/ Unable to find RSS feed for http://www.bestbuy.com Unable to find RSS feed for http://ca.gov/ Unable to find RSS feed for http://www.blackboard.com/ Unable to find RSS feed for https://www.expedia.com/ Unable to find RSS feed for http://www.adobe.com/ Unable to find RSS feed for https://www.chase.com Unable to find RSS feed for https://www.groupon.com/ Unable to find RSS feed for http://www.salesforce.com/ Unable to find RSS feed for https://www.tripadvisor.com/ Unable to find RSS feed for http://www.microsoft.com/ Unable to find RSS feed for http://www.sears.com/ Unable to find RSS feed for http://www.thesaurus.com/ Unable to find RSS feed for http://www.cnn.com Unable to find RSS feed for https://www.delta.com/ Unable to find RSS feed for http://www.jcpenney.com/ Invalid URI "/rss/rss.php?id=1002" Unable to find RSS feed for http://www.answers.com Unable to find RSS feed for https://www.wellsfargo.com Unable to find RSS feed for https://www.aa.com/ 
Unable to find RSS feed for https://www.flickr.com Unable to find RSS feed for https://mail.live.com Unable to find RSS feed for http://patch.com Unable to find RSS feed for http://www.staples.com/ Unable to find RSS feed for http://www.accuweather.com/ Unable to find RSS feed for https://www.att.com/ Unable to find RSS feed for http://www.cbsnews.com/ Unable to find RSS feed for https://www.facebook.com/ Unable to find RSS feed for http://www.zillow.com/ Invalid URI "/rss/VZWPromotions.rss" Unable to find RSS feed for http://www.goodreads.com/ Unable to find RSS feed for https://www.fitbit.com/ Unable to find RSS feed for https://www.washingtonpost.com/regional/ Unable to find RSS feed for http://go.com Unable to find RSS feed for http://www.bbc.com/ Unable to find RSS feed for http://www.bing.com/ Unable to find RSS feed for http://www.cbssports.com/ Unable to find RSS feed for http://www.ebay.com Unable to find RSS feed for http://www.macys.com/ Unable to find RSS feed for http://www.intuit.com/ Unable to find RSS feed for http://craigslist.org/ Unable to find RSS feed for https://www.wikipedia.org/ Unable to find RSS feed for http://www.costco.com/ Cannot read property '#' of undefined Unable to find RSS feed for http://www.ebates.com/ Unable to find RSS feed for https://www.netflix.com/ Unable to find RSS feed for https://online.citi.com/ Unable to find RSS feed for https://twitter.com/ Unable to find RSS feed for https://weather.com/ Unable to find RSS feed for https://www.blogger.com/home Unable to find RSS feed for http://www.comcast.net/ Unable to find RSS feed for http://www.xfinity.com/ Unable to find RSS feed for https://www.glassdoor.com/ Unable to find RSS feed for http://espn.go.com Unable to find RSS feed for http://www.huffingtonpost.com/ Unable to find RSS feed for https://www.reddit.com/ Unable to find RSS feed for http://stackoverflow.com/ Unable to find RSS feed for http://mlb.mlb.com/ Unable to find RSS feed for https://www.southwest.com/ 
Unable to find RSS feed for https://aws.amazon.com/ Unable to find RSS feed for https://vimeo.com/ Unable to find RSS feed for http://www.deviantart.com/ Unable to find RSS feed for http://www.aol.com/ Unable to find RSS feed for https://www.pinterest.com/ Unable to find RSS feed for http://www.wayfair.com/ Unable to find RSS feed for http://www.usatoday.com/ Unable to find RSS feed for http://www.gap.com/ Unable to find RSS feed for http://www.about.com/ Unable to find RSS feed for http://www.amazon.com/ Unable to find RSS feed for https://www.etsy.com/ Unable to find RSS feed for http://nypost.com/ Unable to find RSS feed for http://allrecipes.com/ Unable to find RSS feed for https://www.yahoo.com/ Unable to find RSS feed for http://www.buzzfeed.com/index Unable to find RSS feed for http://shop.nordstrom.com/ Unable to find RSS feed for http://www.nbcnews.com/ Unable to find RSS feed for http://www.adp.com/ Unable to find RSS feed for http://yelp.com/ Invalid status code: 403 Invalid URI "/newsearch.php?mode=frontpage&searcharea=deals&searchin=first&rss=1" Unable to find RSS feed for http://www.walmart.com/ Unable to find RSS feed for http://www.overstock.com/ Unable to find RSS feed for http://www.imdb.com/ Unable to find RSS feed for https://www.kayak.com/ Unable to find RSS feed for http://www.wsj.com/ Unable to find RSS feed for https://www.youtube.com/ request to http://baidu.com/ failed, reason: connect ETIMEDOUT 220.181.57.217:80 ```

Some of those are bugs in my code: I should be checking for alternate feed URLs (like "atom" or something), and I'm not properly normalizing the URL when it's protocol-less or relative. But still...
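A minimal sketch of how those two fixes might look, using the WHATWG `URL` class that ships with Node. The helper names (`isFeedType`, `resolveFeedUrl`) and the `example.com` base URLs are made up for illustration; this is not the fetch-site-rss code.

```javascript
// MIME types commonly used on <link rel="alternate"> for feeds.
const FEED_TYPES = new Set(["application/rss+xml", "application/atom+xml"]);

// Hypothetical helper: accept Atom as well as RSS feed links.
function isFeedType(type) {
  return FEED_TYPES.has(String(type || "").trim().toLowerCase());
}

// Hypothetical helper: resolve a protocol-less ("//host/feed") or
// relative ("rss", "/rss/feed.xml") href against the page it came from.
// Returns null when the pair cannot be turned into a valid URL.
function resolveFeedUrl(href, pageUrl) {
  try {
    return new URL(href, pageUrl).href;
  } catch (err) {
    return null;
  }
}

// The hrefs are the ones from the "Invalid URI" errors above;
// the base URLs are placeholders.
console.log(resolveFeedUrl("//en.blog.wordpress.com/feed/", "https://example.com/"));
console.log(resolveFeedUrl("rss", "http://example.com/news/"));
console.log(resolveFeedUrl("/rss/rss.php?id=1002", "http://example.com/a/b"));
```

Resolving against the page URL means a protocol-less href inherits the page's scheme, which is usually what the site intended.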