Closed: edwindotcom closed this issue 2 years ago
Related: https://github.com/pdehaan/get-alexa-top-sites :poop::fire:
Not sure how helpful that is, since it only gives us the homepages of the top 1M domains, versus scraping an inner page of those sites. But it may be a decent-enough starting point.
OK, batching this isn't as easy as I expected. After prepending "https://" to each of the top sites and grabbing the first 20 (which is our per-request proxy limit), I'm just getting HTTP/504 GATEWAY_TIMEOUT errors, even if I drop it down to 10 requests.
$ time http https://metadata.dev.mozaws.net/v1/metadata urls:=@topsites/top-20.json -j -v --follow --timeout=300
POST /v1/metadata HTTP/1.1
Accept: application/json
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 233
Content-Type: application/json; charset=utf-8
Host: metadata.dev.mozaws.net
User-Agent: HTTPie/0.9.1
{
"urls": [
"https://google.com",
"https://youtube.com",
"https://facebook.com",
"https://baidu.com",
"https://yahoo.com",
"https://amazon.com",
"https://wikipedia.org",
"https://qq.com",
"https://twitter.com",
"https://google.co.in"
]
}
HTTP/1.1 504 GATEWAY_TIMEOUT
Connection: keep-alive
Content-Length: 0
http https://metadata.dev.mozaws.net/v1/metadata urls:=@topsites/top-20.json
0.16s user
0.07s system
0% cpu
1:00.04 total
Looks like the proxy times out the request after 60s, which doesn't seem to be enough time to process a batch of even 10 URLs.
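One workaround might be to chunk the top-sites list client-side into smaller payloads, so each POST stays well under the 60s gateway timeout. A minimal sketch (the batch size of 5 is just an assumption, not a known safe value):

```javascript
// Split a list of bare domains into small, ready-to-POST payloads
// for the /v1/metadata endpoint. Batch size is an assumption; tune
// it until requests stop hitting the 60s proxy timeout.
const BATCH_SIZE = 5;

function toBatches(domains, size = BATCH_SIZE) {
  // Prepend "https://" to each bare domain, as described above.
  const urls = domains.map((d) => `https://${d}`);
  const batches = [];
  for (let i = 0; i < urls.length; i += size) {
    batches.push({ urls: urls.slice(i, i + size) });
  }
  return batches;
}

// Each element is a JSON body for one POST to /v1/metadata.
console.log(toBatches(["google.com", "youtube.com", "facebook.com"], 2));
```

Each batch could then be sent sequentially (or with limited concurrency) instead of one giant request.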
OK, some success...
Another sketchy approach would be to add some default Top Sites banner images for any Tippy Top Sites which may be missing them. At least that way we'd have a fallback image for google, twitter, wikipedia, etc.
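That fallback could look something like this sketch; the banner URLs and the `DEFAULT_BANNERS` map are placeholders I made up, not real assets:

```javascript
// Sketch: fall back to a hard-coded banner when a site's metadata
// has no images. The resource:// URLs below are hypothetical.
const DEFAULT_BANNERS = {
  "google.com": "resource://default-banners/google.png",
  "twitter.com": "resource://default-banners/twitter.png",
  "wikipedia.org": "resource://default-banners/wikipedia.png",
};

function bannerFor(entry) {
  // Prefer a real image from the metadata service when one exists.
  if (entry.images && entry.images.length) return entry.images[0];
  // Otherwise fall back to a default banner, keyed by bare hostname.
  const host = new URL(entry.url).hostname.replace(/^www\./, "");
  return DEFAULT_BANNERS[host] || null;
}

console.log(bannerFor({ url: "http://www.twitter.com/", images: [] }));
```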
For those of you [like me] who can't read a giant block of JSON, here's a pleasant report from Joi validation:
I actually got a better data dump of "random" URLs from @nchapman which gives much better results (mainly because they're inner article pages and not just homepages).
The only seemingly "bad" result was the homepage of drudgereport.com, which lacked description and images metadata.
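For reference, the per-URL checks behind that Joi report can be sketched in plain JS (so it runs without the joi dependency); the field names come from the service responses in this thread, but which fields are required versus optional is my assumption:

```javascript
// Sketch: the kind of per-URL validation the Joi report performs,
// as plain JS. Required/optional choices here are assumptions.
function validateMetadata(entry) {
  const errors = [];
  if (!entry.title) errors.push('"title" is required');
  if (!entry.description) errors.push('"description" is required');
  if (!entry.favicon_url) errors.push('"favicon_url" is required');
  if (!Array.isArray(entry.images) || entry.images.length === 0) {
    errors.push('"images" must contain at least 1 item');
  }
  return errors;
}

// e.g. the http://baidu.com/ response from later in this thread:
console.log(validateMetadata({
  favicon_url: "http://baidu.com/favicon.ico",
  images: [],
  original_url: "http://baidu.com/",
  url: "http://baidu.com/",
}));
```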
Actually, since our Tippy Top Sites is optimized for top-level domains, it expectedly behaves pretty poorly: the results are missing .description and .title (which is wrong, so I think I'm hitting the "Fathom/Metadata isn't following redirects" issue).

http://www.baidu.com/ (with "www."):

$ http https://page-metadata-service.stage.mozaws.net/v1/metadata urls:='["http://www.baidu.com/"]' -j -v
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 241
Content-Type: application/json; charset=utf-8
Date: Tue, 30 Aug 2016 00:03:02 GMT
ETag: W/"f1-pbJN0Ov1duzEA824OTKFTQ"
{
"request_error": "",
"url_errors": {},
"urls": {
"http://www.baidu.com/": {
"favicon_url": "http://www.baidu.com/img/baidu.svg",
"images": [],
"original_url": "http://www.baidu.com/",
"title": "百度一下,你就知道",
"url": "http://www.baidu.com/"
}
}
}
http://baidu.com/ (no "www."):

$ http https://page-metadata-service.stage.mozaws.net/v1/metadata urls:='["http://baidu.com/"]' -j
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 185
Content-Type: application/json; charset=utf-8
Date: Tue, 30 Aug 2016 00:03:11 GMT
ETag: W/"b9-POPHexWtNuza1L/cTgep7Q"
{
"request_error": "",
"url_errors": {},
"urls": {
"http://baidu.com/": {
"favicon_url": "http://baidu.com/favicon.ico",
"images": [],
"original_url": "http://baidu.com/",
"url": "http://baidu.com/"
}
}
}
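Until the redirects issue is fixed, one workaround might be to query both the bare and the "www." form of each domain and prefer whichever result has richer metadata. A small sketch of generating both forms:

```javascript
// Sketch: since http://baidu.com/ and http://www.baidu.com/ return
// different metadata (redirects aren't being followed), generate
// both forms for each URL so we can query them and keep the richer
// result. This is a workaround idea, not the service's behavior.
function withAndWithoutWww(url) {
  const u = new URL(url);
  const host = u.hostname.replace(/^www\./, "");
  return [
    `${u.protocol}//${host}${u.pathname}`,
    `${u.protocol}//www.${host}${u.pathname}`,
  ];
}

console.log(withAndWithoutWww("http://baidu.com/"));
```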
I think if we can't find a suitable <img/> tag via OpenGraph or Twitter meta tags, or other page-metadata-parser rules, we may want to consider taking a screenshot and using that as a preview. Not sure if there'd be any risk of leaking personal information between users, though. Fathom and the metadata server couldn't do any logins or anything, but specially crafted URLs could always have some tricky edge cases.
A crude little module which lets you point to a URL, try and extract any RSS feed URL, then scrapes the RSS items, so we can pass them to the metadata server: https://github.com/pdehaan/fetch-site-rss
But this should make it a lot easier to scrape a bunch of the tippy-top-sites or Alexa Top 100 sites where we don't necessarily care about the homepage and want to test random inner content pages.
... or maybe not for tippy-top-sites. I'm seeing errors on about 144 of the [173 total] URLs because it can't find an RSS feed:
Some of those are bugs in my code, where I should be checking for alternate feed types (like "atom"), or where I'm not properly normalizing the feed URL if it's protocol-less or relative. But still...
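Both of those fixes can be sketched together: match Atom as well as RSS link types, and resolve relative or protocol-relative hrefs against the page URL. Note this regex-based sketch is a simplification for illustration, not how fetch-site-rss actually parses the page:

```javascript
// Sketch: find RSS/Atom feed URLs in a page's <link> tags, covering
// the two bugs mentioned above: (1) also match application/atom+xml,
// and (2) normalize relative / protocol-relative hrefs via the
// WHATWG URL resolver. Regex HTML parsing is a simplification here.
const FEED_TYPES = ["application/rss+xml", "application/atom+xml"];

function findFeedUrls(html, pageUrl) {
  const links = html.match(/<link\b[^>]*>/gi) || [];
  const feeds = [];
  for (const tag of links) {
    const type = (tag.match(/type=["']([^"']+)["']/i) || [])[1];
    const href = (tag.match(/href=["']([^"']+)["']/i) || [])[1];
    if (type && href && FEED_TYPES.includes(type.toLowerCase())) {
      // new URL(href, base) resolves "/feed.xml" and "//host/feed"
      // style hrefs against the page URL.
      feeds.push(new URL(href, pageUrl).href);
    }
  }
  return feeds;
}

// Protocol-relative href resolves against the page's protocol:
console.log(findFeedUrls(
  '<link rel="alternate" type="application/atom+xml" href="/feed.xml">',
  "https://example.com/post"
));
```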
As we move to showing more highlights, we should make sure that top sites (such as wikipedia, etc.) have proper metadata on the Fathom metadata service.