Closed: pdehaan closed this issue 2 years ago
Here's the docs from the Embedly API (http://docs.embed.ly/docs/extract):
**type**

Returns the type of the document at this URL. It can be one of the following:

- `html`: The most common response. The resource is an html document.
- `text`: The response is a plain text document.
- `image`: This is a static viewable image.
- `video`: This is a playable video.
- `audio`: This is a playable audio file.
- `rss`: The resource is an rss feed.
- `xml`: The resource is an xml document.
- `atom`: The resource is an atom feed.
- `json`: The resource is a json document.
- `ppt`: The resource is a PowerPoint document.
- `link`: This is a general embed that may not contain HTML.
- `error`: When accessing multiple urls at once, Embedly will not throw HTTP errors as normal. Instead, it will return an error type response that includes the url, error_message, and error_code.
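For reference, the Embedly type values above map fairly cleanly onto MIME types. This is my own sketch of such a mapping (the helper name and the exact mappings are assumptions, not Embedly's actual logic):

```javascript
// Hypothetical helper: map a response's Content-Type header to one of
// Embedly's `type` values. Falls back to "link" for anything unrecognized.
function embedlyType(contentType) {
  const mime = (contentType || "").split(";")[0].trim().toLowerCase();
  if (mime === "text/html") return "html";
  if (mime === "text/plain") return "text";
  if (mime.startsWith("image/")) return "image";
  if (mime.startsWith("video/")) return "video";
  if (mime.startsWith("audio/")) return "audio";
  if (mime === "application/rss+xml") return "rss";
  if (mime === "application/atom+xml") return "atom";
  if (mime === "application/json") return "json";
  if (mime.endsWith("/xml") || mime.endsWith("+xml")) return "xml";
  return "link";
}

console.log(embedlyType("image/gif"));                // "image"
console.log(embedlyType("text/html; charset=utf-8")); // "html"
```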
Not sure if we should just default it to `html` if we can't detect a type, but it looks like we may just be setting the `type` based on whatever the user specified in the `og:type` OpenGraph metadata (see /parser.js:63-65), so we could get literally anything, or nothing at all.
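The fallback being discussed could be as small as this (a sketch, assuming `metaTags` is a map of scraped `<meta>` properties; this is not what parser.js currently does):

```javascript
// If the page's og:type meta tag is missing, default `type` to "html"
// instead of omitting the field entirely.
function getType(metaTags) {
  return metaTags["og:type"] || "html";
}

console.log(getType({ "og:type": "website" })); // "website"
console.log(getType({}));                       // "html"
```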
This is a bit more obvious if you try scraping an image (which doesn't have a DOM or any meta info):
```sh
$ curl -XPOST -H "content-type: application/json" -d '{"urls": ["http://i.imgur.com/eFgAHrH.gif"]}' http://localhost:7001 | JSON
{
  "error": "",
  "urls": {
    "http://i.imgur.com/eFgAHrH.gif": {
      "url": "http://i.imgur.com/eFgAHrH.gif",
      "original_url": "http://i.imgur.com/eFgAHrH.gif",
      "provider_url": "http://i.imgur.com/eFgAHrH.gif",
      "favicon_url": "http://i.imgur.com/favicon.ico"
    }
  }
}
```
versus:
```json
{
  "error": "",
  "urls": {
    "http://www.cnn.com": {
      "description": "View the latest news and breaking news today for U.S., world, weather, entertainment, politics and health at CNN.com.",
      "icon_url": "http://i.cdn.turner.com/cnn/.e/img/3.0/global/misc/apple-touch-icon.png",
      "image_url": "http://i.cdn.turner.com/cnn/.e1mo/img/4.0/logos/menu_politics.png",
      "title": "CNN - Breaking News, Latest News and Videos",
      "type": "website",
      "url": "http://www.cnn.com",
      "original_url": "http://www.cnn.com",
      "provider_url": "http://www.cnn.com",
      "favicon_url": "http://www.cnn.com/favicon.ico"
    }
  }
}
```
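The imgur case above suggests a cheap guard: if a URL's response isn't HTML, there's no DOM to scrape, so the meta-tag pass could be skipped entirely. A minimal sketch (the helper name and the idea of sniffing by Content-Type are my own assumptions):

```javascript
// Returns true only for content types that actually carry a parseable DOM.
function hasScrapableDom(contentType) {
  const mime = (contentType || "").split(";")[0].trim().toLowerCase();
  return mime === "text/html" || mime === "application/xhtml+xml";
}

console.log(hasScrapableDom("image/gif"));                // false
console.log(hasScrapableDom("text/html; charset=utf-8")); // true
```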
For my own future reference, here's a better CLI for scraping contents:
```sh
$ http https://page-metadata.services.mozilla.com/v1/metadata urls:='["https://www.youtube.com"]' -j
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 488
Content-Type: application/json; charset=utf-8
Date: Thu, 01 Sep 2016 23:06:03 GMT
ETag: W/"1e8-m0pbvGJUhK/jPz4lZGyTQQ"

{
  "request_error": "",
  "url_errors": {},
  "urls": {
    "https://www.youtube.com": {
      "description": "Enjoy the videos and music you love, upload original content, and share it all with friends, family, and the world on YouTube.",
      "favicon_url": "https://www.youtube.com/yts/img/favicon_32-vfl8NGn4k.png",
      "images": [
        {
          "entropy": 1,
          "height": 500,
          "url": "https://s.ytimg.com/yts/img/yt_1200-vfl4C3T0K.png",
          "width": 500
        }
      ],
      "original_url": "https://www.youtube.com",
      "title": "YouTube",
      "url": "https://www.youtube.com"
    }
  }
}
```
And the results from my handy `<meta>`, `<link>`, and `<title>` meta-scraper tool-io:
... Actually, this seems to be yet another case of "homepages suck, details pages are rad...":
Not sure if we have a response schema or if we guarantee that you'll get all fields, but I noticed this while browsing using the super cool ffmetadata tool.
youtube.com:
Whereas www.cnn.com gives me slightly different fields:
cnn.com: