mozilla / page-metadata-parser

DEPRECATED - A Javascript library for parsing metadata on a web page.
https://www.npmjs.com/package/page-metadata-parser
Mozilla Public License 2.0

YouTube doesn't return a "type" attribute #26

Closed pdehaan closed 2 years ago

pdehaan commented 8 years ago

Not sure if we have a response schema or if we guarantee that you'll get all fields, but I noticed this while browsing using the super cool ffmetadata tool.

youtube.com:


$ curl -i -XPOST -H "content-type: application/json" -d '{"urls": ["https://www.youtube.com"]}' http://localhost:7001 | JSON

HTTP/1.1 200 OK
content-type: application/json; charset=utf-8
cache-control: no-cache
content-length: 500
Date: Thu, 30 Jun 2016 20:01:21 GMT
Connection: keep-alive

{
  "error": "",
  "urls": {
    "https://www.youtube.com": {
      "description": "Enjoy the videos and music you love, upload original content, and share it all with friends, family, and the world on YouTube.",
      "icon_url": "https://s.ytimg.com/yts/img/favicon_32-vfl8NGn4k.png",
      "image_url": "//s.ytimg.com/yts/img/yt_1200-vfl4C3T0K.png",
      "title": "YouTube",
      "url": "https://www.youtube.com",
      "original_url": "https://www.youtube.com",
      "provider_url": "https://www.youtube.com",
      "favicon_url": "https://www.youtube.com/favicon.ico"
    }
  }
}

Whereas www.cnn.com gives me slightly different fields:

cnn.com:

$ curl -i -XPOST -H "content-type: application/json" -d '{"urls": ["http://www.cnn.com"]}' http://localhost:7001 | JSON

HTTP/1.1 200 OK
content-type: application/json; charset=utf-8
cache-control: no-cache
content-length: 560
Date: Thu, 30 Jun 2016 20:02:00 GMT
Connection: keep-alive

{
  "error": "",
  "urls": {
    "http://www.cnn.com": {
      "description": "View the latest news and breaking news today for U.S., world, weather, entertainment, politics and health at CNN.com.",
      "icon_url": "http://i.cdn.turner.com/cnn/.e/img/3.0/global/misc/apple-touch-icon.png",
      "image_url": "http://i.cdn.turner.com/cnn/.e1mo/img/4.0/logos/menu_politics.png",
      "title": "CNN - Breaking News, Latest News and Videos",
      "type": "website",
      "url": "http://www.cnn.com",
      "original_url": "http://www.cnn.com",
      "provider_url": "http://www.cnn.com",
      "favicon_url": "http://www.cnn.com/favicon.ico"
    }
  }
}

pdehaan commented 8 years ago

Here's the docs from the Embedly API (http://docs.embed.ly/docs/extract):

type

Returns the type of the document at this URL, they can be one of the following:

  • html: The most common response. The resource is an html document.
  • text: The response is a plain text document.
  • image: This is a static viewable image.
  • video: This is a playable video.
  • audio: This is a playable audio.
  • rss: The resource is an rss feed.
  • xml: The resource is an xml document.
  • atom: The resource is an atom feed.
  • json: The resource is a json document.
  • ppt: The resource is a PowerPoint document.
  • link: This is a general embed that may not contain HTML.
  • error: When accessing multiple urls at once Embedly will not throw HTTP errors as normal. Instead, it will return an error type response that includes the url, error_message and error_code.

Not sure if we should just default it to html if we can't detect a type. It looks like we may just be setting the type from whatever the page author specified in the og:type OpenGraph metadata (see /parser.js:63-65), so we could get literally anything, or nothing at all.
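A rough sketch of what defaulting could look like, assuming we wanted to stick to the Embedly list above. The names normalizeType, EMBEDLY_TYPES and OG_PREFIX_MAP are hypothetical and not part of page-metadata-parser, and the mapping from og:type prefixes to Embedly types is my own guess rather than anything either spec defines:

```javascript
// Hypothetical sketch: none of these names exist in page-metadata-parser.
// The allowed values are the Embedly "type" values quoted above.
const EMBEDLY_TYPES = new Set([
  'html', 'text', 'image', 'video', 'audio', 'rss',
  'xml', 'atom', 'json', 'ppt', 'link', 'error',
]);

// Assumed mapping from an og:type prefix (e.g. "video.other") to an Embedly type.
const OG_PREFIX_MAP = new Map([
  ['video', 'video'],
  ['music', 'audio'],
]);

function normalizeType(ogType) {
  // Missing or empty og:type: fall back to the most common response type.
  if (typeof ogType !== 'string' || ogType.trim() === '') {
    return 'html';
  }
  const value = ogType.trim().toLowerCase();
  // Already one of the known types: pass it through.
  if (EMBEDLY_TYPES.has(value)) {
    return value;
  }
  // Otherwise try the part before the first dot, then give up and say html.
  const prefix = value.split('.')[0];
  return OG_PREFIX_MAP.get(prefix) ?? 'html';
}
```

With something like this, a page with no og:type (the YouTube homepage case) and a page with og:type "website" (the CNN case) would both report html, so the field would always be present.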

pdehaan commented 8 years ago

This is a bit more obvious if you try scraping an image (which doesn't have a DOM or any meta info):

$ curl -XPOST -H "content-type: application/json" -d '{"urls": ["http://i.imgur.com/eFgAHrH.gif"]}' http://localhost:7001 | JSON

{
  "error": "",
  "urls": {
    "http://i.imgur.com/eFgAHrH.gif": {
      "url": "http://i.imgur.com/eFgAHrH.gif",
      "original_url": "http://i.imgur.com/eFgAHrH.gif",
      "provider_url": "http://i.imgur.com/eFgAHrH.gif",
      "favicon_url": "http://i.imgur.com/favicon.ico"
    }
  }
}

versus:

{
  "error": "",
  "urls": {
    "http://www.cnn.com": {
      "description": "View the latest news and breaking news today for U.S., world, weather, entertainment, politics and health at CNN.com.",
      "icon_url": "http://i.cdn.turner.com/cnn/.e/img/3.0/global/misc/apple-touch-icon.png",
      "image_url": "http://i.cdn.turner.com/cnn/.e1mo/img/4.0/logos/menu_politics.png",
      "title": "CNN - Breaking News, Latest News and Videos",
      "type": "website",
      "url": "http://www.cnn.com",
      "original_url": "http://www.cnn.com",
      "provider_url": "http://www.cnn.com",
      "favicon_url": "http://www.cnn.com/favicon.ico"
    }
  }
}
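Since an image has no DOM to read og:type from, one possible fallback is to guess the type from the response's Content-Type header. This is only a sketch under that assumption: typeFromContentType is a hypothetical helper, not something the service or page-metadata-parser provides, and the MIME-to-type table is my own reading of the Embedly list:

```javascript
// Hypothetical sketch: guess an Embedly-style type from a Content-Type header.
// Returns undefined when the header is missing or the MIME type is not recognized.
function typeFromContentType(contentType) {
  if (typeof contentType !== 'string') {
    return undefined;
  }
  // Drop parameters such as "; charset=utf-8" and normalize case.
  const mime = contentType.split(';')[0].trim().toLowerCase();

  // Exact matches first, so text/html and text/xml are not swallowed by "text/".
  if (mime === 'text/html' || mime === 'application/xhtml+xml') return 'html';
  if (mime === 'application/rss+xml') return 'rss';
  if (mime === 'application/atom+xml') return 'atom';
  if (mime === 'application/json') return 'json';
  if (mime === 'application/xml' || mime === 'text/xml') return 'xml';

  // Then the broad top-level media types.
  if (mime.startsWith('image/')) return 'image';
  if (mime.startsWith('video/')) return 'video';
  if (mime.startsWith('audio/')) return 'audio';
  if (mime.startsWith('text/')) return 'text';
  return undefined;
}
```

For the imgur GIF above, a header of image/gif would give image, which is at least more useful than no type field at all.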

For my own future reference, here's a better CLI for scraping contents:

http https://page-metadata.services.mozilla.com/v1/metadata urls:='["https://www.youtube.com"]' -j

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 488
Content-Type: application/json; charset=utf-8
Date: Thu, 01 Sep 2016 23:06:03 GMT
ETag: W/"1e8-m0pbvGJUhK/jPz4lZGyTQQ"

{
    "request_error": "",
    "url_errors": {},
    "urls": {
        "https://www.youtube.com": {
            "description": "Enjoy the videos and music you love, upload original content, and share it all with friends, family, and the world on YouTube.",
            "favicon_url": "https://www.youtube.com/yts/img/favicon_32-vfl8NGn4k.png",
            "images": [
                {
                    "entropy": 1,
                    "height": 500,
                    "url": "https://s.ytimg.com/yts/img/yt_1200-vfl4C3T0K.png",
                    "width": 500
                }
            ],
            "original_url": "https://www.youtube.com",
            "title": "YouTube",
            "url": "https://www.youtube.com"
        }
    }
}

And the results from my handy <meta>, <link>, and <title> meta-scraper tool-io:

$ meta-scraper -u "https://www.youtube.com"

YouTube

... Actually seems to be yet another case of "homepages suck, details pages are rad...":

$ meta-scraper -u "https://www.youtube.com/watch?v=nE8P9mTffQo&feature=youtu.be&t=167"

How To Dance To... (Part 2) - Birds & EDM, The Comeback! - YouTube