jacktuck / unfurl

Metadata scraper with support for oEmbed, Twitter Cards and Open Graph Protocol for Node.js :zap:
MIT License

Youtube: only favicon gets extracted #67

trieloff closed this issue 3 years ago

trieloff commented 3 years ago

YouTube changed its HTML a month ago, and since then our tests (https://github.com/adobe/helix-embed/pull/345) have been failing when they verify the output for YouTube.

The underlying issue is a combination of unfurl making the reasonable assumption that all metadata is in the head, here:

https://github.com/jacktuck/unfurl/blob/db57429b369bae7e22f6983a7e19832c54101491/src/index.ts#L270-L273

and YouTube being above convention, standards, and reason:

<!DOCTYPE html>
<html
  style="font-size: 10px; font-family: Roboto, Arial, sans-serif"
  lang="de-DE"
>
  <head>
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <link
      rel="shortcut icon"
      href="https://www.youtube.com/s/desktop/d743f786/img/favicon.ico"
      type="image/x-icon"
    />
    <link
      rel="icon"
      href="https://www.youtube.com/s/desktop/d743f786/img/favicon_32.png"
      sizes="32x32"
    />
    <link
      rel="icon"
      href="https://www.youtube.com/s/desktop/d743f786/img/favicon_48.png"
      sizes="48x48"
    />
    <link
      rel="icon"
      href="https://www.youtube.com/s/desktop/d743f786/img/favicon_96.png"
      sizes="96x96"
    />
    <link
      rel="icon"
      href="https://www.youtube.com/s/desktop/d743f786/img/favicon_144.png"
      sizes="144x144"
    />
    <link
      rel="stylesheet"
      href="//fonts.googleapis.com/css?family=Roboto:500,300,700,400"
      name="www-roboto"
    />
    <script name="www-roboto" nonce="26OMsP9eT4h+T5PS9iXDRQ">
      if (document.fonts && document.fonts.load) {
        document.fonts.load("400 10pt Roboto", "");
        document.fonts.load("500 10pt Roboto", "");
      }
    </script>
    <link
      rel="stylesheet"
      href="//fonts.googleapis.com/css?family=YT%20Sans%3A300%2C500%2C700"
      name="www-webfont-yt-sans"
    />
    <link rel="stylesheet" href="/s/player/5dd3f3b2/www-player.css" />
    <link
      rel="stylesheet"
      href="https://www.youtube.com/s/desktop/d743f786/cssbin/www-main-desktop-watch-page-skeleton.css"
    />
    <link
      rel="stylesheet"
      href="https://www.youtube.com/s/desktop/d743f786/cssbin/www-main-desktop-player-skeleton.css"
    />
    <link
      rel="stylesheet"
      href="https://www.youtube.com/s/desktop/d743f786/cssbin/www-onepick.css"
    />
    <meta name="theme-color" content="rgba(255, 255, 255, 0.98)" />
    <link
      rel="search"
      type="application/opensearchdescription+xml"
      href="https://www.youtube.com/opensearch?locale=de_DE"
      title="YouTube"
    />
    <link
      rel="manifest"
      href="/s/notifications/manifest/manifest.json"
      crossorigin="use-credentials"
    />
  </head> <!-- END OF HEAD HERE END OF HEAD HERE END OF HEAD HERE END OF HEAD HERE END OF HEAD HERE END OF HEAD HERE END OF HEAD HERE  --->
  <body dir="ltr" no-y-overflow>
    <link
      rel="canonical"
      href="https://www.youtube.com/watch?v=ccYpEv4APec"
    /><link
      rel="alternate"
      media="handheld"
      href="https://m.youtube.com/watch?v=ccYpEv4APec"
    /><link
      rel="alternate"
      media="only screen and (max-width: 640px)"
      href="https://m.youtube.com/watch?v=ccYpEv4APec"
    /><title>
      Google Translate Sings: &quot;The Sound of Silence&quot; (Simon &amp;
      Garfunkel) - YouTube</title
    ><meta
      name="title"
      content='Google Translate Sings: "The Sound of Silence" (Simon &amp; Garfunkel)'
    /><meta
      name="description"
      content="SUBSCRIBE: http://bit.ly/sub2MalindaCHECK OUT MY MUSIC CHANNEL: https://bit.ly/2GsRyrqPATREON: http://bit.ly/MKRsupportMERCH: http://shopmalinda.com/Follow m..."
    /><meta
      name="keywords"
      content="sound of silence, parody, google translate, google translate sings, disturbed, pentatonix, performance, the sound of silence, simon and garfunkel, translator fails, translation, fail, comedy, 1960s, paul simon, official video"
    /><link rel="shortlinkUrl" href="https://youtu.be/ccYpEv4APec" /><link
      rel="alternate"
      href="android-app://com.google.android.youtube/http/www.youtube.com/watch?v=ccYpEv4APec"
    /><link
      rel="alternate"
      href="ios-app://544007664/vnd.youtube/www.youtube.com/watch?v=ccYpEv4APec"
    /><link
      rel="alternate"
      type="application/json+oembed"
      href="http://www.youtube.com/oembed?format=json&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DccYpEv4APec"
      title='Google Translate Sings: "The Sound of Silence" (Simon &amp; Garfunkel)'
    /><link
      rel="alternate"
      type="text/xml+oembed"
      href="http://www.youtube.com/oembed?format=xml&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DccYpEv4APec"
      title='Google Translate Sings: "The Sound of Silence" (Simon &amp; Garfunkel)'
    /><link
      rel="image_src"
      href="https://i.ytimg.com/vi/ccYpEv4APec/maxresdefault.jpg"
    /><meta property="og:site_name" content="YouTube" /><meta
      property="og:url"
      content="https://www.youtube.com/watch?v=ccYpEv4APec"
    /><meta
      property="og:title"
      content='Google Translate Sings: "The Sound of Silence" (Simon &amp; Garfunkel)'
    /><meta
      property="og:image"
      content="https://i.ytimg.com/vi/ccYpEv4APec/maxresdefault.jpg"
    /><meta property="og:image:width" content="1280" /><meta
      property="og:image:height"
      content="720"
    /><meta
      property="og:description"
      content="SUBSCRIBE: http://bit.ly/sub2MalindaCHECK OUT MY MUSIC CHANNEL: https://bit.ly/2GsRyrqPATREON: http://bit.ly/MKRsupportMERCH: http://shopmalinda.com/Follow m..."
    /><meta property="al:ios:app_store_id" content="544007664" /><meta
      property="al:ios:app_name"
      content="YouTube"
    /><meta
      property="al:ios:url"
      content="vnd.youtube://www.youtube.com/watch?v=ccYpEv4APec&amp;feature=applinks"
    /><meta
      property="al:android:url"
      content="vnd.youtube://www.youtube.com/watch?v=ccYpEv4APec&amp;feature=applinks"
    /><meta
      property="al:web:url"
      content="http://www.youtube.com/watch?v=ccYpEv4APec&amp;feature=applinks"
    /><meta property="og:type" content="video.other" /><meta
      property="og:video:url"
      content="https://www.youtube.com/embed/ccYpEv4APec"
    /><meta
      property="og:video:secure_url"
      content="https://www.youtube.com/embed/ccYpEv4APec"
    /><meta property="og:video:type" content="text/html" /><meta
      property="og:video:width"
      content="1280"
    /><meta property="og:video:height" content="720" /><meta
      property="al:android:app_name"
      content="YouTube"
    /><meta
      property="al:android:package"
      content="com.google.android.youtube"
    /><meta property="og:video:tag" content="sound of silence" /><meta
      property="og:video:tag"
      content="parody"
    /><meta property="og:video:tag" content="google translate" /><meta
      property="og:video:tag"
      content="google translate sings"
    /><meta property="og:video:tag" content="disturbed" /><meta
      property="og:video:tag"
      content="pentatonix"
    /><meta property="og:video:tag" content="performance" /><meta
      property="og:video:tag"
      content="the sound of silence"
    /><meta property="og:video:tag" content="simon and garfunkel" /><meta
      property="og:video:tag"
      content="translator fails"
    /><meta property="og:video:tag" content="translation" /><meta
      property="og:video:tag"
      content="fail"
    /><meta property="og:video:tag" content="comedy" /><meta
      property="og:video:tag"
      content="1960s"
    /><meta property="og:video:tag" content="paul simon" /><meta
      property="og:video:tag"
      content="official video"
    /><meta property="fb:app_id" content="87741124305" /><meta
      name="twitter:card"
      content="player"
    /><meta name="twitter:site" content="@youtube" /><meta
      name="twitter:url"
      content="https://www.youtube.com/watch?v=ccYpEv4APec"
    /><meta
      name="twitter:title"
      content='Google Translate Sings: "The Sound of Silence" (Simon &amp; Garfunkel)'
    /><meta
      name="twitter:description"
      content="SUBSCRIBE: http://bit.ly/sub2MalindaCHECK OUT MY MUSIC CHANNEL: https://bit.ly/2GsRyrqPATREON: http://bit.ly/MKRsupportMERCH: http://shopmalinda.com/Follow m..."
    /><meta
      name="twitter:image"
      content="https://i.ytimg.com/vi/ccYpEv4APec/maxresdefault.jpg"
    /><meta name="twitter:app:name:iphone" content="YouTube" /><meta
      name="twitter:app:id:iphone"
      content="544007664"
    /><meta name="twitter:app:name:ipad" content="YouTube" /><meta
      name="twitter:app:id:ipad"
      content="544007664"
    /><meta
      name="twitter:app:url:iphone"
      content="vnd.youtube://www.youtube.com/watch?v=ccYpEv4APec&amp;feature=applinks"
    /><meta
      name="twitter:app:url:ipad"
      content="vnd.youtube://www.youtube.com/watch?v=ccYpEv4APec&amp;feature=applinks"
    /><meta name="twitter:app:name:googleplay" content="YouTube" /><meta
      name="twitter:app:id:googleplay"
      content="com.google.android.youtube"
    /><meta
      name="twitter:app:url:googleplay"
      content="https://www.youtube.com/watch?v=ccYpEv4APec"
    /><meta
      name="twitter:player"
      content="https://www.youtube.com/embed/ccYpEv4APec"
    /><meta name="twitter:player:width" content="1280" /><meta
      name="twitter:player:height"
      content="720"
    />

(HTML reformatted and nearly all script and style tags removed)

As you can see, most of the interesting metadata (even the title) is outside the head.
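To make the failure mode concrete, here is a minimal sketch of the difference between a head-only scan and a whole-document scan. It is not unfurl's parser: `collectMeta` and `headOnly` are hypothetical helpers written for this illustration, the scanner is a deliberately naive regex, and `page` is a reduced stand-in for the YouTube markup quoted above.

```typescript
// Illustrative only: a tiny regex-based scanner, not unfurl's actual parser.
// Returns name/property -> content for each <meta> tag in the given markup.
function collectMeta(html: string): Record<string, string> {
  const found: Record<string, string> = {};
  const tags = html.match(/<meta\b[^>]*>/gi) ?? [];
  for (const tag of tags) {
    // Accept both double- and single-quoted attribute values.
    const key = /\b(?:name|property)\s*=\s*(["'])([\s\S]*?)\1/i.exec(tag)?.[2];
    const content = /\bcontent\s*=\s*(["'])([\s\S]*?)\1/i.exec(tag)?.[2];
    const alreadySeen =
      key !== undefined && Object.prototype.hasOwnProperty.call(found, key);
    if (key !== undefined && content !== undefined && !alreadySeen) {
      found[key] = content; // first occurrence wins
    }
  }
  return found;
}

// Everything before the closing head tag, mimicking a head-only scan.
function headOnly(html: string): string {
  const end = html.search(/<\/head\s*>/i);
  return end === -1 ? html : html.slice(0, end);
}

// Reduced stand-in for the YouTube markup: one meta tag in the head,
// the interesting ones in the body.
const page = [
  "<html><head>",
  '<meta name="theme-color" content="rgba(255, 255, 255, 0.98)" />',
  "</head><body>",
  '<meta property="og:site_name" content="YouTube" />',
  '<meta name="twitter:card" content="player" />',
  "</body></html>",
].join("\n");

const fromHead = collectMeta(headOnly(page)); // contains only theme-color
const fromDocument = collectMeta(page); // also contains og:site_name and twitter:card
```

With the head-only scan, the Open Graph and Twitter Card tags never reach the collector, which matches the symptom in the issue title.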

I will submit a PR to address that.

github-actions[bot] commented 3 years ago

:tada: This issue has been resolved in version 5.2.1 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

jacktuck commented 3 years ago

If it turns out that the title is often in the head but other metadata is in the body, we could in the future remove this optimisation altogether, or default to not having it and add an option flag for it.
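A rough sketch of what such a flag could look like, assuming the early exit is off by default. The option name `headOnly`, the `ScanOptions` interface, and the `scanTagNames` function are all hypothetical and are not part of unfurl's API. The regex tag walker only stands in for a real HTML parser.

```typescript
// Hypothetical option shape; the flag name is an assumption, not unfurl's API.
interface ScanOptions {
  // When true, stop at the closing head tag (less work, but misses body metadata).
  headOnly?: boolean;
}

// Walks tags in document order and returns the lower-cased names of the
// meta, title and link start tags it visited before stopping.
function scanTagNames(html: string, options: ScanOptions = {}): string[] {
  const headOnlyScan = options.headOnly ?? false; // default: scan the whole document
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/g;
  const seen: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = tagPattern.exec(html)) !== null) {
    const isClosing = match[1] === "/";
    const tagName = match[2].toLowerCase();
    if (isClosing) {
      if (tagName === "head" && headOnlyScan) break; // the early-exit optimisation
      continue;
    }
    if (tagName === "meta" || tagName === "title" || tagName === "link") {
      seen.push(tagName);
    }
  }
  return seen;
}

// A page shaped like the YouTube sample: favicon in the head, the rest in the body.
const youtubeLike =
  '<html><head><link rel="icon" href="/favicon.ico" /></head>' +
  "<body><title>Video - YouTube</title>" +
  '<meta property="og:type" content="video.other" /></body></html>';

const defaultScan = scanTagNames(youtubeLike); // ["link", "title", "meta"]
const optimisedScan = scanTagNames(youtubeLike, { headOnly: true }); // ["link"]
```

Defaulting to the full scan trades some parsing work for correctness on pages like YouTube's, while callers who know their targets keep metadata in the head can still opt in to the early exit.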