mozilla / page-metadata-service

DEPRECATED - A RESTful service that returns the metadata about a given URL.
Mozilla Public License 2.0

Add support for meta keywords #76

Closed pdehaan closed 2 years ago

pdehaan commented 8 years ago

We added basic support for meta keywords in https://github.com/mozilla/page-metadata-parser/pull/43, but we still need to tweak the page-metadata-service to support keywords as well. Maybe we should also change the service so that it returns upstream parser results by default instead of requiring an explicit opt-in for each field (unless that is stupid).

The fix looks superficially simple:

diff --git a/app/metadata.js b/app/metadata.js
index dff83b3..c173637 100644
--- a/app/metadata.js
+++ b/app/metadata.js
@@ -22,6 +22,7 @@ function getDocumentMetadata(url, window) {
     original_url: url,
     title: metadata.title,
     description: metadata.description,
+    keywords: metadata.keywords,
     favicon_url: metadata.icon_url ? makeUrlAbsolute(url, metadata.icon_url) : makeUrlAbsolute(url, '/favicon.ico'),
     images: []
   };

Ref: https://github.com/mozilla/page-metadata-parser/issues/47 ("Extended support for page keywords")
Ref: https://github.com/mozilla/page-metadata-parser/issues/48 ("Return keywords as array instead of string?")
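
On the opt-in question, here is a rough sketch of what "upstream results by default" could look like: spread whatever the parser returns into the response and only override the fields the service needs to post-process. `buildDocumentMetadata` is a hypothetical name, not the service's actual function, and the standard `URL` constructor stands in for `makeUrlAbsolute`; treat it as an illustration of the idea rather than a drop-in patch.

```javascript
// Hypothetical helper: passes every upstream parser field through by default.
// `metadata` is assumed to be the plain object returned by the parser.
function buildDocumentMetadata(url, metadata) {
  // Resolve a possibly relative URL against the page URL
  // (stand-in for the service's makeUrlAbsolute helper).
  const absolute = (relative) => new URL(relative, url).href;

  // Pull out icon_url so it is not duplicated next to favicon_url.
  const { icon_url: iconUrl, ...upstream } = metadata;

  return {
    ...upstream, // title, description, keywords, and any future fields
    original_url: url,
    favicon_url: absolute(iconUrl || '/favicon.ico'),
    images: []
  };
}

const example = buildDocumentMetadata('https://example.com/post', {
  title: 'Example',
  description: 'An example page',
  keywords: 'music, dance'
});
console.log(example.keywords);
```

The trade-off is that any new field added upstream shows up in the service response automatically, which is convenient but also means the response shape can change without a service release.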

pdehaan commented 8 years ago

Spotted https://www.youtube.com/channel/UCXNqkD43iJYHX6hwBXam3jg, which has some seriously messed up <meta name="keywords" ...> content, but better Open Graph tags. Not sure there is anything we can do apart from supporting Open Graph:

<meta name="keywords" content="&quot;double dream hands&quot; &quot;john jacobson&quot; music dance fun &quot;sprint guy&quot;">
...
<meta property="og:video:tag" content="double dream hands">
<meta property="og:video:tag" content="john jacobson">
<meta property="og:video:tag" content="music">
<meta property="og:video:tag" content="dance">
<meta property="og:video:tag" content="fun">
<meta property="og:video:tag" content="sprint guy">
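
If we did want to read the Open Graph tags, a minimal sketch could collect every `og:video:tag` value from the document. `getOpenGraphVideoTags` is a hypothetical helper and not part of page-metadata-parser; it assumes only a DOM-like `document` with the standard `querySelectorAll` and `getAttribute` methods.

```javascript
// Hypothetical helper: returns the content of every og:video:tag meta element.
// `doc` is assumed to be a DOM Document (or anything exposing querySelectorAll).
function getOpenGraphVideoTags(doc) {
  const nodes = doc.querySelectorAll('meta[property="og:video:tag"]');
  return Array.from(nodes)
    .map((node) => (node.getAttribute('content') || '').trim())
    .filter(Boolean); // drop empty or missing content attributes
}
```

For the markup above this would yield one entry per tag, so multi-word phrases such as "double dream hands" stay intact without any quote parsing.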

And via our Metadata parser:

$ http http://localhost:7001/v1/metadata urls:='["https://www.youtube.com/channel/UCXNqkD43iJYHX6hwBXam3jg"]' -j -v

POST /v1/metadata HTTP/1.1
Accept: application/json
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 70
Content-Type: application/json; charset=utf-8
Host: localhost:7001
User-Agent: HTTPie/0.9.1

{
    "urls": [
        "https://www.youtube.com/channel/UCXNqkD43iJYHX6hwBXam3jg"
    ]
}

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 636
Content-Type: application/json; charset=utf-8
Date: Tue, 23 Aug 2016 22:25:16 GMT
ETag: W/"27c-8YtcCl1pW4ut1hZuBN3QHw"

{
    "request_error": "",
    "url_errors": {},
    "urls": {
        "https://www.youtube.com/channel/UCXNqkD43iJYHX6hwBXam3jg": {
            "description": "All video John Jacobson!",
            "favicon_url": "https://www.youtube.com/yts/img/favicon_32-vfl8NGn4k.png",
            "images": [
                {
                    "entropy": 1,
                    "height": 500,
                    "url": "https://yt3.ggpht.com/-m1dhEVB67Kg/AAAAAAAAAAI/AAAAAAAAAAA/nSzR87de3G8/s900-c-k-no-mo-rj-c0xffffff/photo.jpg",
                    "width": 500
                }
            ],
            "keywords": "\"double dream hands\" \"john jacobson\" music dance fun \"sprint guy\"",
            "original_url": "https://www.youtube.com/channel/UCXNqkD43iJYHX6hwBXam3jg",
            "title": "John Jacobson",
            "type": "profile",
            "url": "https://www.youtube.com/user/JhnJacobson"
        }
    }
}
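
For the array-vs-string question, here is one possible way to turn a `keywords` string like the one in the response above into an array. `splitKeywords` is a hypothetical helper, and the rules are an assumption on my part: split on commas when there are any, otherwise split on whitespace while keeping double-quoted phrases together.

```javascript
// Hypothetical helper: converts a raw keywords string into an array.
// Assumed rules: comma-separated if commas are present, otherwise
// whitespace-separated with "double quoted phrases" kept as one keyword.
function splitKeywords(raw) {
  if (typeof raw !== 'string') return [];

  if (raw.includes(',')) {
    return raw.split(',').map((part) => part.trim()).filter(Boolean);
  }

  const tokens = [];
  // Either a quoted phrase (group 1) or a run of non-space, non-quote chars (group 2).
  const pattern = /"([^"]*)"|([^\s,"]+)/g;
  let match;
  while ((match = pattern.exec(raw)) !== null) {
    const token = (match[1] !== undefined ? match[1] : match[2]).trim();
    if (token) tokens.push(token);
  }
  return tokens;
}

const youtubeKeywords = '"double dream hands" "john jacobson" music dance fun "sprint guy"';
console.log(splitKeywords(youtubeKeywords));
// [ 'double dream hands', 'john jacobson', 'music', 'dance', 'fun', 'sprint guy' ]
```

This is only a heuristic; pages that mix commas and quotes, or use other separators, would need different handling.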

Sadly, it looks like Embedly fails hard on keywords for this page and just returns an empty array:

$ http https://embedly-proxy.services.mozilla.com/v2/extract urls:='["https://www.youtube.com/channel/UCXNqkD43iJYHX6hwBXam3jg"]' -j -v
...

"keywords": [],

http://embed.ly/docs/explore/extract?url=https%3A%2F%2Fwww.youtube.com%2Fuser%2FJhnJacobson