mozilla / page-metadata-service

DEPRECATED - A RESTful service that returns the metadata about a given URL.
Mozilla Public License 2.0

wikipedia doesn't have a lead image with current fathom ruleset #85

Closed edwindotcom closed 2 years ago

edwindotcom commented 8 years ago

To reproduce, use HTTPie:

http POST https://metadata.dev.mozaws.net/v1/metadata urls:='["https://en.wikipedia.org/wiki/Katie_Ledecky"]' content-type:application/json -v
POST /v1/metadata HTTP/1.1
Accept: application/json
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 57
Host: metadata.dev.mozaws.net
User-Agent: HTTPie/0.9.3
content-type: application/json

{
    "urls": [
        "https://en.wikipedia.org/wiki/Katie_Ledecky"
    ]
}

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 351
Content-Type: application/json; charset=utf-8
Date: Fri, 26 Aug 2016 00:02:46 GMT
ETag: W/"15f-yq7jMGNJTHESb0Xw52QPlg"
Server: nginx/1.9.9
Strict-Transport-Security: max-age=31536000

{
    "request_error": "",
    "url_errors": {},
    "urls": {
        "https://en.wikipedia.org/wiki/Katie_Ledecky": {
            "favicon_url": "https://en.wikipedia.org/static/apple-touch/wikipedia.png",
            "images": [],
            "original_url": "https://en.wikipedia.org/wiki/Katie_Ledecky",
            "title": "Katie Ledecky - Wikipedia, the free encyclopedia",
            "url": "https://en.wikipedia.org/wiki/Katie_Ledecky"
        }
    }
}

Notice that images is an empty array: "images": []
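
For anyone checking several URLs at once, here is a minimal sketch of how the empty case could be detected in the response body. The helper name findUrlsWithoutImages is hypothetical and not part of the service; the sample object copies only the fields shown in the response above.

```javascript
// Hypothetical helper (not part of page-metadata-service): returns the URLs
// whose metadata entry has a missing or empty "images" array.
function findUrlsWithoutImages(responseBody) {
  const urls = responseBody.urls || {};
  return Object.keys(urls).filter((url) => {
    const images = urls[url].images;
    return !Array.isArray(images) || images.length === 0;
  });
}

// Response body taken from the output above.
const sampleResponse = {
  request_error: "",
  url_errors: {},
  urls: {
    "https://en.wikipedia.org/wiki/Katie_Ledecky": {
      favicon_url: "https://en.wikipedia.org/static/apple-touch/wikipedia.png",
      images: [],
      original_url: "https://en.wikipedia.org/wiki/Katie_Ledecky",
      title: "Katie Ledecky - Wikipedia, the free encyclopedia",
      url: "https://en.wikipedia.org/wiki/Katie_Ledecky",
    },
  },
};

// Logs the list of URLs that came back without any image.
console.log(findUrlsWithoutImages(sampleResponse));
```

This only inspects the JSON the service already returned; it does not explain why the list is empty.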

This results in the following:

[Screenshot: screen shot 2016-08-25 at 5 00 48 pm]

Expected: a lead image is returned. Embedly does return an image for this page.

edwindotcom commented 8 years ago

Note: there are more comments in https://github.com/mozilla/activity-stream/issues/1178

pdehaan commented 8 years ago

Current image processing rules are in mozilla/page-metadata-parser/parser.js:40-46:

  image_url: [
    ['meta[property="og:image:secure_url"]', node => node.element.getAttribute('content')],
    ['meta[property="og:image:url"]', node => node.element.getAttribute('content')],
    ['meta[property="og:image"]', node => node.element.getAttribute('content')],
    ['meta[property="twitter:image"]', node => node.element.getAttribute('content')],
    ['meta[name="thumbnail"]', node => node.element.getAttribute('content')],
  ],
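
To make the failure mode concrete, here is a rough sketch of what these rules amount to, under a simplifying assumption: the rules are tried in the listed order and the first match wins. The real parser runs them through Fathom, which may rank matches differently. The names IMAGE_RULES and pickImageUrl, and the plain-object stand-ins for meta elements, are hypothetical.

```javascript
// Attribute/value pairs mirroring the image_url selectors quoted above.
const IMAGE_RULES = [
  { attr: "property", value: "og:image:secure_url" },
  { attr: "property", value: "og:image:url" },
  { attr: "property", value: "og:image" },
  { attr: "property", value: "twitter:image" },
  { attr: "name", value: "thumbnail" },
];

// metaTags is an array of plain objects such as
// { property: "og:image", content: "..." }, standing in for <meta> elements.
// Assumption: rules are tried in order and the first match wins.
function pickImageUrl(metaTags) {
  for (const rule of IMAGE_RULES) {
    const match = metaTags.find(
      (tag) => tag[rule.attr] === rule.value && tag.content
    );
    if (match) return match.content;
  }
  return null; // no rule matched, so the page yields no image
}

// Hypothetical page that exposes an Open Graph image.
const withOgImage = [
  { name: "description", content: "An example page" },
  { property: "og:image", content: "https://example.com/lead.png" },
];

// Hypothetical page whose meta tags match none of the rules.
const withoutImageTags = [
  { name: "generator", content: "ExampleCMS" },
  { name: "referrer", content: "origin" },
];

console.log(pickImageUrl(withOgImage));
console.log(pickImageUrl(withoutImageTags));
```

Under this model, a page that carries none of the five tags always produces an empty result, whatever images appear in its body.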

pdehaan commented 8 years ago

Current meta tags for https://en.wikipedia.org/wiki/Katie_Ledecky via meta-scraper:

`$ meta-scraper -u "https://en.wikipedia.org/wiki/Katie_Ledecky"`

``` html
Katie Ledecky - Wikipedia, the free encyclopedia
```

Sadly, it doesn't look like there are any tags we can scrape that match our page-metadata-parser rules in https://github.com/mozilla/page-metadata-service/issues/85#issuecomment-243200450