mozilla / page-metadata-service

DEPRECATED - A RESTful service that returns the metadata about a given URL.
Mozilla Public License 2.0

Parser fails on activity stream dev add-on download page #49

Closed pdehaan closed 8 years ago

pdehaan commented 8 years ago

Re: https://moz-activity-streams-dev.s3.amazonaws.com/dist/latest.html

$ http POST https://metadata.dev.mozaws.net/v1/metadata urls:='["https://moz-activity-streams-dev.s3.amazonaws.com/dist/latest.html"]' Content-Type:application/json -v

POST /v1/metadata HTTP/1.1
Accept: application/json
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 80
Content-Type: application/json
Host: metadata.dev.mozaws.net
User-Agent: HTTPie/0.9.1

{
    "urls": [
        "https://moz-activity-streams-dev.s3.amazonaws.com/dist/latest.html"
    ]
}

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 22
Content-Type: application/json; charset=utf-8
Date: Thu, 11 Aug 2016 20:45:01 GMT
ETag: W/"16-urTtGfwwfQX5N25qpNbXOg"
Server: nginx/1.9.9
Strict-Transport-Security: max-age=31536000
X-Powered-By: Express

{
    "error": "",
    "urls": {}
}

My request looks legit, but the service seems to be choking somewhere: it responds 200 OK with an empty "error" string and nothing in the "urls" object.
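To make this kind of silent failure easier to spot from a client, here is a minimal Python sketch. The helper name `unaccounted_urls` is hypothetical and not part of the service or any client library. It assumes the response shapes seen in this thread: a top-level `error` (or `request_error`) string, a `urls` object, and possibly a `url_errors` object.

```python
def unaccounted_urls(response, requested_urls):
    """Return the requested URLs that got neither metadata nor an error.

    Hypothetical helper: key names are taken from the responses shown in
    this issue, not from a documented API contract.
    """
    # A top-level error already explains why nothing came back.
    if response.get("error") or response.get("request_error"):
        return []
    urls = response.get("urls") or {}
    url_errors = response.get("url_errors") or {}
    # Anything missing from both maps was dropped silently.
    return [u for u in requested_urls if u not in urls and u not in url_errors]
```

Fed the response above, this would return the one requested URL, which is exactly the case being reported: no error, no metadata.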

Further investigation is needed, but judging by view-source:https://moz-activity-streams-dev.s3.amazonaws.com/dist/latest.html the page is very bare-bones and should parse fine.

Embed.ly returns some response, but it isn't great [because our download page is super minimal]: http://embed.ly/docs/explore/extract?url=https%3A%2F%2Fmoz-activity-streams-dev.s3.amazonaws.com%2Fdist%2Flatest.html

https://validator.w3.org/nu/#textarea seems to suggest that our HTML skills aren't great, and we're throwing invalid markup onto The Internet:

  1. Error: Bad value utf8 for attribute charset on element meta: utf8 is not a preferred encoding name. The preferred label for this encoding is utf-8.
  2. Error: Element head is missing a required instance of child element title.
  3. Error: style element between head and body.
  4. Fatal Error: Cannot recover after last error. Any further errors will be ignored.
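
For a quick local check of the first two validator errors (without going through the W3C validator), something like the following Python sketch could work. It uses only the standard library's `html.parser`; the class and function names are made up for illustration, and it is far looser than a real validator (for example, it does not check that the title sits inside head).

```python
from html.parser import HTMLParser

# Labels the validator flags, mapped to the preferred spelling.
NON_PREFERRED_CHARSETS = {"utf8": "utf-8"}


class HeadChecker(HTMLParser):
    """Records whether a <title> was seen and which charset labels were declared."""

    def __init__(self):
        super().__init__()
        self.has_title = False
        self.charsets = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.has_title = True
        elif tag == "meta":
            charset = dict(attrs).get("charset")
            if charset is not None:
                self.charsets.append(charset)


def check_page(html):
    """Return a list of human-readable problems found in the markup."""
    checker = HeadChecker()
    checker.feed(html)
    checker.close()
    problems = []
    if not checker.has_title:
        problems.append("missing <title>")
    for charset in checker.charsets:
        preferred = NON_PREFERRED_CHARSETS.get(charset.lower())
        if preferred:
            problems.append('charset label "%s" should be "%s"' % (charset, preferred))
    return problems
```

Run against a page shaped like ours (a `utf8` charset and no title), it should report both problems; a page with `utf-8` and a title should come back clean.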

jaredlockhart commented 8 years ago

I just tried this using the latest and got this result:

In [74]: fetch_metadata('https://moz-activity-streams-dev.s3.amazonaws.com/dist/latest.html')
Out[74]:
{u'request_error': u'',
 u'url_errors': {},
 u'urls': {u'https://moz-activity-streams-dev.s3.amazonaws.com/dist/latest.html': {u'favicon_url': u'https://moz-activity-streams-dev.s3.amazonaws.com/favicon.ico',
   u'images': [],
   u'original_url': u'https://moz-activity-streams-dev.s3.amazonaws.com/dist/latest.html',
   u'title': u'latest activity stream experiment addon',
   u'url': u'https://moz-activity-streams-dev.s3.amazonaws.com/dist/latest.html'}}}

Seems to be working now, closing this.

pdehaan commented 8 years ago

@jaredkerim Yes, I fixed that specific page w/ my glorious https://github.com/mozilla/activity-stream/pull/1076 PR.

But the question remains whether the core bug (an invalid page returns neither errors nor metadata) still exists, or whether our improved promise rejection handling and Sentry reporting have already fixed it.
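
To pin down what "fixed" would mean here, this is a rough Python sketch of the behavior I'd expect: every requested URL ends up in either `urls` or `url_errors`, never in neither. It is only an illustration, not the service's actual code (which is Node with promises). `collect_metadata` and `fetch_one` are hypothetical names, the error strings are invented, and the response keys are borrowed from the output above.

```python
def collect_metadata(urls, fetch_one):
    """Fetch metadata per URL, recording a per-URL error instead of dropping it.

    `fetch_one` is a hypothetical callable that returns a metadata dict for a
    URL, or raises when fetching or parsing fails.
    """
    result = {"request_error": "", "url_errors": {}, "urls": {}}
    for url in urls:
        try:
            metadata = fetch_one(url)
        except Exception as exc:
            # The analogue of handling a rejected promise: surface the failure.
            result["url_errors"][url] = str(exc) or exc.__class__.__name__
            continue
        if metadata:
            result["urls"][url] = metadata
        else:
            # An empty parse result is reported too, not silently skipped.
            result["url_errors"][url] = "no metadata extracted"
    return result
```

With this shape, a page whose markup breaks the parser would show up under `url_errors` rather than producing an empty, error-free response.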