mozilla / page-metadata-parser

DEPRECATED - A Javascript library for parsing metadata on a web page.
https://www.npmjs.com/package/page-metadata-parser
Mozilla Public License 2.0

Strip newlines from title value? #39

Closed: pdehaan closed 2 years ago

pdehaan commented 8 years ago

A bit of a curious case, but should we strip any newlines from the parsed title value?

I found the following markup on a random page:

<title>
Imperfect -Ugly produce delivery in Oakland and Berkeley
 :: FAQ
</title>

And scraping that page indeed returns a \n in the title value:

"title": "Imperfect -Ugly produce delivery in Oakland and Berkeley\n :: FAQ",

Full response below:

$ http POST https://metadata.dev.mozaws.net/v1/metadata urls:='["http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a"]' -j -v

POST /v1/metadata HTTP/1.1
Accept: application/json
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 103
Content-Type: application/json; charset=utf-8
Host: metadata.dev.mozaws.net
User-Agent: HTTPie/0.9.1

{
    "urls": [
        "http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a"
    ]
}

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 801
Content-Type: application/json; charset=utf-8
Date: Fri, 12 Aug 2016 00:00:12 GMT
ETag: W/"321-F7b8qrfWsYA4UlqUUNvgqg"
Server: nginx/1.9.9
Strict-Transport-Security: max-age=31536000
X-Powered-By: Express

{
    "error": "",
    "urls": {
        "http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a": {
            "description": "Imperfect offers home and office delivery of 'ugly' produce for 30% less than grocery store prices.  We are located in Oakland, California and deliver fruits and vegetables to Oakland, Berkeley, and Emeryville",
            "favicon_url": "http://shop.imperfectproduce.com/favicon.ico",
            "images": [
                {
                    "entropy": 1,
                    "height": 500,
                    "url": "http://shop.imperfectproduce.com/skin1/images/heartlogo.png",
                    "width": 500
                }
            ],
            "original_url": "http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a",
            "title": "Imperfect -Ugly produce delivery in Oakland and Berkeley\n :: FAQ",
            "url": "http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a"
        }
    }
}

And running this through my lame proxy diff tool shows that the Embedly proxy seemingly strips the newline, while the Fathom proxy doesn't:

$ node index --url "http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a"

Scraping http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a:

https://embedly-proxy.services.mozilla.com/v2/extract: 179.735ms
https://metadata.dev.mozaws.net/v1/metadata: 397.871ms

http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a:
  ...
  title:
    embedly: Imperfect -Ugly produce delivery in Oakland and Berkeley :: FAQ
    fathom:
      """
        Imperfect -Ugly produce delivery in Oakland and Berkeley
         :: FAQ
      """

TOTAL TIME: 426.121ms
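
For illustration, here is a minimal sketch of one way the title could be normalized so it matches the Embedly output above. It assumes that collapsing every run of whitespace to a single space is the desired behavior. `normalizeTitle` is a hypothetical helper name, not an existing function in page-metadata-parser.

```javascript
// Hypothetical helper (not part of page-metadata-parser's API):
// collapse whitespace in a scraped title value.
function normalizeTitle(rawTitle) {
  // Replace any run of whitespace (newlines, tabs, spaces) with one space,
  // then trim leading and trailing whitespace.
  return rawTitle.replace(/\s+/g, ' ').trim();
}

// The title value returned for the page in this issue.
const scraped = 'Imperfect -Ugly produce delivery in Oakland and Berkeley\n :: FAQ';

console.log(JSON.stringify(normalizeTitle(scraped)));
// prints "Imperfect -Ugly produce delivery in Oakland and Berkeley :: FAQ"
```

Collapsing whitespace rather than deleting only `\n` also covers the leading and trailing newlines that sit inside the `<title>` element in the markup above.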