Closed pdehaan closed 2 years ago
A bit of a curious case, but should we strip any newlines from the parsed title value?
I found the following markup in a random page:
<title>
Imperfect -Ugly produce delivery in Oakland and Berkeley
 :: FAQ
</title>
And scraping that page indeed returns a \n in the title value:
"title": "Imperfect -Ugly produce delivery in Oakland and Berkeley\n :: FAQ",
Full response below:
$ http POST https://metadata.dev.mozaws.net/v1/metadata urls:='["http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a"]' -j -v

POST /v1/metadata HTTP/1.1
Accept: application/json
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 103
Content-Type: application/json; charset=utf-8
Host: metadata.dev.mozaws.net
User-Agent: HTTPie/0.9.1

{
    "urls": [
        "http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a"
    ]
}

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 801
Content-Type: application/json; charset=utf-8
Date: Fri, 12 Aug 2016 00:00:12 GMT
ETag: W/"321-F7b8qrfWsYA4UlqUUNvgqg"
Server: nginx/1.9.9
Strict-Transport-Security: max-age=31536000
X-Powered-By: Express

{
    "error": "",
    "urls": {
        "http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a": {
            "description": "Imperfect offers home and office delivery of 'ugly' produce for 30% less than grocery store prices. We are located in Oakland, California and deliver fruits and vegetables to Oakland, Berkeley, and Emeryville",
            "favicon_url": "http://shop.imperfectproduce.com/favicon.ico",
            "images": [
                {
                    "entropy": 1,
                    "height": 500,
                    "url": "http://shop.imperfectproduce.com/skin1/images/heartlogo.png",
                    "width": 500
                }
            ],
            "original_url": "http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a",
            "title": "Imperfect -Ugly produce delivery in Oakland and Berkeley\n :: FAQ",
            "url": "http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a"
        }
    }
}
And running this through my lame proxy diff tool shows that Embedly proxy seemingly strips the newline, while the Fathom proxy doesn't:
$ node index --url "http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a"

Scraping http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a:
https://embedly-proxy.services.mozilla.com/v2/extract: 179.735ms
https://metadata.dev.mozaws.net/v1/metadata: 397.871ms

http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a:
...
title:
  embedly: Imperfect -Ugly produce delivery in Oakland and Berkeley :: FAQ
  fathom: """
    Imperfect -Ugly produce delivery in Oakland and Berkeley
     :: FAQ
  """

TOTAL TIME: 426.121ms
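If we do want to match the Embedly behavior, one option is to collapse whitespace in the title before returning it. Below is a minimal sketch of that idea, not the service's actual code: `normalizeTitle` is a hypothetical helper name, and where it would hook into the parser is an open question.

```javascript
// Hypothetical helper (not part of the existing metadata service):
// collapses whitespace in a scraped <title> value.
function normalizeTitle(rawTitle) {
  // Leave missing or non-string values untouched.
  if (typeof rawTitle !== 'string') {
    return rawTitle;
  }
  // Replace each run of whitespace (spaces, tabs, \r, \n) with a
  // single space, then trim leading and trailing whitespace.
  return rawTitle.replace(/\s+/g, ' ').trim();
}

// The title value from the response above.
const scraped = 'Imperfect -Ugly produce delivery in Oakland and Berkeley\n :: FAQ';
console.log(JSON.stringify(normalizeTitle(scraped)));
// prints "Imperfect -Ugly produce delivery in Oakland and Berkeley :: FAQ"
```

Collapsing rather than just deleting `\n` avoids gluing words together when the newline is the only separator. Note that `\s` in JavaScript also matches non-breaking spaces and other Unicode whitespace, which may or may not be what we want here.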