postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0

feat: Add a custom extractor for www.engadget.com. #552

Closed: jbrayton closed this 2 years ago

jbrayton commented 4 years ago

Add a custom extractor for www.engadget.com.

Engadget articles have dates, but I was unable to find one in a format I could parse. There are strings like "2h ago" and tags with blank values such as this:

<meta class="swiftype" name="published_at" data-type="date" value="">

So the extractor always returns a null date.
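
For reference, a relative string such as "2h ago" could in principle be turned into an approximate date. The sketch below is only an illustration of that idea: `parseRelativeDate` and `UNIT_MS` are hypothetical names, not part of postlight/parser, and the single-letter unit format ("s", "m", "h", "d") is an assumption based on the one example string above. The result depends on when the page was fetched, so it is only as accurate as the reference time passed in.

```javascript
// Hypothetical helper, not part of postlight/parser.
// Milliseconds per unit; the single-letter units are an assumption
// based on the "2h ago" example.
const UNIT_MS = {
  s: 1000,
  m: 60 * 1000,
  h: 60 * 60 * 1000,
  d: 24 * 60 * 60 * 1000,
};

// Converts strings like "2h ago" into a Date relative to `now`.
// Returns null for anything that does not match the assumed format.
function parseRelativeDate(text, now = new Date()) {
  const match = /^\s*(\d+)\s*([smhd])\s+ago\s*$/i.exec(text);
  if (!match) return null;
  const amount = Number(match[1]);
  const unit = match[2].toLowerCase();
  return new Date(now.getTime() - amount * UNIT_MS[unit]);
}

// The reference time here is arbitrary and chosen only for the example.
const reference = new Date('2020-04-24T12:00:00Z');
console.log(parseRelativeDate('2h ago', reference).toISOString());
// prints "2020-04-24T10:00:00.000Z"
console.log(parseRelativeDate('yesterday', reference));
// prints null
```

Because the computed value drifts with fetch time, returning a null date, as this extractor does, may still be the more honest choice.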

Engadget articles also have lead images, but I was unable to extract the lead image URL. For example, the fixture has:

<meta value="https://o.aolcdn.com/images/dims?resize=1200%2C630&amp;crop=1200%2C630%2C0%2C0&amp;quality=80&#x2111;uri=https%3A%2F%2Fs.yimg.com%2Fos%2Fcreatr-images%2F2020-04%2F7e5e3a50-8658-11ea-befb-f52e76d9e7b2&amp;client=amp-blogside-v2&amp;signature=193a0258fa9a401d2f1cdfc41909ac01e4db3147" name="og:image">

When I substituted a simpler URL for that value, the selector returned the image, so I suspect the &#x2111; sequence in the URL is what breaks the extraction. As a workaround, I incorporated the lead images into the HTML content.
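
A small diagnostic can show what the &#x2111; sequence becomes once the attribute is decoded. The sketch below is an assumption-laden illustration, not code from postlight/parser: `decodeBasicEntities` and `nonAsciiCharacters` are hypothetical helpers that handle only numeric references and &amp;amp;. &#x2111; is the numeric reference for U+2111, the same character the HTML named reference `&image;` maps to. One possible reading, which nothing in this PR confirms, is that the original query string contained a parameter beginning with "image" that was decoded as an entity somewhere upstream.

```javascript
// Hypothetical diagnostic helpers, not part of postlight/parser.

// Decodes only numeric character references and &amp;.
function decodeBasicEntities(value) {
  return value
    // Hexadecimal references such as &#x2111;
    .replace(/&#x([0-9a-f]+);/gi, (_, hex) =>
      String.fromCodePoint(parseInt(hex, 16))
    )
    // Decimal references such as &#8465;
    .replace(/&#(\d+);/g, (_, dec) => String.fromCodePoint(Number(dec)))
    // &amp; goes last so "&amp;#x2111;" is not decoded twice
    .replace(/&amp;/g, '&');
}

// Lists every character outside the ASCII range.
function nonAsciiCharacters(value) {
  return Array.from(value).filter(ch => ch.codePointAt(0) > 0x7f);
}

// The og:image value exactly as quoted from the fixture above.
const raw =
  'https://o.aolcdn.com/images/dims?resize=1200%2C630' +
  '&amp;crop=1200%2C630%2C0%2C0&amp;quality=80&#x2111;uri=' +
  'https%3A%2F%2Fs.yimg.com%2Fos%2Fcreatr-images%2F2020-04%2F' +
  '7e5e3a50-8658-11ea-befb-f52e76d9e7b2' +
  '&amp;client=amp-blogside-v2' +
  '&amp;signature=193a0258fa9a401d2f1cdfc41909ac01e4db3147';

const decoded = decodeBasicEntities(raw);

// Code points of any non-ASCII characters in the decoded URL.
console.log(
  nonAsciiCharacters(decoded).map(ch => ch.codePointAt(0).toString(16))
);

// The WHATWG URL parser percent-encodes the character in the query.
console.log(new URL(decoded).search);
```

If the decoded value does contain a non-ASCII character, any step that validates or normalizes the URL could plausibly reject or alter it, which would be consistent with the simpler URL working.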

If a reviewer knows a good way to address either of these issues, I am happy to make the change.