postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0

feat: ma.ttias.be extractor #551

Closed: jbrayton closed 2 years ago

jbrayton commented 4 years ago

When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. It also demoted h1 elements to h2, giving other h2 elements the appearance of being at the same level in the organizational hierarchy of the document. This resolves these issues as follows:

The site does not have deks or lead images, so those are not in the extractor.
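
For illustration only, a custom extractor addressing the problems described above might look roughly like the sketch below. It is not the code from this PR. The selectors are placeholder guesses, and the `mercury-parser-keep` class name and the `h2` to `h3` mapping are assumptions about how headings and ordered lists could be protected from cleaning while keeping them one level below the demoted `h1` elements.

```javascript
// Hypothetical sketch; selectors and class name are assumptions, not the merged code.
const KEEP_CLASS = 'mercury-parser-keep'; // assumed: class the content cleaner leaves alone

// Mark a node so cleaning keeps it; returning null leaves the tag name as-is.
const keep = $node => {
  $node.attr('class', KEEP_CLASS);
  return null;
};

const MaTtiasBeExtractor = {
  domain: 'ma.ttias.be',

  // Placeholder selectors; the real page markup may differ.
  title: { selectors: [['meta[name="twitter:title"]', 'value']] },
  author: { selectors: [['meta[name="author"]', 'value']] },

  content: {
    selectors: ['.content'], // assumed container for the issue body
    transforms: {
      h1: keep, // the parser demotes these to h2, per the description
      // Assumed approach: push h2 down to h3 so it stays below the demoted h1.
      h2: $node => {
        $node.attr('class', KEEP_CLASS);
        return 'h3';
      },
      ol: keep,
    },
    clean: [],
  },

  // No dek or lead_image_url entries: the site has neither.
};
```

Returning a string from a transform asks the parser to rename the element, while a function that returns nothing useful only mutates the node; the sketch relies on both behaviors.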