postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0

Wrong content matched for a2hosting.com/kb/ articles #666

Open iandunn opened 2 years ago

iandunn commented 2 years ago

Expected Behavior

https://www.a2hosting.com/kb/installable-applications/optimization-and-configuration/wordpress2/using-apc-or-opcache-with-wordpress?jr=on should show the contents of the article (div.text-commens [sic]).

This article discusses using APC or OPcache...

Current Behavior

Instead, it shows the contents of an aside/widget (div.article-subscribe):

Subscribe to receive weekly cutting edge tips, strategies, and news you need to grow your web business...

Steps to Reproduce

  1. Visit https://www.a2hosting.com/kb/installable-applications/optimization-and-configuration/wordpress2/using-apc-or-opcache-with-wordpress
  2. Activate the extension

Possible Solution

IIRC, other parsers take text length into account when the markup doesn't use good semantics. It's probably safe to assume that the longest section is the main content.
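To make the length idea concrete, here is a minimal sketch of a "longest section wins" fallback. It is not how the parser currently scores content; the `Candidate` shape and the `wordCount` and `longestCandidate` helpers are hypothetical names invented for illustration, and the candidates are assumed to have been extracted from the page already.

```typescript
// Hypothetical shape: one entry per candidate container found on the page.
interface Candidate {
  selector: string; // CSS selector of the container, e.g. "div.text-commens"
  text: string;     // visible text extracted from that container
}

// Count words rather than characters so runs of whitespace don't inflate the score.
function wordCount(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Return the candidate with the most words, or undefined for an empty list.
// On a tie, the earliest candidate in the list is kept.
function longestCandidate(candidates: Candidate[]): Candidate | undefined {
  let best: Candidate | undefined;
  for (const candidate of candidates) {
    if (best === undefined || wordCount(candidate.text) > wordCount(best.text)) {
      best = candidate;
    }
  }
  return best;
}
```

For the page in this report, the short subscribe widget would lose to the article body as long as the article is the longer of the two. A real implementation would probably combine length with the existing semantic signals rather than replace them.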

That will never be perfect, though, so it could be really helpful to give the user a way to quickly swap between different elements, almost like a photo gallery slider. If the parser thinks there are 3 elements that are likely to hold the content, then it could show the one with the highest probability first. Then, there could be left/right arrows on the edges for the reader to swap between the different elements.
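A rough sketch of the state behind that slider, assuming the parser can expose a score per candidate. `ScoredCandidate`, `CandidateCarousel`, and the default limit of 3 are all hypothetical and not part of the parser's API. The left/right arrows would call `previous()` and `next()`.

```typescript
// Hypothetical shape: a candidate container plus the parser's confidence in it.
interface ScoredCandidate {
  selector: string;
  score: number; // higher means more likely to be the main content
}

// Keeps the top-N candidates in descending score order and tracks which one is shown.
class CandidateCarousel {
  private readonly ranked: ScoredCandidate[];
  private index = 0;

  constructor(candidates: ScoredCandidate[], limit: number = 3) {
    if (candidates.length === 0) {
      throw new Error("at least one candidate is required");
    }
    // Copy before sorting so the caller's array is left untouched.
    this.ranked = [...candidates]
      .sort((a, b) => b.score - a.score)
      .slice(0, Math.max(1, limit));
  }

  // The candidate currently displayed; starts at the highest-scoring one.
  current(): ScoredCandidate {
    return this.ranked[this.index]!;
  }

  // Right arrow: move to the next-best candidate, wrapping to the start.
  next(): ScoredCandidate {
    this.index = (this.index + 1) % this.ranked.length;
    return this.current();
  }

  // Left arrow: move back, wrapping to the end.
  previous(): ScoredCandidate {
    this.index = (this.index - 1 + this.ranked.length) % this.ranked.length;
    return this.current();
  }
}
```

Wrapping around at both ends is one design choice; stopping at the edges would work just as well and might make it clearer when the reader has run out of alternatives.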