mozilla / readability

A standalone version of the readability lib

fix: resolve the bug of checkByline when there is a byline attribute in metadata #869

Closed fu1996 closed 5 days ago

fu1996 commented 1 month ago

Resolves an issue that occurs when the head contains author information in a meta tag, such as <meta name="og: article: author" content="xxxx">, and the body also contains an element such as <strong id="timeline">yyyy</strong>. In that case, checkByline returns an incorrect result and the yyyy text does not appear in the parsed content.

cmkm commented 1 month ago

@fu1996 Hi there, thanks for your PR! It would be helpful to understand a more concrete example of the problem this patch fixes. I think this may also result in byline duplication in some cases. Would you be able to share which issue this addresses or provide an example website?

fu1996 commented 1 month ago


@cmkm Currently, on this website, https://www.accesswire.com/860018/network-to-code-and-internet2-team-up-to-pioneer-network-automation-across-research-and-education-community , the head already contains author information in <meta name="og: article: author" and the HTML of the main text includes <strong id="dateline">NEW York, NY/ACCESSWIRE/May 7, 2024/</strong>. As a result, the final parsed content loses the "NEW York, NY/ACCESSWIRE/May 7, 2024/" section.
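
The behaviour described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not Readability's actual source: the function extract, the flag skipWhenMetadataHasByline, the { id, text } node shape, and the BYLINE_PATTERN regex are all hypothetical names introduced for illustration. It assumes that a node whose id looks like a byline is removed from the content, and that a byline found in metadata takes precedence in the result.

```javascript
// Simplified model of the problem discussed in this PR.
// NOT Readability's source; every name below is hypothetical.

// Assumed pattern for id strings that look like a byline.
const BYLINE_PATTERN = /byline|author|dateline/i;

/**
 * nodes: array of { id, text } objects standing in for DOM elements.
 * metadataByline: author string taken from <meta> tags, or null.
 * skipWhenMetadataHasByline: true models the patched behaviour.
 */
function extract(nodes, metadataByline, skipWhenMetadataHasByline) {
  let nodeByline = null;
  const content = [];
  for (const node of nodes) {
    const looksLikeByline = BYLINE_PATTERN.test(node.id);
    const bylineAlreadyKnown =
      nodeByline !== null ||
      (skipWhenMetadataHasByline && metadataByline !== null);
    if (looksLikeByline && !bylineAlreadyKnown) {
      // Treat the node as the byline and drop it from the content.
      nodeByline = node.text.trim();
      continue;
    }
    content.push(node.text);
  }
  // Assumption: the metadata byline wins over one found in the body.
  return { byline: metadataByline ?? nodeByline, content };
}

const nodes = [
  { id: "dateline", text: "NEW York, NY/ACCESSWIRE/May 7, 2024/" },
  { id: "body", text: "Article text." },
];

// Unpatched model: the dateline is removed, yet the metadata byline is reported.
const before = extract(nodes, "xxxx", false);
// Patched model: the dateline stays in the content.
const after = extract(nodes, "xxxx", true);

console.log(before.content);
console.log(after.content);
```

In this model, the unpatched call drops the dateline text from the content even though the reported byline comes from the metadata, which is the loss described above. The patched call keeps the dateline in the content.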

fchasen commented 5 days ago

Thanks, that makes sense in this use case to keep the content.

This could definitely cause some articles to have duplicated text, but I think we'd rather go that way than remove a byline that was needed.