Closed by j0k3r 7 years ago
I had a look, and I think what's going on is that the single_page_link directive causes the print version to be retrieved. The markup on that page is somewhat different. For the last test URL, this is in fact what is retrieved: https://www.washingtonpost.com/lifestyle/magazine/the-sorry-fate-of-a-tech-pioneer-halsey-minor-and-historic-virginia-estate-carters-grove/2012/05/30/gJQAwdJG4U_print.html
But it makes sense to remove the author element from the body if it's already being extracted correctly.
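To make the print-version behavior concrete, here is a rough Python sketch of what following a single-page link amounts to. It is not the parser's actual implementation: the real directive is an XPath expression in the site config, while this sketch substitutes a simple suffix match on `href`. The names `PrintLinkFinder` and `resolve_single_page_link`, the `_print.html` default, and the `example.com` URL are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PrintLinkFinder(HTMLParser):
    """Remembers the first <a href> whose target ends with a given suffix."""

    def __init__(self, suffix):
        super().__init__()
        self.suffix = suffix
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.href is None:
            href = dict(attrs).get("href")
            if href and href.endswith(self.suffix):
                self.href = href


def resolve_single_page_link(page_url, html, suffix="_print.html"):
    """Return the URL whose markup would actually be parsed.

    Hypothetical helper: if the page links to a print version, that URL
    is returned (resolved against page_url); otherwise page_url itself.
    """
    finder = PrintLinkFinder(suffix)
    finder.feed(html)
    if finder.href is None:
        return page_url
    return urljoin(page_url, finder.href)


# Made-up page that links to its own print version.
story = "https://example.com/news/abc_story.html"
markup = '<p><a href="abc_print.html">Print</a></p>'
print(resolve_single_page_link(story, markup))
```

Because the extraction rules then run against the print page rather than the story page, any selector written for the story markup can miss or misplace elements such as the author.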
Good catch! Thanks
For this URL: https://www.washingtonpost.com/world/national-security/trump-to-meet-russian-foreign-minister-at-the-white-house-as-moscows-alleged-election-interference-is-back-in-spotlight/2017/05/10/c6717e4c-34f3-11e7-b412-62beef8121f7_story.html
The first content is converted to:
When the original content is: