j0k3r / graby

Graby helps you extract article content from web pages
MIT License

Links get removed #103

Closed by j0k3r 7 years ago

j0k3r commented 7 years ago

For this URL: https://www.washingtonpost.com/world/national-security/trump-to-meet-russian-foreign-minister-at-the-white-house-as-moscows-alleged-election-interference-is-back-in-spotlight/2017/05/10/c6717e4c-34f3-11e7-b412-62beef8121f7_story.html

The first element of the content is converted to:

<h3>By  and ,</h3>

But the original content is:

<span class="pb-byline" itemprop="author" itemscope="" itemtype="http://schema.org/Person">
    By
    <a href="https://www.washingtonpost.com/people/carol-morello/">
        <span itemprop="name">Carol Morello</span>
    </a> 
    and 
    <a href="https://www.washingtonpost.com/people/greg-miller/">
        <span itemprop="name">Greg Miller</span>
    </a>
</span>

fivefilters commented 7 years ago

I had a look, and I think what's going on is that the single_page_link directive causes the print version to be retrieved. The markup on that page is somewhat different. For the last test URL, this is in fact what is retrieved: https://www.washingtonpost.com/lifestyle/magazine/the-sorry-fate-of-a-tech-pioneer-halsey-minor-and-historic-virginia-estate-carters-grove/2012/05/30/gJQAwdJG4U_print.html

But it makes sense to remove the author element from the body if it's already being extracted correctly.
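
To illustrate the idea of pulling the authors out first and then dropping the byline from the body, here is a minimal sketch. It is not graby's implementation (graby is PHP and driven by site config files); it uses Python's standard library only, and the wrapping `<div>`, the `<p>` paragraph, and the helper names `extract_authors` and `strip_bylines` are assumptions made for the example. The byline markup itself is the one quoted in the first comment.

```python
import xml.etree.ElementTree as ET

# Hypothetical article fragment: the <div> wrapper and <p> are made up,
# the byline <span> is the markup quoted in this issue.
ARTICLE = """
<div>
  <span class="pb-byline" itemprop="author" itemscope="" itemtype="http://schema.org/Person">
    By
    <a href="https://www.washingtonpost.com/people/carol-morello/">
      <span itemprop="name">Carol Morello</span>
    </a>
    and
    <a href="https://www.washingtonpost.com/people/greg-miller/">
      <span itemprop="name">Greg Miller</span>
    </a>
  </span>
  <p>Article body.</p>
</div>
""".strip()


def extract_authors(root):
    """Collect names from itemprop="name" nodes inside itemprop="author" elements."""
    names = []
    for author in root.iter():
        if author.get("itemprop") != "author":
            continue
        for node in author.iter():
            if node.get("itemprop") == "name" and node.text:
                names.append(node.text.strip())
    return names


def strip_bylines(root):
    """Remove every itemprop="author" element from the tree, in place."""
    # Snapshot the parents first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get("itemprop") == "author":
                parent.remove(child)


root = ET.fromstring(ARTICLE)
authors = extract_authors(root)   # read the names before touching the body
strip_bylines(root)               # then drop the byline from the content
body_text = " ".join("".join(root.itertext()).split())
```

The order matters in this sketch: the names are read before the byline is removed, so the body no longer carries a half-empty "By and" fragment while the author list stays intact.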

j0k3r commented 7 years ago

Good catch! Thanks