j0k3r / graby

Graby helps you extract article content from web pages
MIT License
363 stars 73 forks

Handle "if_page_contains" for "next_page_link" #193

Closed j0k3r closed 5 years ago

j0k3r commented 5 years ago

single_page_link takes priority over next_page_link when both are defined in the site config file and if_page_contains rules are defined.

The problem appears when you try to fetch the homepage of rollingstone.com, for example (yeah, bad idea anyway). The page source contains the next_page_link defined in the site config. But since graby doesn't handle if_page_contains for next_page_link, it started endlessly fetching every page linked from the rollingstone homepage. Gasp.

Follow-up to https://github.com/j0k3r/graby/pull/190
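
To illustrate the behaviour described above, here is a minimal, language-agnostic sketch in Python. It is not graby's PHP implementation: `LinkRule`, `next_page_url`, `fetch_all`, the regex-based link finder and the in-memory `PAGES` map are all hypothetical names invented for this example, and a real site config would use XPath expressions rather than Python callables. The idea it shows is that a next_page_link rule is only followed when its if_page_contains condition matches the current page, so a homepage that merely contains a matching link no longer triggers a crawl.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class LinkRule:
    """Hypothetical stand-in for one next_page_link entry in a site config."""
    find_link: Callable[[str], Optional[str]]  # returns the next URL, or None
    if_page_contains: Optional[Callable[[str], bool]] = None  # optional guard


def next_page_url(html: str, rules: List[LinkRule]) -> Optional[str]:
    """Return the first next-page URL whose guard (if any) matches the page."""
    for rule in rules:
        # Skip the rule when it has a condition that the page does not satisfy.
        if rule.if_page_contains is not None and not rule.if_page_contains(html):
            continue
        url = rule.find_link(html)
        if url:
            return url
    return None


def fetch_all(start: str, pages: Dict[str, str], rules: List[LinkRule],
              max_pages: int = 10) -> List[str]:
    """Follow next-page links from `start`, using `pages` as a fake web."""
    fetched: List[str] = []
    url: Optional[str] = start
    # Stop on unknown URLs, loops, or when the safety cap is reached.
    while url in pages and url not in fetched and len(fetched) < max_pages:
        fetched.append(url)
        url = next_page_url(pages[url], rules)
    return fetched


def find_next(html: str) -> Optional[str]:
    """Assumed link pattern: an anchor with class "next"."""
    match = re.search(r'<a class="next" href="([^"]+)"', html)
    return match.group(1) if match else None


def is_article(html: str) -> bool:
    """Assumed condition: the page contains an <article> element."""
    return "<article" in html


# Fake site: a paginated homepage and a two-part story.
PAGES = {
    "/": '<div class="home"><a class="next" href="/page/2">More</a></div>',
    "/page/2": '<div class="home"><a class="next" href="/page/3">More</a></div>',
    "/page/3": '<div class="home"></div>',
    "/story": '<article>Part 1</article><a class="next" href="/story/2">Next</a>',
    "/story/2": "<article>Part 2</article>",
}

unguarded = [LinkRule(find_next)]
guarded = [LinkRule(find_next, if_page_contains=is_article)]

print(fetch_all("/", PAGES, unguarded))   # walks the homepage pagination
print(fetch_all("/", PAGES, guarded))     # stops at the homepage
print(fetch_all("/story", PAGES, guarded))  # still follows real article pages
```

The `max_pages` cap and the already-fetched check are extra safety nets added for this sketch; the guard on the rule is what keeps the homepage case from paginating at all.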

coveralls commented 5 years ago

Coverage Status

Coverage increased (+0.01%) to 97.764% when pulling 1b271a46d4d873de40dcfc7dc6ec2abfaccc0753 on feature/page-contains-next-page-link into 802d22d2d1f5de19feeaab5dcf70142bb1ebc12b on master.