j0k3r / graby

Graby helps you extract article content from web pages
MIT License
363 stars 73 forks

next_page_link and dynamically loaded pages not following strip commands #258

Open kour1er opened 3 years ago

kour1er commented 3 years ago

I'm not sure if this is an issue, or me doing something dumb :)

When using the 'next_page_link' command together with the strip command, the strip is applied to the first page, but the dynamically loaded subsequent pages don't seem to obey it. For example, the Ars Technica article Huawei’s HarmonyOS: “Fake it till you make it” meets OS development has four pages. If I use the following config as an example:

body: //div[contains(@class,'article-content')]
title: //div[@id='story']//h2[@class='title']
date: //div[@class='byline']/span[@class='posted']//abbr/@original-title
date: //div[@class='byline']/span[@class='posted']//abbr
date: //*[@class='byline']//time[@class='date']
author: //p[@class='byline']/span[@class='author']
author: //p[@class='byline']/a
next_page_link: //nav//a[contains(text(), 'Next')]/@href
next_page_link: //span[@class='numbers']//a/span[@class='next']/..
next_page_link: //nav//a/span[contains(text(), 'Next')]/../@href
strip: //p

The p tag only gets stripped on the first page, not on the additional three pages (stripping all p tags is obviously a silly example, but it's just to illustrate). Is there any way to force the rules (in this case the silly strip: //p) to apply to the dynamically loaded subsequent pages as well?
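
To make the behaviour I'm expecting concrete, here is a rough Python sketch. It is not Graby's code (Graby is PHP), and the names strip_matching and merge_pages are hypothetical, made up purely for illustration. It assumes each page arrives as a well-formed XML fragment and uses the limited XPath subset of Python's standard library, so the rule is written as .//p rather than //p. The point is only that the strip rule runs on every page before the pages are joined:

```python
import xml.etree.ElementTree as ET


def strip_matching(root, path):
    """Remove every element under `root` matched by `path`.

    Hypothetical helper, for illustration only. Text that follows a
    removed element (its tail) is dropped along with it.
    """
    # ElementTree elements do not know their parent, so build a lookup first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for element in root.findall(path):
        parent = parent_of.get(element)
        if parent is not None:
            parent.remove(element)


def merge_pages(pages, strip_path):
    """Apply the strip rule to each page, then join the pages into one <div>."""
    merged = ET.Element("div")
    for markup in pages:
        page = ET.fromstring(markup)
        strip_matching(page, strip_path)  # applied to page 1 and to every later page
        merged.extend(list(page))
    return merged


# Two made-up pages standing in for a multi-page article.
pages = [
    "<div><h2>Part 1</h2><p>first page text</p></div>",
    "<div><h2>Part 2</h2><p>second page text</p></div>",
]
merged = merge_pages(pages, ".//p")
print(ET.tostring(merged, encoding="unicode"))
```

With this, the p elements from both pages are gone from the merged result, which is what I would expect the strip rule to do across all four pages of the article.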