fgtham closed 3 years ago
This patch makes article body extraction for nature.com more exact:
--- a/nature.com.txt 2021-07-23 12:11:36.331873505 +0200
+++ b/nature.com.txt 2021-07-23 12:11:17.747730246 +0200
@@ -2,7 +2,7 @@
 date: //meta[@name="dc.date"]/@content
 date: //meta[@name="prism.publicationDate"]/@content
 author: //meta[@name='dc.creator']/@content
-body: //div[contains(concat(' ',normalize-space(@class),' '),' article__body ')] | //div[contains(concat(' ',normalize-space(@class),' '),' article-body ')]
+body: //div[contains(concat(' ',normalize-space(@class),' '),' article__body ')] | //div[contains(concat(' ',normalize-space(@class),' '),' article-body ')] | //div[contains(concat(' ',normalize-space(@class),' '),' c-article-body ')]
 strip: //div[contains(concat(' ',normalize-space(@id),' '),' further-reading-section ')]
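For readers unfamiliar with the `contains(concat(' ',normalize-space(@class),' '),' name ')` idiom used in the `body:` rule, here is a rough pure-Python sketch of what it tests. It is only an illustration, not how the site-config parser evaluates XPath, and the helper names (`normalize_space`, `has_class_token`, `is_body_div`) are made up for this example. It also shows why the existing ` article-body ` alternative does not already cover a `c-article-body` class, which is presumably why the patch adds a third alternative.

```python
# Sketch of the XPath class-token idiom in plain Python (stdlib only).
# All function names here are hypothetical, chosen for illustration.

def normalize_space(value: str) -> str:
    # Approximates XPath normalize-space(): trim the string and collapse
    # internal whitespace runs to a single space.
    return " ".join(value.split())


def has_class_token(class_attr: str, token: str) -> bool:
    # Equivalent in spirit to:
    #   contains(concat(' ', normalize-space(@class), ' '), ' token ')
    # Padding both sides with spaces makes the test match whole
    # class names only, never substrings of longer names.
    return f" {token} " in f" {normalize_space(class_attr)} "


# The three class names selected by the patched body: rule.
BODY_CLASSES = ("article__body", "article-body", "c-article-body")


def is_body_div(class_attr: str) -> bool:
    # Mirrors the XPath union (|): any one matching class is enough.
    return any(has_class_token(class_attr, name) for name in BODY_CLASSES)


if __name__ == "__main__":
    # "c-article-body" is not matched by the "article-body" token test,
    # because the character before "article" is "-" rather than a space.
    print(has_class_token("c-article-body", "article-body"))
    print(is_body_div("c-article-body u-clearfix"))
```

The class string `c-article-body u-clearfix` in the usage lines is an invented example value, not taken from nature.com markup.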
That's good news, but you need to submit your patch to https://github.com/fivefilters/ftr-site-config instead.