j0k3r / graby-site-config

Graby site config files

nature.com Improvement #42

Closed fgtham closed 3 years ago

fgtham commented 3 years ago

This patch improves article body extraction for nature.com by also matching div elements with the c-article-body class:

--- a/nature.com.txt    2021-07-23 12:11:36.331873505 +0200
+++ b/nature.com.txt    2021-07-23 12:11:17.747730246 +0200
@@ -2,7 +2,7 @@
 date: //meta[@name="dc.date"]/@content
 date: //meta[@name="prism.publicationDate"]/@content
 author: //meta[@name='dc.creator']/@content
-body: //div[contains(concat(' ',normalize-space(@class),' '),' article__body ')] | //div[contains(concat(' ',normalize-space(@class),' '),' article-body ')]
+body: //div[contains(concat(' ',normalize-space(@class),' '),' article__body ')] | //div[contains(concat(' ',normalize-space(@class),' '),' article-body ')] | //div[contains(concat(' ',normalize-space(@class),' '),' c-article-body ')]

 strip: //div[contains(concat(' ',normalize-space(@id),' '),' further-reading-section ')]
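
To illustrate why the extra alternative is needed, here is a minimal Python sketch of the class-matching idiom `contains(concat(' ',normalize-space(@class),' '),' name ')` used in the `body:` rule. It only emulates the XPath string functions in plain Python; the helper names (`normalize_space`, `has_class_token`, `matches_body`) are hypothetical and this is not how the parser itself evaluates the rule. Because the idiom matches whole class tokens, ` article-body ` does not match an element whose class is `c-article-body`.

```python
import re

# XPath's normalize-space() treats space, tab, CR and LF as whitespace.
_XPATH_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(value: str) -> str:
    """Collapse whitespace runs to one space and trim both ends."""
    return _XPATH_WS.sub(" ", value).strip(" ")


def has_class_token(class_attr: str, token: str) -> bool:
    """Emulate contains(concat(' ', normalize-space(@class), ' '), ' token ')."""
    padded = " " + normalize_space(class_attr) + " "
    return (" " + token + " ") in padded


# The three class names joined with "|" in the patched body: rule.
BODY_CLASSES = ("article__body", "article-body", "c-article-body")


def matches_body(class_attr: str) -> bool:
    """True if any of the three class tokens is present."""
    return any(has_class_token(class_attr, name) for name in BODY_CLASSES)


if __name__ == "__main__":
    # Whole-token matching: "article-body" is not a token of "c-article-body".
    print(has_class_token("c-article-body", "article-body"))
    # With the patched rule, the element is matched by the third alternative.
    print(matches_body("  main\n c-article-body  "))
```

A plain `contains(@class, 'article-body')` would have matched `c-article-body` as a substring, but it would also match unrelated classes such as `article-body-footer`, which is the usual reason for the padded form.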

j0k3r commented 3 years ago

That's good news, but you have to submit your patch to https://github.com/fivefilters/ftr-site-config instead.