kba / rssscrpr

Scrape web content to RSS feeds
https://rssscrpr.herokuapp.com/
MIT License
1 star, 2 forks

Can you provide me an example? #21

Open mulysatest opened 7 years ago

mulysatest commented 7 years ago

First of all, thank you for your great script. I tried the demo and learned the syntax from the wiki for extracting content from HTML into RSS. However, I find the syntax really difficult to understand, and I can't manage to extract the proper content. Can you give me a demo of how to extract an RSS feed from this page: http://www.popularmechanics.com/search/drone?

Thank you in advance.

mulysatest commented 7 years ago

@kba Any idea?

kba commented 7 years ago

Hi, thanks for the interest. @zuphilip made some very helpful posts in the wiki that could help you get started.

Here's an example setup for your site: https://rssscrpr.herokuapp.com/api.php?url=http%3A%2F%2Fwww.popularmechanics.com%2Fsearch%2Fdrone&action=scrape-html&scraper=XpathScraper&scraper_xpathItem=%2F%2Fli%5B%40class%3D%22search-results--result++search-results--single-item%22%5D&scraper_xpathTitle=.%2F%2Fa%5B%40class%3D%22search-results--title+link+link-txt%22%5D&scraper_xpathLink=.%2F%2Fa%2F%40href&scraper_xpathAuthor=.%2Ftd%5B2%5D%2Ftext()%5B1%5D&scraper_xpathDescription=%2F%2Fdiv%5B%40class%3D%22search-results--abstract%22%5D&scraper_xpathDate=%2F%2Fspan%5B%40class%3D%22search-results--date%22%5D&scraper_feedTitle=Search+Results+for+%27drone%27+in+Popular+Mechanics&fetcher=HttpFetcher&parser=HTMLParser
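The link above is hard to read because every XPath expression is percent-encoded. As a minimal sketch, the same request can be assembled from readable parameters with Python's standard library. The parameter names and values below are taken from the decoded link; the helper name `build_scrape_url` is made up for illustration and is not part of rssscrpr.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Endpoint taken from the link above.
API_ENDPOINT = "https://rssscrpr.herokuapp.com/api.php"

# Query parameters as they appear in the link once decoded.
# Note the two spaces inside the xpathItem class value ("++" in the link).
params = {
    "url": "http://www.popularmechanics.com/search/drone",
    "action": "scrape-html",
    "scraper": "XpathScraper",
    "scraper_xpathItem": '//li[@class="search-results--result  search-results--single-item"]',
    "scraper_xpathTitle": './/a[@class="search-results--title link link-txt"]',
    "scraper_xpathLink": ".//a/@href",
    "scraper_xpathAuthor": "./td[2]/text()[1]",
    "scraper_xpathDescription": '//div[@class="search-results--abstract"]',
    "scraper_xpathDate": '//span[@class="search-results--date"]',
    "scraper_feedTitle": "Search Results for 'drone' in Popular Mechanics",
    "fetcher": "HttpFetcher",
    "parser": "HTMLParser",
}


def build_scrape_url(endpoint, query_params):
    """Hypothetical helper: percent-encode the parameters onto the endpoint."""
    return endpoint + "?" + urlencode(query_params)


scrape_url = build_scrape_url(API_ENDPOINT, params)

# Decoding the query string again gives back the readable values.
decoded = dict(parse_qsl(urlsplit(scrape_url).query))

print(scrape_url)
```

The generated URL may escape a few characters differently from the link above (for example the parentheses in `text()`), but it decodes to the same parameter values. To experiment, change one XPath entry in `params` and rebuild the URL.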

mulysatest commented 7 years ago

@kba Thank you for your quick response. Is it possible to include the image in the content?

kba commented 7 years ago

Try changing the xpathDescription to e.g. .//img. Though I think they lazy-load the images with JavaScript, so the image path is in data-src rather than src, and browsers won't display it.

mulysatest commented 7 years ago

I already tried that with .//img[@class="swap-image lazy-loaded"] and got this error: "Could not scrape description, check your xpath".

kba commented 7 years ago

I think that class is added by the browser. rssscrpr will not execute any JavaScript. Look at the source code of the HTML page as it is delivered to your browser (ctrl-u instead of "inspect element", or curl <url> on the command line). The src attribute doesn't contain a reference to the real image but to some placeholder. I'm afraid this is not possible with pure XPath; you'd need some postprocessing step that sets the data-src attribute value as the src.
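One possible shape for such a postprocessing step, as a rough sketch only: it is not part of rssscrpr, and the function name `promote_data_src` is made up for illustration. It assumes the scraped description is available as an HTML string with double-quoted attribute values, and it uses regular expressions from the standard library rather than a real HTML parser, so it is only suitable for simple markup.

```python
import re

# Matches a whole <img ...> tag.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
# Captures the value of a double-quoted data-src attribute.
DATA_SRC = re.compile(r'\bdata-src\s*=\s*"([^"]*)"', re.IGNORECASE)
# Matches a plain src attribute, but not the "src" inside "data-src".
SRC = re.compile(r'(?<![\w-])src\s*=\s*"[^"]*"', re.IGNORECASE)


def promote_data_src(html):
    """Copy each image's data-src value into its src attribute."""

    def fix_tag(match):
        tag = match.group(0)
        data_src = DATA_SRC.search(tag)
        if data_src is None:
            return tag  # nothing to promote, leave the tag untouched
        new_src = 'src="{}"'.format(data_src.group(1))
        if SRC.search(tag):
            # Replace the placeholder src; the lambda avoids escape
            # processing of the replacement string.
            return SRC.sub(lambda _m: new_src, tag, count=1)
        # No src attribute yet: insert one right after "<img".
        return tag[:4] + " " + new_src + tag[4:]

    return IMG_TAG.sub(fix_tag, html)


sample = '<img class="swap-image" src="placeholder.gif" data-src="http://example.com/drone.jpg">'
print(promote_data_src(sample))
```

Tags without a data-src attribute are returned unchanged, so the function can be run over a whole description without affecting ordinary images.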