Scrapy middleware that asynchronously handles JavaScript pages using requests-html.
requests-html uses pyppeteer to load JavaScript pages and handles user-agent specification for you. Its API is simple and intuitive; see the requests-html documentation for details.
pip install scrapy-requests
Make Twisted use the asyncio event loop, and add RequestsMiddleware to the downloader middlewares:
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_requests.RequestsMiddleware': 800,
}
Use scrapy_requests.HtmlRequest instead of scrapy.Request
from scrapy_requests import HtmlRequest
yield HtmlRequest(url=url, callback=self.parse)
The request is handled by requests-html, and an additional meta variable, page, containing the HTML object is added to the request.
def parse(self, response):
    page = response.request.meta['page']
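As a minimal sketch of what a callback might do with that object: the helper below assumes meta['page'] behaves like a requests-html HTML object, which exposes find() and absolute_links. The function name parse_page and the returned dictionary layout are hypothetical and chosen only for illustration.

```python
def parse_page(response):
    """Extract the title and links from the requests-html HTML object.

    Assumes the middleware stored the HTML object under meta['page'].
    """
    page = response.request.meta['page']

    # find(..., first=True) returns a single element, or None if nothing matches.
    title = page.find('title', first=True)

    return {
        'title': title.text if title is not None else None,
        # absolute_links is a set, so sort it for a stable order.
        'links': sorted(page.absolute_links),
    }
```

Inside a spider, the same body could live in parse(self, response) and yield the dictionary as an item.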
If you would like the page to be rendered by pyppeteer, pass True to the render keyword parameter.
yield HtmlRequest(url=url, callback=self.parse, render=True)
You can also choose more specific behavior for the HTML object.
For example, you can add a sleep (in seconds) to the render step and execute a JavaScript snippet when the page loads:
script = "document.body.querySelector('.btn').click();"
yield HtmlRequest(url=url, callback=self.parse, render=True, options={'sleep': 2, 'script': script})
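The options dictionary can also be built in one place and reused across requests. The sketch below assumes that options is forwarded to the requests-html render call, whose keyword arguments include script, sleep, scrolldown and timeout; that forwarding is an assumption here, and the helper name render_options is hypothetical.

```python
def render_options(script=None, sleep=0, scrolldown=0, timeout=8.0):
    """Build an options dict for HtmlRequest(..., render=True, options=...).

    Keys mirror keyword arguments of requests-html's render(); whether
    the middleware passes every key through unchanged is an assumption.
    """
    options = {'sleep': sleep, 'timeout': timeout}
    if script:
        # JavaScript to execute once the page has loaded.
        options['script'] = script
    if scrolldown:
        # Number of times to page down while rendering.
        options['scrolldown'] = scrolldown
    return options
```

A request would then be created with HtmlRequest(url=url, callback=self.parse, render=True, options=render_options(script=script, sleep=2)).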
You can pass default settings to the requests-html session, such as headers, proxies, and auth settings.
Do this by specifying an additional variable in settings.py:
DEFAULT_SCRAPY_REQUESTS_SETTINGS = {
    'verify': False,  # Whether to verify SSL certificates
    'mock_browser': True,  # Mock a browser user-agent
    'browser_args': ['--no-sandbox', '--proxy-server=x.x.x.x:xxxx'],
}
Please star this repo if you found it useful.
Feel free to contribute and propose issues & additional features.
License is MIT.