Open dingld opened 7 years ago
Yeah, I absolutely agree with adding this field.
I'm not sure String(1024) is enough. The body is required to handle POST or PUT requests properly; this is not specific to scrapy-splash. Also, request bodies are binary, not strings, so something like LargeBinary (a blob) looks like a better fit.
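To illustrate the point about binary bodies, here is a minimal sketch that uses the standard-library sqlite3 module rather than the SQLAlchemy backend itself. The table layout and column names are assumptions for illustration only, not the actual QueueModel schema; the idea is just that a blob column round-trips a body that is neither valid UTF-8 nor shorter than 1024 bytes.

```python
import sqlite3

# Hypothetical queue table; the real QueueModel has different columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE queue ("
    "id INTEGER PRIMARY KEY, url TEXT NOT NULL, method TEXT NOT NULL, body BLOB)"
)

# A request body that is not valid UTF-8 and is longer than 1024 bytes.
body = b"\xff\xfe" + b"x" * 2048

conn.execute(
    "INSERT INTO queue (url, method, body) VALUES (?, ?, ?)",
    ("http://example.com/api", "POST", body),
)

# The blob comes back as bytes, unchanged.
stored = conn.execute("SELECT body FROM queue").fetchone()[0]
print(type(stored).__name__, len(stored), stored == body)
```

In SQLAlchemy terms the equivalent column type would be LargeBinary, as suggested above.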
As for the issue with scrapy_splash, I found a relatively simple solution. Just as frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler handles redirected requests, a request to SPLASH_URL can be kept in the pending queue rather than persisted to the backend. This is the current implementation:
def enqueue_request(self, request):
    if not self._request_is_redirected(request):
        self.frontier.add_seeds([request])
        self.stats_manager.add_seeds()
        return True
    elif self.redirect_enabled:
        self._add_pending_request(request)
        self.stats_manager.add_redirected_requests()
        return True
    return False
A possible solution would look like this:
def __init__(self, crawler, manager=None):
    self.settings = crawler.settings

def enqueue_request(self, request):
    # Scheduler support for Splash requests: keep them in the pending
    # queue instead of sending them to the backend.
    splash_url = self.settings.get('SPLASH_URL')
    if splash_url and splash_url in request.url:
        self._add_pending_request(request)
        self.logger.info('Recycle SplashRequest to pending queue')
        return True
    elif not self._request_is_redirected(request):
        self.frontier.add_seeds([request])
        self.stats_manager.add_seeds()
        return True
    elif self.redirect_enabled:
        self._add_pending_request(request)
        self.stats_manager.add_redirected_requests()
        return True
    return False
This avoids having to customize the sqlalchemy model and the fingerprint module. It seems to work fine on my PC (frontera-0.7.0, scrapy-1.2.2).
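The branching in the proposed enqueue_request can be checked in isolation with a small sketch like the one below. The function name route_request and its return labels are hypothetical, not part of Frontera. Note one deliberate difference: it uses startswith instead of the substring test, on the assumption that a Splash request URL always begins with SPLASH_URL, so that an ordinary URL that merely contains the Splash address in its query string is not misrouted.

```python
def route_request(url, splash_url, is_redirected, redirect_enabled):
    """Return where a request would go: 'pending', 'backend' or 'rejected'.

    Hypothetical helper that mirrors the branches of the proposed
    enqueue_request; it does not touch Frontera or Scrapy objects.
    """
    # Requests addressed to Splash stay in the in-memory pending queue.
    if splash_url and url.startswith(splash_url):
        return "pending"
    # Ordinary, non-redirected requests are handed to the frontier backend.
    if not is_redirected:
        return "backend"
    # Redirected requests are kept pending only if redirects are enabled.
    if redirect_enabled:
        return "pending"
    return "rejected"


splash = "http://localhost:8050"  # assumed value of SPLASH_URL
print(route_request("http://localhost:8050/render.html", splash, False, True))
print(route_request("http://example.com/?next=http://localhost:8050", splash, False, True))
print(route_request("http://example.com/", None, True, False))
```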
@dingld the only downside is that the pending queue lives in memory, so these requests will not survive a process restart, but for some applications this isn't necessary. For a general-purpose solution I would extend the SQLA backend with the fields needed. Would anyone like to make a PR?
Hi @sibiryakov
I have overridden the FronteraScheduler to adopt the changes suggested by @dingld and make my Splash requests work. However, I didn't understand your comment.
Would you take a moment to explain it, please?
Thanks.
The args for Splash are dumped as JSON in SplashMiddleware.
However, the QueueModel used by the sqlalchemy backend does not have a field for request.body.
One possible solution is to add a new field, along with the logic for storing the body and restoring it on the request. With that change, frontera on my PC seems to work well with scrapy_splash, except that the url/domain fingerprint middlewares (fprintmws) need replacing.
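As far as I understand, the fingerprint middlewares need replacing because every Splash request is sent to the same SPLASH_URL endpoint, so a fingerprint computed from the URL alone is identical for all of them. A minimal sketch of the idea, with hypothetical function names and an assumed endpoint, is to hash the method and body together with the URL:

```python
import hashlib
import json


def url_fingerprint(url):
    """URL-only fingerprint, similar in spirit to a URL fingerprint middleware."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def request_fingerprint(method, url, body=b""):
    """Hypothetical fingerprint that also covers the method and the body."""
    h = hashlib.sha1()
    for part in (method.encode("utf-8"), url.encode("utf-8"), body):
        h.update(part)
        h.update(b"\0")  # separator so adjacent fields cannot run together
    return h.hexdigest()


endpoint = "http://localhost:8050/render.html"  # assumed Splash endpoint
body_a = json.dumps({"url": "http://example.com/a"}, sort_keys=True).encode("utf-8")
body_b = json.dumps({"url": "http://example.com/b"}, sort_keys=True).encode("utf-8")

# Same endpoint, so the URL-only fingerprints collide.
print(url_fingerprint(endpoint) == url_fingerprint(endpoint))
# Including the body tells the two Splash requests apart.
print(request_fingerprint("POST", endpoint, body_a) == request_fingerprint("POST", endpoint, body_b))
```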