scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

SQLAlchemyBackend does not go with scrapy_splash #232

Open dingld opened 7 years ago

dingld commented 7 years ago

The args for Splash are serialized to JSON in SplashMiddleware and become the request body:

     body = json.dumps(args, ensure_ascii=False, sort_keys=True, indent=4)

However, the QueueModel used by the SQLAlchemy backend does not have a field for request.body, so the serialized args are lost.
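To illustrate the problem, here is a minimal sketch of the round trip: the Splash args only survive queueing if the JSON body is persisted alongside the request. The args dict below is hypothetical; the real one is assembled by SplashMiddleware.

```python
import json

# Hypothetical Splash args; the real dict is assembled by SplashMiddleware.
args = {'url': 'http://example.com', 'wait': 0.5}

# The same call SplashMiddleware makes to build the request body.
body = json.dumps(args, ensure_ascii=False, sort_keys=True, indent=4)

# The args can only be reconstructed if this body is stored with the
# request; the current queue model drops it.
restored = json.loads(body)
assert restored == args
```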

One possible solution is to add a new column, plus the logic for storing the body and restoring it onto the request. With that change, Frontera on my machine seems to work fine with scrapy_splash, except that the url/domain fingerprint functions need replacing.

class QueueModelMixin(object):
    __table_args__ = (
        {
            'mysql_charset': 'utf8',
            'mysql_engine': 'InnoDB',
            'mysql_row_format': 'DYNAMIC',
        },
    )

    id = Column(Integer, primary_key=True)
    partition_id = Column(Integer, index=True)
    score = Column(Float, index=True)
    url = Column(String(1024), nullable=False)
    fingerprint = Column(String(40), nullable=False)
    host_crc32 = Column(Integer, nullable=False)
    meta = Column(PickleType())
    # proposed new column: body = Column(String(1024))
    headers = Column(PickleType())
    cookies = Column(PickleType())
    method = Column(String(6))
    created_at = Column(BigInteger, index=True)
    depth = Column(SmallInteger)

sibiryakov commented 7 years ago

Yeah, I absolutely agree with adding this field.

kmike commented 7 years ago

I'm not sure String(1024) is enough. A body is required to handle POST or PUT requests properly; this is not specific to scrapy-splash. Also, request bodies are binary, not strings, so something like LargeBinary (a BLOB) looks like a better fit.
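A minimal sketch of that suggestion, using an illustrative table rather than Frontera's actual QueueModelMixin (the table and column names here are assumptions, not the real schema):

```python
from sqlalchemy import Column, Integer, String, LargeBinary, create_engine
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class QueueEntry(Base):
    """Illustrative queue row; not Frontera's real QueueModelMixin."""
    __tablename__ = 'queue'

    id = Column(Integer, primary_key=True)
    url = Column(String(1024), nullable=False)
    # LargeBinary maps to BLOB on most backends, so arbitrary binary
    # POST/PUT bodies round-trip without encoding or length issues.
    body = Column(LargeBinary())

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as session:
    payload = b'\x00\x01{"wait": 0.5}'  # binary body, not valid UTF-8
    session.add(QueueEntry(url='http://example.com', body=payload))
    session.commit()
    stored = session.query(QueueEntry).one()
    assert stored.body == payload
```

The same round trip would fail with a String column on backends that reject or mangle non-text bytes.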

dingld commented 7 years ago

As for the issue with scrapy_splash, I found a relatively simple solution. Just as frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler handles redirected requests, requests to SPLASH_URL can be kept in the in-memory pending queue rather than persisted to the backend:

    def enqueue_request(self, request):
        if not self._request_is_redirected(request):
            self.frontier.add_seeds([request])
            self.stats_manager.add_seeds()
            return True
        elif self.redirect_enabled:
            self._add_pending_request(request)
            self.stats_manager.add_redirected_requests()
            return True
        return False

A possible solution would look like this:

    def __init__(self, crawler, manager=None):
        # Call the base FronteraScheduler initializer (signature assumed
        # to match), then keep a reference to the crawler settings.
        super(FronteraScheduler, self).__init__(crawler, manager)
        self.settings = crawler.settings

    def enqueue_request(self, request):
        # Keep Splash requests in the in-memory pending queue so they
        # are never sent to the backend.
        splash_url = self.settings.get('SPLASH_URL')
        if splash_url and splash_url in request.url:
            self._add_pending_request(request)
            self.logger.info('Recycled SplashRequest to pending queue')
            return True
        elif not self._request_is_redirected(request):
            self.frontier.add_seeds([request])
            self.stats_manager.add_seeds()
            return True
        elif self.redirect_enabled:
            self._add_pending_request(request)
            self.stats_manager.add_redirected_requests()
            return True
        return False

This saves the work of customizing the SQLAlchemy model and the fingerprint module. It seems to work fine on my machine (frontera 0.7.0, scrapy 1.2.2).

sibiryakov commented 7 years ago

@dingld the only con is that pending requests will not survive a process restart, but for some applications that isn't necessary. For a general-purpose solution I would extend the SQLAlchemy backend with the needed fields. Would anyone like to make a PR?

MuhammadRahman-awin commented 7 years ago

Hi @sibiryakov, I have overridden FronteraScheduler to apply the changes suggested by @dingld and get my Splash requests working. However, I didn't understand your comment. Would you take a moment to explain it, please?

Thanks.