scrapinghub / scrapyrt

HTTP API for Scrapy spiders
BSD 3-Clause "New" or "Revised" License

Use another reactor instead of default `EPollReactor` #132

Closed: avlm closed this issue 2 years ago

avlm commented 2 years ago

Hello! I've been using scrapyrt for about a month now and it works great, but today I added a new spider to the project that uses scrapy-playwright.

For this to work, I had to change the default Scrapy reactor to twisted.internet.asyncioreactor.AsyncioSelectorReactor, but when I run the scrapyrt server and make a request to run this spider, it fails with this error:

builtins.Exception: The installed reactor (twisted.internet.epollreactor.EPollReactor) does not match the requested one (twisted.internet.asyncioreactor.AsyncioSelectorReactor)

Is it possible to make scrapyrt use the reactor specified in the scrapy settings?
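For context, the reactor change described above usually amounts to a few lines in the project's settings.py. The sketch below uses the values given in the scrapy-playwright README; that it matches this particular project's settings file is an assumption.

```python
# settings.py (sketch): values as given in the scrapy-playwright README.

# Route HTTP and HTTPS downloads through the Playwright download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# Ask Scrapy to run on Twisted's asyncio-based reactor instead of the default.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

The error above appears when a different reactor has already been installed in the process by the time Scrapy checks this setting.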

[EDIT] Sorry, I didn't pay enough attention to the README, so I'm opening a new issue with the right label.

Old issue: #131

avlm commented 2 years ago

Update here. I think I found the problem.

As documented in the Scrapy docs section on installing the asyncio reactor, when using CrawlerRunner the asyncio reactor has to be installed manually by calling install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor').

So I went into the scrapyrt code and added a try/except block to scrapyrt/core.py:

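from scrapy.crawler import CrawlerRunner
from scrapy.utils.reactor import install_reactor, verify_installed_reactor


class ScrapyrtCrawlerProcess(CrawlerRunner):

    def __init__(self, settings, scrapyrt_manager):
        super(ScrapyrtCrawlerProcess, self).__init__(settings)
        # Scrapy settings are read with get(), not attribute access.
        reactor_path = settings.get("TWISTED_REACTOR")
        try:
            # Raises if the installed reactor differs from the requested one.
            verify_installed_reactor(reactor_path)
        except Exception:
            # Try to install the requested reactor instead.
            install_reactor(reactor_path)
        self.scrapyrt_manager = scrapyrt_manager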

verify_installed_reactor() raises the exception I'm getting when I make a request to my scrapyrt server. But this code is probably not in the right place: I tested it with quotesbot and got the same error.
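One way to check whether a given spot in the code is already too late is to look at which reactor, if any, the process has installed. The sketch below relies on the fact that Twisted registers the installed reactor under the module name twisted.internet.reactor in sys.modules. The helper names installed_reactor_path and reactor_matches are hypothetical and are not part of Scrapy, Twisted, or scrapyrt.

```python
import sys
from typing import Optional


def installed_reactor_path() -> Optional[str]:
    """Return the dotted class path of the installed Twisted reactor, or None.

    Twisted stores the installed reactor in sys.modules under the name
    "twisted.internet.reactor", so finding that key means a reactor has
    already been chosen for this process.
    """
    reactor = sys.modules.get("twisted.internet.reactor")
    if reactor is None:
        return None
    cls = type(reactor)
    return f"{cls.__module__}.{cls.__name__}"


def reactor_matches(requested: str) -> Optional[bool]:
    """Return True/False if a reactor is installed, None if none is yet."""
    installed = installed_reactor_path()
    if installed is None:
        return None
    return installed == requested


if __name__ == "__main__":
    # Hypothetical debugging aid: print what is installed at this point.
    print("installed reactor:", installed_reactor_path())
```

If this reports a reactor other than the requested one, the requested reactor can no longer be installed at that point, which would be consistent with the error persisting.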

I'll keep trying, appreciate any help. Thanks.

waxz commented 2 years ago

Paste this at the top of cmdline.py:

# -*- coding: utf-8 -*-
from configparser import (
    ConfigParser, NoOptionError, NoSectionError
)
import argparse
import os
import sys

# Install the asyncio reactor first: `twisted.internet.reactor` must not have been imported at this point.
# https://docs.scrapy.org/en/latest/_modules/scrapy/utils/reactor.html?highlight=asyncio%20reactor#
from scrapy.utils.reactor import (
    install_reactor, is_asyncio_reactor_installed, verify_installed_reactor
)

REACTOR_PATH = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
install_reactor(REACTOR_PATH)
verify_installed_reactor(REACTOR_PATH)
print(f"Is asyncio reactor installed: {is_asyncio_reactor_installed()}")

# Import the reactor only after the asyncio reactor has been installed.
from twisted.internet import reactor

Reference: https://stackoverflow.com/questions/61997049/apschedulerscrapyasyncio-cant-execute-first-task-smoothly

If you want to use Playwright screenshots, you need to send a POST request like this (note that the JSON body must not contain a trailing comma):

curl localhost:9081/crawl.json -d '{"request": {"url": "http://www.google.com/", "meta": {"playwright": true, "playwright_context": "new", "playwright_include_page": true}}, "spider_name": "playwright"}'
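The same request can be built in Python, which avoids hand-written JSON mistakes because json.dumps never emits a trailing comma. This is a minimal sketch using only the standard library. The helper names build_crawl_payload and post_crawl are hypothetical, the endpoint and spider name are taken from the curl command above, and using JSON booleans for the Playwright flags and an application/json content type are assumptions.

```python
import json
import urllib.request


def build_crawl_payload(url: str, spider_name: str) -> bytes:
    """Serialize a crawl.json request body with Playwright meta keys."""
    payload = {
        "request": {
            "url": url,
            "meta": {
                "playwright": True,
                "playwright_context": "new",
                "playwright_include_page": True,
            },
        },
        "spider_name": spider_name,
    }
    return json.dumps(payload).encode("utf-8")


def post_crawl(endpoint: str, body: bytes, timeout: float = 60.0) -> dict:
    """POST the body to the given endpoint and decode the JSON response."""
    request = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a scrapyrt server listening on port 9081, as in the curl call.
    body = build_crawl_payload("http://www.google.com/", "playwright")
    print(post_crawl("http://localhost:9081/crawl.json", body))
```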

pawelmhm commented 2 years ago

Hey @avlm, thanks for posting this information. We haven't tried using ScrapyRT with other reactors. By default it uses the default Twisted reactor to run the Twisted web server, and Scrapy also uses the default Twisted reactor. To override this, we'd have to adjust it somewhere here, very early in startup: https://github.com/scrapinghub/scrapyrt/blob/a3bf17f02297215a7fc5766f1f7e1b24d165562c/scrapyrt/cmdline.py#L93

Scrapy is moving toward allowing different reactors, and I think ScrapyRT should allow it too. We would have to review what needs to be updated and respect the TWISTED_REACTOR setting described here: https://github.com/scrapy-plugins/scrapy-playwright#configuration. We should probably use the reactor specified in the user's settings instead of just assuming the default.

I added the "help wanted" tag to invite others to do more research into this and post their findings here. I'll do my own research too and post it later. Supporting different reactors is definitely something that would be a welcome feature in future releases.
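As a rough illustration of the idea of respecting the user's setting, the sketch below reads TWISTED_REACTOR from a settings mapping and installs that reactor before anything imports twisted.internet.reactor. It is not necessarily how ScrapyRT implements this. The function names resolve_reactor_path and install_configured_reactor are hypothetical, and only install_reactor from scrapy.utils.reactor is an existing Scrapy function.

```python
from typing import Any, Mapping, Optional


def resolve_reactor_path(settings: Mapping[str, Any]) -> Optional[str]:
    """Return the configured reactor path, or None to keep the default."""
    value = settings.get("TWISTED_REACTOR")
    return value or None


def install_configured_reactor(settings: Mapping[str, Any]) -> None:
    """Install the reactor named in settings, if one is configured.

    This must run before any `from twisted.internet import reactor`,
    because that import installs the default reactor as a side effect.
    """
    reactor_path = resolve_reactor_path(settings)
    if reactor_path is None:
        return  # Nothing configured: let Twisted pick its default later.
    # Imported lazily so that resolving the setting does not require Scrapy.
    from scrapy.utils.reactor import install_reactor
    install_reactor(reactor_path)
```

A startup script could call install_configured_reactor(get_project_settings()) as its first step, assuming the project settings are loadable at that point.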

pawelmhm commented 2 years ago

Fixed by https://github.com/scrapinghub/scrapyrt/pull/135