scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Scrapyd does not support spiders that use AsyncioSelectorReactor #377

Closed · sseveran closed this issue 2 years ago

sseveran commented 4 years ago

Currently, scrapyd does not support spiders that use asyncio coroutines. When you upload the spider to scrapyd, it fails with the following error. I didn't see a way to override the Twisted reactor implementation in scrapyd.

scrapyd_1        | 2020-05-16T13:56:19+0000 [_GenericHTTPChannelProtocol,0,172.18.0.1] Unhandled Error
scrapyd_1        |  Traceback (most recent call last):
scrapyd_1        |    File "/usr/local/lib/python3.6/dist-packages/twisted/web/http.py", line 2284, in allContentReceived
scrapyd_1        |      req.requestReceived(command, path, version)
scrapyd_1        |    File "/usr/local/lib/python3.6/dist-packages/twisted/web/http.py", line 946, in requestReceived
scrapyd_1        |      self.process()
scrapyd_1        |    File "/usr/local/lib/python3.6/dist-packages/twisted/web/server.py", line 235, in process
scrapyd_1        |      self.render(resrc)
scrapyd_1        |    File "/usr/local/lib/python3.6/dist-packages/twisted/web/server.py", line 302, in render
scrapyd_1        |      body = resrc.render(self)
scrapyd_1        |  --- <exception caught here> ---
scrapyd_1        |    File "/usr/local/lib/python3.6/dist-packages/scrapyd/webservice.py", line 21, in render
scrapyd_1        |      return JsonResource.render(self, txrequest).encode('utf-8')
scrapyd_1        |    File "/usr/local/lib/python3.6/dist-packages/scrapyd/utils.py", line 20, in render
scrapyd_1        |      r = resource.Resource.render(self, txrequest)
scrapyd_1        |    File "/usr/local/lib/python3.6/dist-packages/twisted/web/resource.py", line 265, in render
scrapyd_1        |      return m(request)
scrapyd_1        |    File "/usr/local/lib/python3.6/dist-packages/scrapyd/webservice.py", line 88, in render_POST
scrapyd_1        |      spiders = get_spider_list(project, version=version)
scrapyd_1        |    File "/usr/local/lib/python3.6/dist-packages/scrapyd/utils.py", line 134, in get_spider_list
scrapyd_1        |      raise RuntimeError(msg.encode('unicode_escape') if six.PY2 else msg)
scrapyd_1        |  builtins.RuntimeError: /usr/local/lib/python3.6/dist-packages/scrapy/utils/project.py:94: ScrapyDeprecationWarning: Use of environment variables prefixed with SCRAPY_ to override settings is deprecated. The following environment variables are currently defined: EGG_VERSION
scrapyd_1        |    ScrapyDeprecationWarning
scrapyd_1        |  Traceback (most recent call last):
scrapyd_1        |    File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
scrapyd_1        |      "__main__", mod_spec)
scrapyd_1        |    File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
scrapyd_1        |      exec(code, run_globals)
scrapyd_1        |    File "/usr/local/lib/python3.6/dist-packages/scrapyd/runner.py", line 40, in <module>
scrapyd_1        |      main()
scrapyd_1        |    File "/usr/local/lib/python3.6/dist-packages/scrapyd/runner.py", line 37, in main
scrapyd_1        |      execute()
scrapyd_1        |    File "/usr/local/lib/python3.6/dist-packages/scrapy/cmdline.py", line 144, in execute
scrapyd_1        |      cmd.crawler_process = CrawlerProcess(settings)
scrapyd_1        |    File "/usr/local/lib/python3.6/dist-packages/scrapy/crawler.py", line 265, in __init__
scrapyd_1        |      super(CrawlerProcess, self).__init__(settings)
scrapyd_1        |    File "/usr/local/lib/python3.6/dist-packages/scrapy/crawler.py", line 141, in __init__
scrapyd_1        |      self._handle_twisted_reactor()
scrapyd_1        |    File "/usr/local/lib/python3.6/dist-packages/scrapy/crawler.py", line 329, in _handle_twisted_reactor
scrapyd_1        |      super()._handle_twisted_reactor()
scrapyd_1        |    File "/usr/local/lib/python3.6/dist-packages/scrapy/crawler.py", line 237, in _handle_twisted_reactor
scrapyd_1        |      verify_installed_reactor(self.settings["TWISTED_REACTOR"])
scrapyd_1        |    File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/reactor.py", line 77, in verify_installed_reactor
scrapyd_1        |      raise Exception(msg)
scrapyd_1        |  Exception: The installed reactor (twisted.internet.epollreactor.EPollReactor) does not match the requested one (twisted.internet.asyncioreactor.AsyncioSelectorReactor)

The twisted application runner logs the default reactor for the platform when it starts up, so I think we would need a way to load a reactor before calling run. However, I am not a Twisted expert, and that is just my guess from having stepped through the code.

scrapyd_1        | 2020-05-16T13:56:10+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 20.3.0 (/usr/bin/python3 3.6.9) starting up.
scrapyd_1        | 2020-05-16T13:56:10+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
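
For reference, the spiders affected are those whose project opts into the asyncio reactor via Scrapy's TWISTED_REACTOR setting, e.g.:

```python
# settings.py of the affected project (illustrative). Requesting the asyncio
# reactor is what the runner subprocess later fails to verify, because the
# default EPollReactor is already installed by then (see the traceback above).
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```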
inakrin commented 4 years ago

Is there a workaround for it?

sseveran commented 4 years ago

I didn't find one. I built my own solution with docker, cron and some notebooks.

inakrin commented 4 years ago

@sseveran I've just found one. In the scrapyd package used by your instance, edit the file runner.py (for me the path was /opt/virtualenv/lib/python3.8/site-packages/scrapyd/runner.py). Just below all the existing imports, add this code:

from scrapy.utils.reactor import install_reactor
install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')

I'm not yet sure how the scrapers behave with that change, but so far I was able to deploy them, launch them and scrape a few items.

namiwa commented 3 years ago

@inakrin to clarify, must this be done within the libraries source code, or can a custom runner be specified from scrapyd's default config:

[screenshot: scrapyd's default configuration file]
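
The setting in question, as it appears in scrapyd.conf; a minimal sketch, with myproject.runner as a placeholder module path:

```ini
[scrapyd]
# Point scrapyd at a custom runner module instead of the stock scrapyd.runner.
# "myproject.runner" is illustrative; use whatever importable module you create.
runner = myproject.runner
```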

inakrin commented 3 years ago

@namiwa I wasn't aware of this setting so I haven't tried. But now I believe that using this setting is much better than editing the source code of the library. Thank you for the hint!

VitalyVen commented 3 years ago

A runner-based workaround implementation: https://github.com/VitalyVen/scrapy-cookiecutter/commit/9ab0105bb3355a967f0a27012a8dd14d08928e77.
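
For anyone not following the link, a minimal sketch of this kind of custom runner module (module and function names are illustrative, not necessarily what the linked commit uses) installs the asyncio reactor and then delegates to scrapyd's stock runner:

```python
# myproject/runner.py -- "myproject.runner" is an illustrative module path;
# point scrapyd.conf's runner option at whatever module you actually create.
from scrapy.utils.reactor import install_reactor
from scrapyd.runner import main


def run():
    # Install the asyncio reactor before Scrapy's CrawlerProcess verifies
    # TWISTED_REACTOR, mirroring the runner.py patch described above.
    install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")
    # Delegate everything else to scrapyd's stock runner.
    main()


if __name__ == "__main__":
    run()
```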

namiwa commented 3 years ago

@inakrin happy to help! And thanks @VitalyVen for the possible implementation.

Upon further inspection, it seems that scrapyd has a reactor flag at launch, with the following options:

[screenshot: the available reactor options for scrapyd's launch flag]

Perhaps a complete approach would be to run scrapyd --reactor=asyncio, along with the scrapyd.runner override.

sudoexec commented 3 years ago

> Perhaps a complete approach would be to run scrapyd --reactor=asyncio, along with the scrapyd.runner override.

It doesn't work. Although the log shows that it uses twisted.internet.asyncioreactor.AsyncioSelectorReactor, the same error occurs when you request the /schedule endpoint. Does anyone have any solutions?

namiwa commented 3 years ago

Hi @u23a, try running my fork at https://github.com/namiwa/scrapyd-authenticated, which has a simple example AsyncIO-reactor-based spider with docker-compose.

Hope the above implementation helps!
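
For anyone skimming, a minimal sketch of the kind of asyncio-based spider involved (not the exact spider from the fork; names and the URL are placeholders), assuming the project settings request the asyncio reactor as shown earlier in the thread:

```python
# spiders/async_demo.py -- illustrative only. Assumes the project's settings.py
# sets TWISTED_REACTOR to the asyncio reactor.
import scrapy


class AsyncDemoSpider(scrapy.Spider):
    name = "async_demo"
    start_urls = ["https://example.com"]

    async def parse(self, response):
        # Coroutine callback; with the asyncio reactor installed it can also
        # await asyncio-based code.
        yield {"title": response.css("title::text").get()}
```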

sudoexec commented 3 years ago

@namiwa Thanks for your reply, I'll try it.