open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk
https://kingfisher-collect.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Scrapy 2.6.1 bugs #918

Closed yolile closed 2 years ago

yolile commented 2 years ago

This bug also affects us: https://github.com/scrapy/scrapy/issues/5437. For example, when running a spider in sample mode:

scrapy crawl uruguay_releases -a sample=1

The data is downloaded, but an exception is logged when the spider closes:

2022-04-05 11:06:00 [scrapy.core.engine] INFO: Spider closed (sample)
2022-04-05 11:06:00 [scrapy.core.engine] INFO: Error while scheduling new request
Traceback (most recent call last):
  File "kingfisher-collect/.ve/lib/python3.8/site-packages/twisted/internet/task.py", line 526, in _oneWorkUnit
    result = next(self._iterator)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "kingfisher-collect/.ve/lib/python3.8/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "kingfisher-collect/.ve/lib/python3.8/site-packages/scrapy/core/engine.py", line 187, in <lambda>
    d.addBoth(lambda _: self.slot.nextcall.schedule())
AttributeError: 'NoneType' object has no attribute 'nextcall'
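
The traceback shows a deferred callback dereferencing `self.slot` after it has become `None`. The sketch below reproduces that pattern in plain Python, without Twisted or Scrapy. It is an illustration only, not Scrapy's code: the `Engine`, `Slot`, and `NextCall` classes and their methods are hypothetical stand-ins, and the assumption that closing the spider is what clears the slot is inferred from the "Spider closed" log line.

```python
class NextCall:
    """Hypothetical stand-in for the object that schedules the next request."""

    def __init__(self):
        self.count = 0

    def schedule(self):
        self.count += 1


class Slot:
    """Hypothetical stand-in for the engine's per-spider slot."""

    def __init__(self):
        self.nextcall = NextCall()


class Engine:
    """Toy engine: callbacks are queued now and run later."""

    def __init__(self):
        self.slot = Slot()
        self.callbacks = []

    def request_finished_later(self):
        # The lambda looks up self.slot only when it runs, not when it is queued.
        self.callbacks.append(lambda: self.slot.nextcall.schedule())

    def close_spider(self):
        # Assumption: closing the spider clears the slot.
        self.slot = None

    def fire_callbacks(self):
        # Run queued callbacks and collect AttributeError messages instead of raising.
        errors = []
        for callback in self.callbacks:
            try:
                callback()
            except AttributeError as exc:
                errors.append(str(exc))
        self.callbacks.clear()
        return errors


if __name__ == "__main__":
    engine = Engine()
    engine.request_finished_later()
    engine.close_spider()  # The slot is cleared before the callback runs.
    print(engine.fire_callbacks())
```

If the callbacks fire before `close_spider()` is called, no error is collected, which would be consistent with the exception appearing only at shutdown.
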

jpmckinney commented 2 years ago

I think we can wait for 2.6.2, right?

yolile commented 2 years ago

We are also affected by another bug, https://github.com/scrapy/scrapy/issues/5435. For example:

scrapy pluck --release-pointer /date

Traceback (most recent call last):
  File "kingfisher-collect/.ve/bin/scrapy", line 8, in <module>
    sys.exit(execute())
  File "kingfisher-collect/.ve/lib/python3.6/site-packages/scrapy/cmdline.py", line 145, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "kingfisher-collect/.ve/lib/python3.6/site-packages/scrapy/cmdline.py", line 100, in _run_print_help
    func(*a, **kw)
  File "kingfisher-collect/.ve/lib/python3.6/site-packages/scrapy/cmdline.py", line 153, in _run_command
    cmd.run(args, opts)
  File "kingfisher-collect/kingfisher_scrapy/commands/pluck.py", line 59, in run
    release_pointer=opts.release_pointer, truncate=opts.truncate)
  File "kingfisher-collect/.ve/lib/python3.6/site-packages/scrapy/crawler.py", line 205, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "kingfisher-collect/.ve/lib/python3.6/site-packages/scrapy/crawler.py", line 238, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "kingfisher-collect/.ve/lib/python3.6/site-packages/scrapy/crawler.py", line 313, in _create_crawler
    return Crawler(spidercls, self.settings, init_reactor=True)
  File "kingfisher-collect/.ve/lib/python3.6/site-packages/scrapy/crawler.py", line 82, in __init__
    default.install()
  File "kingfisher-collect/.ve/lib/python3.6/site-packages/twisted/internet/epollreactor.py", line 246, in install
    installReactor(p)
  File "kingfisher-collect/.ve/lib/python3.6/site-packages/twisted/internet/main.py", line 32, in installReactor
    raise error.ReactorAlreadyInstalledError("reactor already installed")
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

yolile commented 2 years ago

> I think we can wait for 2.6.2, right?

Since I only run that command locally, I guess I can also change the dependency locally and wait until Scrapy 2.6.2 is released.

jpmckinney commented 2 years ago

https://github.com/scrapy/scrapy/issues/5496

yolile commented 2 years ago

Hmm, we are now getting an error from Scrapyd on the Registry server:

2022-05-06T14:20:07+0000 [-] Process started:  project='kingfisher' spider='chile_compra_api_releases' job='9efca674cd4711ecbf91a8a159689b50' pid=4270 log='/home/collect/scrapyd/logs/kingfisher/chile_compra_api_releases/9efca674cd4711ecbf91a8a159689b50.log' items=None
2022-05-06T14:20:07+0000 [Launcher,4270/stderr] /home/collect/scrapyd/.ve/lib/python3.8/site-packages/scrapy/utils/project.py:81: ScrapyDeprecationWarning: Use of environment variables prefixed with SCRAPY_ to override settings is deprecated. The following environment variables are currently defined: JOB, LOG_FILE, SLOT, SPIDER
      warnings.warn(
2022-05-06T14:20:07+0000 [Launcher,4270/stderr] Traceback (most recent call last):
      File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/collect/scrapyd/.ve/lib/python3.8/site-packages/scrapyd/runner.py", line 40, in <module>
        main()
      File "/home/collect/scrapyd/.ve/lib/python3.8/site-packages/scrapyd/runner.py", line 37, in main
        execute()
      File "/home/collect/scrapyd/.ve/lib/python3.8/site-packages/scrapy/cmdline.py", line 145, in execute
        _run_print_help(parser, _run_command, cmd, args, opts)
      File "/home/collect/scrapyd/.ve/lib/python3.8/site-packages/scrapy/cmdline.py", line 100, in _run_print_help
        func(*a, **kw)
      File "/home/collect/scrapyd/.ve/lib/python3.8/site-packages/scrapy/cmdline.py", line 153, in _run_command
        cmd.run(args, opts)
      File "/home/collect/scrapyd/.ve/lib/python3.8/site-packages/scrapy/commands/crawl.py", line 22, in run
        crawl_defer = self.crawler_process.crawl(spname, **opts.spargs)
      File "/home/collect/scrapyd/.ve/lib/python3.8/site-packages/scrapy/crawler.py", line 205, in crawl
        crawler = self.create_crawler(crawler_or_spidercls)
      File "/home/collect/scrapyd/.ve/lib/python3.8/site-packages/scrapy/crawler.py", line 238, in create_crawler
        return self._create_crawler(crawler_or_spidercls)
      File "/home/collect/scrapyd/.ve/lib/python3.8/site-packages/scrapy/crawler.py", line 313, in _create_crawler
        return Crawler(spidercls, self.settings, init_reactor=True)
      File "/home/collect/scrapyd/.ve/lib/python3.8/site-packages/scrapy/crawler.py", line 82, in __init__
        default.install()
      File "/home/collect/scrapyd/.ve/lib/python3.8/site-packages/twisted/internet/epollreactor.py", line 256, in install
        installReactor(p)
      File "/home/collect/scrapyd/.ve/lib/python3.8/site-packages/twisted/internet/main.py", line 32, in installReactor
        raise error.ReactorAlreadyInstalledError("reactor already installed")
    twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed
2022-05-06T14:20:07+0000 [-] Process died: exitstatus=1  project='kingfisher' spider='chile_compra_api_releases' job='9efca674cd4711ecbf91a8a159689b50' pid=4270 log='/home/collect/scrapyd/logs/kingfisher/chile_compra_api_releases/9efca674cd4711ecbf91a8a159689b50.log' items=None

yolile commented 2 years ago

On the Kingfisher server, Scrapyd is working well. Both servers use Scrapy 2.6.1; the difference is that Kingfisher uses Twisted 20.3.0 and the Registry uses Twisted 22.4.0 (the latest one).
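
To compare the two servers, a small script like the one below could be run inside each virtual environment. It is a minimal sketch using the standard library's `importlib.metadata` (Python 3.8+). The function name `installed_versions` is made up for this example, and the package list is an assumption based on the packages mentioned in this thread.

```python
from importlib import metadata


def installed_versions(packages=("Scrapy", "Twisted", "scrapyd")):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Diffing the output from both servers should show at a glance whether only Twisted differs.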

jpmckinney commented 2 years ago

Let's update the Registry's deployment then?

yolile commented 2 years ago

Do you mean as a temporary fix? Otherwise, we will need to change the requirements file each time we deploy the Registry (or Kingfisher).

jpmckinney commented 2 years ago

I thought you were saying the problem is that Twisted is old on the Registry. If so, we should update it (and use the same versions on both servers).

If you mean we need to downgrade Scrapy, then downgrade it on both servers until 2.6.2 is released.

yolile commented 2 years ago

> In the kingfisher server, scrapyd is working well, both are using scrapy 2.6.1, the difference is that kingfisher is using Twisted 20.3.0, and the registry Twisted 22.4.0 (the latest one)

No, the Registry is using the latest version (22.4.0) and Kingfisher an older one (20.3.0).

jpmckinney commented 2 years ago

Okay, can't we get them to both use the same one?

I upgraded to 22.4.0 in c97d8cbe6 because there was a security warning.

jpmckinney commented 2 years ago

Okay, so the twisted.internet.error.ReactorAlreadyInstalledError error is not related to Twisted, but to Scrapy 2.6.x.

We can either downgrade Scrapy to 2.5 (as I suggested), or we can use the HEAD from GitHub.
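
A quick way to tell whether a given environment still needs one of these workarounds is to check its Scrapy version against the affected range. This is only a sketch: the helper names `parse_release` and `needs_workaround` are made up, and the range (2.6.0 up to, but not including, 2.6.2) is an assumption based on this thread, not on Scrapy's release notes.

```python
def parse_release(version_text):
    """Return the leading numeric release segment, padded to three parts."""
    parts = []
    for piece in version_text.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
        if len(digits) != len(piece):
            break  # Stop at suffixes such as "1rc1".
    return tuple((parts + [0, 0, 0])[:3])


def needs_workaround(scrapy_version):
    """Return True if the version falls in the assumed affected range."""
    # Assumption: 2.6.0 and 2.6.1 are affected, and 2.6.2 will contain the fix.
    return (2, 6, 0) <= parse_release(scrapy_version) < (2, 6, 2)


if __name__ == "__main__":
    for candidate in ("2.5.1", "2.6.1", "2.6.2"):
        print(candidate, needs_workaround(candidate))
```

The version string can come from `scrapy.__version__` or from `importlib.metadata.version("Scrapy")` in the environment being checked.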

jpmckinney commented 2 years ago

Noting that we also needed to upgrade Scrapyd to 1.3.0.