scrapinghub / scrapy-poet

Page Object pattern for Scrapy
BSD 3-Clause "New" or "Revised" License

Error when running scrapy shell #92

Closed — Pemh closed this issue 1 year ago

Pemh commented 1 year ago

Running `scrapy shell` raises an error because no spider is attached to the crawler when the `Injector` instance is created (scrapy-poet v0.6.0).

$ scrapy shell "http://httpbin.org/anything?json"

2022-11-13 18:31:52 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: prospect_scraper)
2022-11-13 18:31:52 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.7.0, w3lib 2.0.1, Twisted 22.10.0, Python 3.10.5 (main, Jul 21 2022, 15:29:15) [GCC 9.4.0], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.3, Platform Linux-5.4.0-131-generic-x86_64-with-glibc2.31
2022-11-13 18:31:52 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'BOT_NAME': 'prospect_scraper',
 'CLOSESPIDER_PAGECOUNT': 7,
 'CONCURRENT_REQUESTS': 1,
 'DOWNLOAD_DELAY': 3,
 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'HTTPCACHE_ENABLED': True,
 'LOGSTATS_INTERVAL': 0,
 'NEWSPIDER_MODULE': 'prospect_scraper.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'SPIDER_MODULES': ['prospect_scraper.spiders']}
2022-11-13 18:31:52 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-11-13 18:31:52 [scrapy.extensions.telnet] INFO: Telnet Password: rm41x79c5cdfg82b87
2022-11-13 18:31:52 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.throttle.AutoThrottle',
 'scrapy.extensions.closespider.CloseSpider']
2022-11-13 18:31:52 [scrapy_poet.overrides] DEBUG: List of parsed OverrideRules:
[]
Traceback (most recent call last):
  File "/home/pemh/projects/prospect-scraper/.venv/bin/scrapy", line 8, in <module>
    sys.exit(execute())
  File "/home/pemh/projects/prospect-scraper/.venv/lib/python3.10/site-packages/scrapy/cmdline.py", line 154, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/home/pemh/projects/prospect-scraper/.venv/lib/python3.10/site-packages/scrapy/cmdline.py", line 109, in _run_print_help
    func(*a, **kw)
  File "/home/pemh/projects/prospect-scraper/.venv/lib/python3.10/site-packages/scrapy/cmdline.py", line 162, in _run_command
    cmd.run(args, opts)
  File "/home/pemh/projects/prospect-scraper/.venv/lib/python3.10/site-packages/scrapy/commands/shell.py", line 68, in run
    crawler.engine = crawler._create_engine()
  File "/home/pemh/projects/prospect-scraper/.venv/lib/python3.10/site-packages/scrapy/crawler.py", line 130, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/home/pemh/projects/prospect-scraper/.venv/lib/python3.10/site-packages/scrapy/core/engine.py", line 83, in __init__
    self.downloader = downloader_cls(crawler)
  File "/home/pemh/projects/prospect-scraper/.venv/lib/python3.10/site-packages/scrapy/core/downloader/__init__.py", line 83, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "/home/pemh/projects/prospect-scraper/.venv/lib/python3.10/site-packages/scrapy/middleware.py", line 60, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/home/pemh/projects/prospect-scraper/.venv/lib/python3.10/site-packages/scrapy/middleware.py", line 42, in from_settings
    mw = create_instance(mwcls, settings, crawler)
  File "/home/pemh/projects/prospect-scraper/.venv/lib/python3.10/site-packages/scrapy/utils/misc.py", line 167, in create_instance
    instance = objcls.from_crawler(crawler, *args, **kwargs)
  File "/home/pemh/projects/prospect-scraper/.venv/lib/python3.10/site-packages/scrapy_poet/downloadermiddlewares.py", line 62, in from_crawler
    o = cls(crawler)
  File "/home/pemh/projects/prospect-scraper/.venv/lib/python3.10/site-packages/scrapy_poet/downloadermiddlewares.py", line 52, in __init__
    self.injector = Injector(
  File "/home/pemh/projects/prospect-scraper/.venv/lib/python3.10/site-packages/scrapy_poet/injection.py", line 50, in __init__
    self.load_providers(default_providers)
  File "/home/pemh/projects/prospect-scraper/.venv/lib/python3.10/site-packages/scrapy_poet/injection.py", line 56, in load_providers
    **self.spider.settings.getdict("SCRAPY_POET_PROVIDERS"),
AttributeError: 'NoneType' object has no attribute 'settings'

The Injector class:

class Injector:
    """
    Keep all the logic required to do dependency injection in Scrapy callbacks.
    Initializes the providers from the spider settings at initialization.
    """

    def __init__(
        self,
        crawler: Crawler,
        *,
        default_providers: Optional[Mapping] = None,
        overrides_registry: Optional[OverridesRegistryBase] = None,
    ):
        self.crawler = crawler
        self.spider = crawler.spider   # the value is None when running scrapy shell
        self.overrides_registry = overrides_registry or OverridesRegistry()
        self.load_providers(default_providers)
        self.init_cache()

    def load_providers(self, default_providers: Optional[Mapping] = None):  # noqa: D102
        providers_dict = {
            **(default_providers or {}),
            **self.spider.settings.getdict("SCRAPY_POET_PROVIDERS"),  # attribute error because self.spider is None.
        }
    ...
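To illustrate the failure mode and one defensive way around it, here is a minimal sketch in plain Python (no Scrapy dependency). It assumes the fix is to read `SCRAPY_POET_PROVIDERS` from `crawler.settings`, which exists even when `crawler.spider` is still `None`, as it is under `scrapy shell`. The `FakeSettings` and `FakeCrawler` classes are hypothetical stand-ins for the real Scrapy objects, and this is only a sketch of the idea, not the actual patch in the linked PR.

```python
from typing import Mapping, Optional


class FakeSettings:
    """Stand-in for scrapy.settings.Settings; only getdict() is needed here."""

    def __init__(self, values: Optional[dict] = None):
        self._values = values or {}

    def getdict(self, name: str) -> dict:
        return dict(self._values.get(name, {}))


class FakeCrawler:
    """Stand-in for scrapy.crawler.Crawler: settings always exist,
    but spider may be None (as under `scrapy shell`)."""

    def __init__(self, settings: FakeSettings, spider=None):
        self.settings = settings
        self.spider = spider  # None when no spider has been opened yet


def load_providers(crawler: FakeCrawler,
                   default_providers: Optional[Mapping] = None) -> dict:
    # Read from crawler.settings rather than crawler.spider.settings,
    # so a missing spider no longer raises AttributeError.
    return {
        **(default_providers or {}),
        **crawler.settings.getdict("SCRAPY_POET_PROVIDERS"),
    }


# A crawler with no spider, as produced by `scrapy shell`:
crawler = FakeCrawler(FakeSettings({"SCRAPY_POET_PROVIDERS": {"MyProvider": 500}}))
providers = load_providers(crawler, default_providers={"HttpResponseProvider": 100})
print(providers)  # {'HttpResponseProvider': 100, 'MyProvider': 500}
```

With the original code, the last call would fail with the same `AttributeError: 'NoneType' object has no attribute 'settings'` shown in the traceback above, because it dereferences `crawler.spider.settings`.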

When enabling `scrapy_poet.InjectionMiddleware` and running `scrapy shell <url>`, the `crawler` parameter used to initialize the `Injector` provided by `InjectionMiddleware` does not contain a spider.

The command `scrapy shell <url>` is supposed to create several objects, including a spider: "the Spider which is known to handle the URL, or a Spider object if there is no spider found for the current URL." (source)

Providing a valid spider with the --spider option raises the same error:

$ scrapy shell --spider <spider name> "http://httpbin.org/anything?json"

This error is not raised when running scrapy crawl.

Gallaecio commented 1 year ago

I can confirm this issue, @VMRuiz reported this to me a few days ago.

Whether it is an issue on scrapy-poet or on Scrapy, I do not know.

BurnzZ commented 1 year ago

Thanks for reporting. I think I found the problem.

@Pemh can you try installing and using scrapy-poet from the branch of this PR to see if it fully works on your end? https://github.com/scrapinghub/scrapy-poet/pull/94

Thanks

Pemh commented 1 year ago

It works, well done! Should I close the issue myself?

Gallaecio commented 1 year ago

> Should I close the issue myself?

No need, we will close it as we merge #94.