scrapinghub / scrapy-poet

Page Object pattern for Scrapy
BSD 3-Clause "New" or "Revised" License

Couldn't create custom Provider #86

Open suspectinside opened 2 years ago

suspectinside commented 2 years ago

Hi, here's a sample setup:

# ================= Providers pom/page_input_providers/providers.py
import logging
from collections.abc import Callable, Sequence
from scrapy_poet.page_input_providers import PageObjectInputProvider
from scrapy.settings import Settings

logger = logging.getLogger()
logger.setLevel(logging.INFO)

class Arq:
    async def enqueue_task(self, task: dict):
        logger.info('Arq.enqueue_task() enqueueing new task: %r', task)

class ArqProvider(PageObjectInputProvider):
    provided_classes = {Arq}
    name = 'ARQ_PROVIDER'

    async def __call__(self, to_provide: set[Callable]) -> Sequence[Callable]:
        return [Arq()]
# ================= Page Object Models
import attr
from web_poet.pages import Injectable, WebPage, ItemWebPage
from pom.page_input_providers.providers import Arq

@attr.define
class IndexPage(WebPage):
    arq: Arq

    @property
    async def page_titles(self):
        await self.arq.enqueue_task({'bla': 'bla!'})

        return [
            (el.attrib['href'], el.css('::text').get())
            for el in self.css('.selected a.reference.external')
        ]

The injectable dependency here is arq: Arq, and I'd like to work with that arq instance inside the page object.

# ================= the Spider
import uvloop, asyncio, pprint, logging
import scrapy
from scrapy.utils.reactor import install_reactor
from scrapy.http import HtmlResponse
from pom.util import stop_logging, wait
from pom.poms.pages import IndexPage
from pom.page_input_providers.providers import ArqProvider

import web_poet as wp

from scrapy_poet.page_input_providers import HttpClientProvider, PageParamsProvider

stop_logging()
uvloop.install()
install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor', 'uvloop.Loop')

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

# ================= Actual Spider Code:

class TitlesLocalSpider(scrapy.Spider):
    name = 'titles.local'
    start_urls = ['http://localhost:8080/orm/join_conditions.html']

    custom_settings = {
        'SCRAPY_POET_PROVIDERS': {
            ArqProvider: 500,    # MY PROVIDER FOR INJECTABLE arq: Arq
        },
    }

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        stop_logging()
        logger.info('=' * 30)
        return super().from_crawler(crawler, *args, **kwargs)

    async def parse(self, response, index_page: IndexPage, **kwargs):
        self.logger.info(await index_page.page_titles)

and I got an error like this:

Unhandled error in Deferred:

Traceback (most recent call last):
  File "~/.venv/lib/python3.10/site-packages/scrapy/crawler.py", line 205, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "~/.venv/lib/python3.10/site-packages/scrapy/crawler.py", line 209, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "~/.venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 1946, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "~/.venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 1856, in _cancellableInlineCallbacks
    _inlineCallbacks(None, gen, status, _copy_context())
--- <exception caught here> ---
  File "~/.venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 1696, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "~/.venv/lib/python3.10/site-packages/scrapy/crawler.py", line 101, in crawl
    self.engine = self._create_engine()
  File "~/.venv/lib/python3.10/site-packages/scrapy/crawler.py", line 115, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "~/.venv/lib/python3.10/site-packages/scrapy/core/engine.py", line 83, in __init__
    self.downloader = downloader_cls(crawler)
  File "~/.venv/lib/python3.10/site-packages/scrapy/core/downloader/__init__.py", line 83, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "~/.venv/lib/python3.10/site-packages/scrapy/middleware.py", line 59, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "~/.venv/lib/python3.10/site-packages/scrapy/middleware.py", line 41, in from_settings
    mw = create_instance(mwcls, settings, crawler)
  File "~/.venv/lib/python3.10/site-packages/scrapy/utils/misc.py", line 166, in create_instance
    instance = objcls.from_crawler(crawler, *args, **kwargs)
  File "~/.venv/lib/python3.10/site-packages/scrapy_poet/downloadermiddlewares.py", line 62, in from_crawler
    o = cls(crawler)
  File "~/.venv/lib/python3.10/site-packages/scrapy_poet/downloadermiddlewares.py", line 52, in __init__
    self.injector = Injector(
  File "~/.venv/lib/python3.10/site-packages/scrapy_poet/injection.py", line 50, in __init__
    self.load_providers(default_providers)
  File "~/.venv/lib/python3.10/site-packages/scrapy_poet/injection.py", line 63, in load_providers
    self.is_provider_requiring_scrapy_response = {
  File "~/.venv/lib/python3.10/site-packages/scrapy_poet/injection.py", line 64, in <dictcomp>
    provider: is_provider_requiring_scrapy_response(provider)
  File "~/.venv/lib/python3.10/site-packages/scrapy_poet/injection.py", line 348, in is_provider_requiring_scrapy_response
    plan = andi.plan(
  File "~/.venv/lib/python3.10/site-packages/andi/andi.py", line 303, in plan
    plan, _ = _plan(class_or_func,
  File "~/.venv/lib/python3.10/site-packages/andi/andi.py", line 341, in _plan
    sel_cls, arg_overrides = _select_type(
  File "~/.venv/lib/python3.10/site-packages/andi/andi.py", line 395, in _select_type
    if is_injectable(candidate) or externally_provided(candidate):
  File "~/.venv/lib/python3.10/site-packages/web_poet/pages.py", line 34, in is_injectable
    return isinstance(cls, type) and issubclass(cls, Injectable)
  File "/usr/lib/python3.10/abc.py", line 123, in __subclasscheck__
    return _abc_subclasscheck(cls, subclass)
builtins.TypeError: issubclass() arg 1 must be a class

So, could you please explain why this error happens and how to fix it?

BurnzZ commented 2 years ago

Hi @suspectinside, I'm not able to reproduce this locally; the following minimal code derived from your example runs okay on my end.

I suspect that there's something else outside of your code example that causes this issue. Unfortunately, the logs you've shared don't exactly pinpoint the problem.

Could you try copying the code below into 3 different modules in your project to see if it works?

# providers.py

import logging
from typing import Set
from collections.abc import Callable

from scrapy_poet.page_input_providers import PageObjectInputProvider

logger = logging.getLogger()

class Arq:
    async def enqueue_task(self, task: dict):
        logger.info('Arq.enqueue_task() enqueueing new task: %r', task)

class ArqProvider(PageObjectInputProvider):
    provided_classes = {Arq}
    name = 'ARQ_PROVIDER'

    async def __call__(self, to_provide: Set[Callable]):
        return [Arq()]
# pageobjects.py

import attr

from web_poet.pages import Injectable, WebPage, ItemWebPage
from .providers import Arq

@attr.define
class IndexPage(WebPage):
    arq: Arq

    @property
    async def page_titles(self):
        await self.arq.enqueue_task({'bla': 'bla!'})

        return [
            (el.attrib['href'], el.css('::text').get())
            for el in self.css('.selected a.reference.external')
        ]
# spiders/title_spider.py

import scrapy
from ..pageobjects import IndexPage
from ..providers import ArqProvider

class TitlesLocalSpider(scrapy.Spider):
    name = 'titles.local'
    start_urls = ["https://books.toscrape.com"]

    custom_settings = {
        "SCRAPY_POET_PROVIDERS": {
            ArqProvider: 600,  # MY PROVIDER FOR INJECTABLE arq: Arq
        },
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_poet.InjectionMiddleware": 543,
        },
    }

    async def parse(self, response, index_page: IndexPage):
        self.logger.info(await index_page.page_titles)
# ... omitted log lines
2022-09-05 11:57:31 [scrapy.core.engine] INFO: Spider opened
2022-09-05 11:57:31 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-09-05 11:57:31 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-09-05 11:57:34 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://books.toscrape.com/robots.txt> (referer: None)
2022-09-05 11:57:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com> (referer: None)
2022-09-05 11:57:35 [root] INFO: Arq.enqueue_task() enqueueing new task: {'bla': 'bla!'}
2022-09-05 11:57:35 [titles.local] INFO: []
2022-09-05 11:57:35 [scrapy.core.engine] INFO: Closing spider (finished)
# ... omitted log lines
Gallaecio commented 2 years ago

Could **kwargs in parse be the cause?

BurnzZ commented 2 years ago

I've tried adding the **kwargs but it wasn't enough to cause the same issue.

suspectinside commented 2 years ago

Yep! Thanks a lot, I was able to find the source of the problem: it happens when I use the new builtin set (with generics support) instead of typing.Set, which has been deprecated since 3.9.

So, if I change __call__'s declaration from this:

async def __call__(self, to_provide: set[Callable], settings: Settings) -> Sequence[Callable]:

into something like this:

from typing import Set
# ...
async def __call__(self, to_provide: Set[Callable], settings: Settings) -> Sequence[Callable]:

everything works correctly.
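
For context, here is a minimal standalone sketch (using a stand-in class, not the actual web_poet code) of why the builtin generic trips up the last step of the traceback above on this Python 3.10: set[Callable] is a types.GenericAlias rather than a plain class, yet it still passes the isinstance(cls, type) check, so is_injectable() reaches issubclass() and fails there.

# sketch.py - standalone illustration only; Injectable here is a stand-in,
# not the real web_poet.pages.Injectable
import abc
from collections.abc import Callable

class Injectable(abc.ABC):
    pass

alias = set[Callable]              # a types.GenericAlias, not a plain class
print(isinstance(alias, type))     # True on this Python 3.10, so is_injectable() doesn't short-circuit

try:
    issubclass(alias, Injectable)  # ABCs require a real class as the first argument
except TypeError as exc:
    print(exc)                     # issubclass() arg 1 must be a class

Presumably typing.Set[Callable] works because that alias doesn't claim to be an instance of type, so is_injectable() short-circuits at the isinstance() check and issubclass() is never called.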

By the way, collections.abc.Set doesn't work either. On the other hand, the Python team has deprecated all the typing.{Set, Dict, List, etc.} aliases in favour of the builtins and collections.abc.*, so maybe it would be correct to add support for them to the IoC engine too?

In any case, scrapy-poet (web-poet) is one of the best approaches I've ever seen, and the combination of IoC and the Page Object Model pattern for scraping really shines! Thanks a lot for it ;)

suspectinside commented 2 years ago

...and just one more quick question: what's the best (most correct) way to provide a singleton object instance using the scrapy-poet IoC infrastructure? Let's say the above-mentioned Arq should be a singleton service; what is the best way to return it from the __call__ method in this case (can I configure the IoC container somewhere, or something like that)?

BurnzZ commented 2 years ago

I see, great catch! I believe we can use the typing module as a short-term workaround since PEP 585 mentions:

The deprecated functionality will be removed from the typing module in the first Python version released 5 years after the release of Python 3.9.0.

I'm not quite sure how large an undertaking it would be to completely move to the builtins, since web-poet and scrapy-poet still support 3.7 and 3.8. I'm guessing that once we drop those versions when they reach Python end of life, the switch will be much easier.
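
(For reference, a tiny illustration of why the builtins aren't a drop-in replacement while 3.7/3.8 are supported; this is plain Python behaviour, not anything web-poet specific.)

import sys
from typing import Set

# PEP 585 builtin generics only support subscription at runtime from 3.9 onwards;
# on 3.7/3.8, set[int] raises "TypeError: 'type' object is not subscriptable".
if sys.version_info >= (3, 9):
    hint = set[int]
else:
    hint = Set[int]

print(hint)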

In any case, scrapy-poet (web-poet) is one of the best approaches I've ever seen, and the combination of IoC and the Page Object Model pattern for scraping really shines! Thanks a lot for it ;)

💖 That'd be @kmike's work for you :)

what's the best (most correct) way to provide a singleton object instance using the scrapy-poet IoC infrastructure?

Lots of approaches on this one, but I think the most convenient is to assign it as a class variable in the provider itself. Technically it's not a true singleton in this case, since Arq could still be instantiated outside of the provider. However, that should still be okay, since the provider would ensure that the Arq it's providing is the same instance for every __call__() method call.
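
Roughly, something like this (an untested sketch reusing the Arq and ArqProvider names from earlier in this thread, not library code):

# a sketch of providers.py, assuming Arq is defined above as in the earlier examples

from scrapy_poet.page_input_providers import PageObjectInputProvider

class ArqProvider(PageObjectInputProvider):
    provided_classes = {Arq}
    name = 'ARQ_PROVIDER'

    arq = Arq()  # class variable: one shared instance handed out on every call

    async def __call__(self, to_provide):
        return [self.arq]

If creating Arq at import time is undesirable, the instance can instead be created lazily inside __call__() and cached on the class.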