Closed BurnzZ closed 1 year ago
I'm not sure how it could work. The referenced code in scrapy-poet is Scrapy-specific; that's why it's in the scrapy-poet package, not in web-poet.
Also, page objects may define all kinds of dependencies which are not related to Scrapy's response; a `from_response` constructor looks quite limited. There could be page objects which don't need a Scrapy response, and page objects which need something that doesn't come from a Scrapy response (i.e. the response is not enough). There could be many different ways to get these dependencies: extract them from some already-present data (like `from_response`), make some async requests (using Twisted deferreds? using asyncio? etc.). It could also be the case that dependencies are inspected at import time, but actually provided on a different machine.
So, -1 to adding `from_response` to Page Objects, because it ties web-poet to Scrapy, and it's also not generic enough. Using something like `from_response` is an anti-pattern, because it'd mean that only a certain kind of Page Object is supported. That's not an issue if a framework like scrapy-poet is used, which relies on andi.
Overall, if we can extract some code to create nested dependencies, that would be good, but I'm not sure how to do it beyond what andi already provides.
Probably what's missing is something simple, perhaps a non-asynchronous framework, to run page objects, which could be used for quick tests, in IPython notebooks, etc. But that's a separate discussion.
Background
Given the following PO structure below:
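The original snippet wasn't preserved here, but based on the class names in the discussion the structure looks roughly like this self-contained sketch; plain dataclasses stand in for web-poet's attrs-based classes, and `HTMLFromResponse` is a dependency derived from the raw response:

```python
# Minimal stand-ins for web-poet's classes, so the sketch is self-contained;
# the real base classes live in web-poet and are attrs-based.
from dataclasses import dataclass


@dataclass
class ResponseData:
    """The raw response: holds the URL and body."""
    url: str
    html: str


@dataclass
class WebPage:
    """Base PO: depends only on the response."""
    response: ResponseData


@dataclass
class HTMLFromResponse:
    """A dependency derived from the response."""
    html: str


@dataclass
class HTMLWebPage(WebPage):
    """Subclass of WebPage that adds a second constructor dependency."""
    html: HTMLFromResponse
```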
The following would not work, since `HTMLWebPage` is now a subclass of `WebPage` and effectively requires both `response: ResponseData` and `html: HTMLFromResponse` when using its constructor. We'll need to provide both of the required constructor arguments:
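Using minimal stand-in dataclasses (not web-poet's actual API), the manual instantiation looks like this:

```python
from dataclasses import dataclass


@dataclass
class ResponseData:
    url: str
    html: str


@dataclass
class HTMLFromResponse:
    html: str


@dataclass
class HTMLWebPage:
    response: ResponseData
    html: HTMLFromResponse


response = ResponseData(url="http://example.com", html="<p>hi</p>")

# HTMLWebPage(response=response) raises TypeError: the 'html' argument
# is also required, so every intermediate dependency has to be built by
# hand before the PO itself can be constructed:
page = HTMLWebPage(
    response=response,
    html=HTMLFromResponse(html=response.html),
)
```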
This is a bit tedious since, underneath, the actual core dependency in the tree is only `ResponseData`. If the PO we're instantiating has a deeply nested dependency structure, it would be hard to keep track of all the necessary constructor arguments.

However, when POs are used in a Scrapy project that uses the `InjectionMiddleware` provided by https://github.com/scrapinghub/scrapy-poet, this isn't a problem, since the middleware takes care of resolving all the necessary dependencies for the PO (it uses https://github.com/scrapinghub/andi underneath):
Problem
@gatufo raised a good point about using POs outside the context of a Scrapy project, which currently means losing access to the dependency resolution conveniently provided by https://github.com/scrapinghub/scrapy-poet.
Supporting this would also expand the use cases for POs beyond the spider, such as using them in a script, deploying them behind an API, etc.
Proposal
This issue aims to discuss and explore the possibility of moving the necessary injection logic already implemented in scrapy-poet (reference module) into web-poet itself.
The migrated logic could then be accessed via an alternative constructor named `from_response()` (see the example below). `from_response()` could be renamed to something else, but this name closely follows Scrapy's convention for alternative constructors like `from_crawler()`, `from_settings()`, etc.
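A minimal sketch of the proposed constructor, again with stand-in dataclasses; a real implementation would reuse the andi-based resolution from scrapy-poet rather than the hard-coded build shown here:

```python
from dataclasses import dataclass


@dataclass
class ResponseData:
    url: str
    html: str


@dataclass
class HTMLFromResponse:
    html: str


@dataclass
class HTMLWebPage:
    response: ResponseData
    html: HTMLFromResponse

    @classmethod
    def from_response(cls, response: ResponseData) -> "HTMLWebPage":
        # A real implementation would walk the constructor's dependency
        # tree (as scrapy-poet does via andi) and build each intermediate
        # dependency from the single core ResponseData input.
        return cls(response=response, html=HTMLFromResponse(html=response.html))


# Only the core dependency has to be supplied:
page = HTMLWebPage.from_response(
    ResponseData(url="http://example.com", html="<p>hi</p>")
)
```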