Closed kmike closed 1 year ago
I'm also slightly in favor of going with Approach 1, since the API is quite direct. The idea of simply specifying the item that you want rather than the PageObject needed to extract it is quite attractive. ✨
I think one of the common use cases would be extracting a subset of the fields, which is a problem as you've pointed out. I'm curious as to how we may be able to solve it. Adding fields is quite straightforward, since a new ItemType can be created containing the new fields.
There could also be a "reverse" registry, `{ProductPage: Product}`, which scrapy-poet can use to pick the right Page Object from the registry if ProductPage is requested, though this approach has some challenges.
You mentioned the following idea above has some challenges. Could you expand on the specific issues surrounding it? I was under the impression that this "reverse registry" works almost the same as the one with Approach 2 (but the pairs are reversed).
In any case, I think using `typing.Annotated` should be a step in the right direction. It's only available starting in Python 3.9, but we could get it from https://github.com/python/typing_extensions for 3.7 and 3.8, although with some slight limitations: "`get_origin` and `get_args` lack support for `Annotated` in Python 3.8" (not sure for now if we need these).
Take my feedback with a grain of salt: I am quite unfamiliar with the use of overrides, I am not sure what the best approaches are for page object organization (when to subclass, when not to), and I have had to re-read this issue a few times and glance at the overrides documentation.
My thoughts:
+1 to approach 1.
I am a bit confused about the 2 code examples in the section about pros and cons. Specifically, I am not sure if `PowerfulProductPage` is meant to be a subclass of `ProductPage` or not.
Would it not be possible to allow the following syntax, as long as `PowerfulProductPage` is a subclass of `ProductPage`?
```python
# your project:
@handle_urls("example.com")
class ExamplePageObject(PowerfulProductPage):
    # ...

# generic spider:
def parse_product(self, response: DummyResponse, page: ProductPage):
    # ...
```
If `PowerfulProductPage` is not meant to be a `ProductPage` subclass, why would that be, especially if `ProductPage` is a "marker" class?
Approach 2 looks more powerful, especially if web-poet fields are used heavily, and a spider actually needs to access only some of the fields, but not all of them.
I do not love the use of `Annotated` instead of a page object for this purpose.
I think a reverse registry to find related page objects based on what they return makes sense to enable getting the right overridden page class, even if as you mention it may be challenging.
But even without it, I would rather set `overrides=ProductPage` and get a page than use `Annotated` to get a product with a subset of fields.
If I really wanted to get a product with a subset of fields, I would prefer an approach that does not require me to list those fields in the callback signature, e.g. maybe define a product subclass, or get the page object and use something like `page.to_items(fields=[…])`.
[Approach 1] doesn't support some features which Approach 2 gives.
Could you elaborate? It is not clear to me that there is anything in Approach 2 that cannot be achieved with Approach 1, assuming Approach 1 includes using `overrides=` where needed.
Unrelated to the goals here: I wonder if we should work towards making a callback signature without `response: DummyResponse` eventually possible in Scrapy:

```python
def parse(self, product: Product):
    ...
```

We could deprecate `response` being a positional argument, in preparation for a future where it will be an optional, keyword-only parameter called `response`.
We could even implement a boolean setting to allow enabling this new behavior before the deprecation period passes.
Once it is possible, we could interpret a missing `response` parameter as equivalent to `response: DummyResponse`, and maybe even deprecate `DummyResponse` if it has no other use.
@BurnzZ
You mentioned the following idea above has some challenges. Could you expand on the specific issues surrounding it? I was under the impression that this "reverse registry" works almost the same as the one with Approach 2 (but the pairs are reversed).
I'm trying to recall what that was, but can't :) So, likely you're right: the same challenges as with Approach 2.
In any case, I think using typing.Annotated should be a step in the right direction. It's only available starting in Python 3.9 but we could get it in https://github.com/python/typing_extensions for 3.7 and 3.8. Although, with some slight limitations: "get_origin and get_args lack support for Annotated in Python 3.8" (not sure for now if we need these).
get_origin and get_args are not available in 3.7, so we'd need to have some backports anyways.
@Gallaecio
If PowerfulProductPage is not meant to be a ProductPage subclass, why would that be, especially if ProductPage is a "marker" class?
Most likely, it's a ProductPage subclass. So far, we've been using exact classes everywhere (scrapy-poet, andi), without allowing subclasses. So, while possible, I think there would be challenges and edge cases which we were avoiding so far. For example, a subclass may return a different item type, and this "implicit" binding to ProductPage could be wrong. But I'm not sure.
But even without it, I would rather set overrides=ProductPage and get a page than use Annotated to get a product with a subset of fields.
We might need something like Annotated anyway. If a Product instance is returned by an HTTP API, and the HTTP API allows picking fields, how would you specify a Page Object dependency for that?
It is not clear to me that there is anything in approach 2 that cannot be achieved with approach 1, assuming approach 1 includes using overrides= where needed.
This "where needed" is an issue. With the common syntax, you need to use `overrides=...` when you define a Page Object. But you only know if it's needed when a Page Object is used in some spider. So, if we don't know yet how a Page Object is going to be used, we'd need to always use `overrides=`, which is the status quo.
This is solvable though: if I'm not mistaken, it should be possible to use `registry.handle_urls` as a function, not as a decorator. So, users can "register" page objects as supporting override of some base class, either one by one, or using some code to manipulate the registry. But I haven't checked that it all indeed works.
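If that works, registering an already-defined page object after the fact might look roughly like this. This is only a sketch with stand-in classes, not the real web-poet API; the `RulesRegistry` shape and the `instead_of` parameter name are assumptions for illustration:

```python
# Sketch: using a handle_urls-style decorator as a plain function, so page
# objects can be registered after they are defined. All classes below are
# minimal stand-ins, not web-poet imports.

class RulesRegistry:
    def __init__(self):
        self.rules = []

    def handle_urls(self, pattern, *, instead_of=None):
        # Returns the registering function; used either as a decorator
        # or called directly on an existing class.
        def register(cls):
            self.rules.append((pattern, cls, instead_of))
            return cls
        return register

registry = RulesRegistry()

class ProductPage: ...          # "marker" base class (stand-in)
class ExampleProductPage: ...   # a page object defined elsewhere

# Decorator syntax used as a function call, after the class exists:
registry.handle_urls("example.com", instead_of=ProductPage)(ExampleProductPage)

assert registry.rules == [("example.com", ExampleProductPage, ProductPage)]
```

The point is only that nothing about decorator-based registration forces it to happen at class-definition time; a downstream project could loop over its page objects and register them in bulk.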
Unrelated to the goals here: I wonder if we should work towards making a callback signature without response: DummyResponse eventually possible in Scrapy
+1.
We might need something like Annotated anyways. If a Product instance is returned by HTTP API, and HTTP API allows to pick fields, how would you specify a Page Object dependency for that?
I was thinking you would ask for the page object instead, and use page params. But I am not sure anymore.
My lack of love for `Annotated` is based on not liking long typing data in signatures, which is not a strong argument by any means. And given that you can define types separately, i.e. define a new, short, readable type (e.g. `ProductSubset`) somewhere in your code with `Annotated` and use that short type in your signature, the argument becomes void.
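For illustration, such an alias could look like this. `ProductSubset` and the field-name metadata are hypothetical names, not an established web-poet convention:

```python
# Sketch: hiding a verbose Annotated type behind a short, readable alias.
from typing import Annotated, get_args

class Product: ...  # stand-in item class

# Define the long type once...
ProductSubset = Annotated[Product, "name", "price"]

# ...and keep the callback signature short:
def parse_product(product: ProductSubset):
    ...

# A framework could still recover the wrapped type and the metadata:
assert get_args(ProductSubset) == (Product, "name", "price")
```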
This "where needed" is an issue. With the common syntax, you need to use `overrides=...` when you define a Page Object. But you only know if it's needed when a Page Object is used in some spider. So, if we don't know yet how a Page Object is going to be used, we'd need to always use `overrides=`, which is the status quo.
Oh, right, the page object and the spider may be defined by separate people on separate projects, which is the whole point of web-poet to begin with.
This is solvable though: if I'm not mistaken, it should be possible to use `registry.handle_urls` as a function, not as a decorator. So, users can "register" page objects as supporting override of some base class, either one by one, or using some code to manipulate the registry. But I haven't checked that it all indeed works.
I see how this kind of defeats the benefits of approach 1.
After re-reading it:
So, now it's +0.75 from me for Approach 1, not +0.25 :)
We have it released.
Hey! I'm exploring how to make the following work:
In other words, how to make `@handle_urls("example.com")` work for an `ItemPage[Product]` subclass without a need to use `instead_of` in `handle_urls`, and without a need to use a base page object for `instead_of`.
I can see 2 main approaches here.
Approach 1: support it directly
In the example, handle_urls doesn't really define any override rule. Instead, we have a declaration that ExamplePage can return a Product item for an example.com page. This information should be enough to allow creation of a scrapy-poet provider for items: we know the website, we know which item is needed, and we can use the Page Object registry to find the right page object, according to domain and priority.
To implement it, web-poet needs to be slightly refactored:

- In addition to `ApplyRule(for_patterns=..., use=ExampleProductPage, instead_of=GenericProductPage)`, it'd be possible to specify `ApplyRule(for_patterns=..., use=ExampleProductPage, to_return=Product)`.
- The `handle_urls` decorator would pick up `to_return=Product` from the `ItemPage[Product]` automatically (probably unless the `instead_of` argument is defined? but that's not clear; there are use cases for supporting both at the same time).

When implementing it, we should make sure that priorities work as intended. For example, it should be possible to define and configure a Page Object which provides Product instances using the AutoExtract API, and have it behave in a sensible way (enabled for all domains, with custom Page Objects taking priority over this default one).
Approach 2: define standard generic Page Object classes
In this case, "override rules" stay override rules. There is a registry of `{item_type: generic_page_object_class}` defined somewhere, e.g. `{Product: ProductPage}`. Or, maybe, `{Product: [ProductPage, AnotherGenericProductPage]}`. There should be an API to extend and modify this registry. handle_urls looks up this registry, and picks the `instead_of` argument based on it.
Pros/cons
Overall, I like an approach with items more, because it seems to provide a nicer API:
A risk here is that we may decide that standardized page objects are actually very important. For example, unlike items, they allow extracting only some of the fields. They may also provide some methods other than to_item which can be used in overrides.
A challenge with standard base classes is how much logic you put there. On one hand, there is a lot of useful logic which can be put in a base class, and which can save developer time. For example, default implementation for some of the attributes, or helper methods. But the more powerful your base class is, the more you need to assume about the implementation. So, it might be wise to have these "standard" page objects mostly as "markers", to ensure that a wide range of page objects is compatible with each other. But, if we do so, we still need to put the "powerful" logic somewhere.
If you have a separate "marker" base class used for overrides, and a feature-full base class used in a project, usage becomes less straightforward; you'd probably need to do something like this, and the original challenge is not solved:
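A hypothetical illustration of that split (stand-in classes and decorator, not the real web-poet API): the override target still has to be spelled out, which was the original problem.

```python
# Sketch: separate "marker" and feature-full base classes.

class ProductPage:                         # "marker" base, used in overrides
    ...

class PowerfulProductPage(ProductPage):    # feature-full base with helpers
    ...

def handle_urls(pattern, instead_of=None):  # stand-in decorator
    def register(cls):
        cls.instead_of = instead_of
        return cls
    return register

# instead_of still has to be written explicitly on every page object:
@handle_urls("example.com", instead_of=ProductPage)
class ExamplePage(PowerfulProductPage):
    ...
```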
The alternative is:
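That is, sketched with stand-in classes, the generic spider requests the feature-full class directly:

```python
# Sketch of the alternative: the generic spider is typed against the
# feature-full base class instead of the marker.

class ProductPage: ...                      # marker base (stand-in)
class PowerfulProductPage(ProductPage): ... # feature-full base (stand-in)

# generic spider:
def parse_product(self, response, page: PowerfulProductPage):
    # Page objects that subclass only ProductPage, not
    # PowerfulProductPage, no longer satisfy this signature.
    ...
```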
But it defeats the purpose of defining a standard ProductPage which is not tied to a project-specific spider: the generic spider above doesn't support pages which are just ProductPage subclasses; it needs its own PowerfulProductPage subclasses. It also requires solving the case when a page object which uses handle_urls is not a strict subclass of PowerfulProductPage.
With the "items" approach it's not an issue: as soon as page objects return the same item class, they're considered compatible; there is no need for them to share exactly the same base class, besides having `ItemPage[ItemClass]` or `Returns[ItemClass]` somewhere in the hierarchy.
Not extracting all fields with Approach 1
The main con I see with Approach 1 is the case where not all item fields are required in a generic spider.
There is no ProductPage anymore, and if we define it, it's not stated that ExamplePage can be used instead of ProductPage. There are some possible solutions to it, e.g. using typing.Annotated:
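One possible shape of the Annotated-based solution, sketched with stand-in classes; the field-subset metadata convention shown here is invented for illustration, not an established web-poet API:

```python
# Hypothetical sketch: annotating the requested item with the subset of
# fields the spider actually needs.
from typing import Annotated, get_args

class Product: ...        # stand-in item class
class DummyResponse: ...  # stand-in for scrapy_poet's DummyResponse

def parse_product(self, response: DummyResponse,
                  product: Annotated[Product, "name", "price"]):
    # only "name" and "price" would need to be extracted
    ...

# The framework could inspect the annotation to learn the wanted fields:
assert get_args(Annotated[Product, "name", "price"]) == (Product, "name", "price")
```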
There could also be a "reverse" registry, `{ProductPage: Product}`, which scrapy-poet can use to pick the right Page Object from the registry if ProductPage is requested, though this approach has some challenges.
Using custom methods
Users might want to have a Page Object with some custom methods, and use it in a crawler:
This is compatible with both Approach 1 and Approach 2, if ProductListPage uses ProductList as a dependency:
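A sketch of that setup, using dataclasses as stand-ins for the real web-poet machinery (the class and field names are illustrative):

```python
# Sketch: a page object with a custom method, taking the item as a
# dependency, so item extraction and the helper can vary independently.
from dataclasses import dataclass, field

@dataclass
class ProductList:                      # the item class (stand-in)
    product_urls: list = field(default_factory=list)

@dataclass
class ProductListPage:                  # page object with a custom method
    item: ProductList                   # would be injected via Approach 1

    def links(self):
        # custom, overridable navigation logic built on the item
        return self.item.product_urls

page = ProductListPage(item=ProductList(product_urls=["http://example.com/p1"]))
assert page.links() == ["http://example.com/p1"]
```

Because `ProductListPage` depends only on the `ProductList` item, swapping in a different page object that returns `ProductList` changes the extraction without touching `links()`, and vice versa.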
In this case, one can define POs for ProductList using Approach 1, and they will be used automatically to create the `item` dependency. It's also possible to override ProductListPage itself, if it's desired to have a custom `links` implementation. So, both pages below would be applied properly:
Conclusion
Approach 1 is tempting because it seems to provide a better API: one can receive items in callbacks, and just use them; there is no need e.g. to use `item = await ensure_deferred(page.to_item())` in every callback, and no need to use `async def parse`. It also gives full typing support, which can be hit or miss with page objects.
For me it's also much easier to understand: there are no "overrides" anymore, no semi-magical replacement of one class with another. We're just saying "this Page Object returns a Product item on the example.com website", and then, when a Product item is needed on the example.com website, we know how to get it.
Approach 2 looks more powerful, especially if web-poet fields are used heavily, and a spider actually needs to access only some of the fields, but not all of them.
Currently I'm +0.25 to implement, follow and recommend Approach 1, mostly because of its simplicity, even if it doesn't support some features which Approach 2 gives. But I do see advantages of Approach 2 as well, so it's not a strong opinion.
Thoughts?