scrapinghub / web-poet

Web scraping Page Objects core library
https://web-poet.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

"Apply rules" instead of overrides #77

Closed (kmike closed this 1 year ago)

kmike commented 2 years ago

hey! I’m exploring how to make the following work:

from zyte_common_items import Product
from web_poet import ItemPage, handle_urls

@handle_urls("example.com")
class ExamplePage(ItemPage[Product]):
    # ...

In other words, how do we make @handle_urls("example.com") work for an ItemPage[Product] subclass without needing to pass instead_of to handle_urls, and without needing a base page object to use as instead_of?

I can see 2 main approaches here.

Approach 1: support it directly

In the example, handle_urls doesn't really define an override rule. Instead, it declares that ExamplePage can return a Product item for example.com pages. This information should be enough to allow the creation of a scrapy-poet provider for items:

def parse(self, response: DummyResponse, product: Product):
    # ...

We know the website, we know which item is needed, and we can use the Page Object registry to find the right page object, according to domain and priority.
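A rough, self-contained sketch of that lookup; ToyRule and pick_page_object are illustrative stand-ins, not web-poet APIs:

from dataclasses import dataclass

@dataclass
class ToyRule:
    pattern: str      # e.g. "example.com"
    use: type         # the page object class to use
    to_return: type   # the item class it returns
    priority: int = 500

def pick_page_object(rules, item_cls, url):
    # Keep the rules that return the requested item and match the URL,
    # then take the highest-priority one.
    matching = [r for r in rules if r.to_return is item_cls and r.pattern in url]
    return max(matching, key=lambda r: r.priority).use if matching else None

A scrapy-poet provider could then build the chosen page object, call its to_item(), and inject the resulting Product into the callback.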

To implement it, web-poet needs to be slightly refactored:

  1. We should rename “overrides” and “override rule” to something like “apply rules”. This includes changes to class names, module names, method names, and to the documentation as well. In addition to the old ApplyRule(for_patterns=..., use=ExampleProductPage, instead_of=GenericProductPage), it’d be possible to specify ApplyRule(for_patterns=..., use=ExampleProductPage, to_return=Product).
  2. The handle_urls decorator would pick up to_return=Product from ItemPage[Product] automatically (probably unless the instead_of argument is defined? That's not clear; there are use cases for supporting both at the same time). A rough sketch of this follows the list.
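A minimal sketch of point 2, assuming the item class can be read from the generic parameter; get_item_class is a hypothetical helper, and the real implementation may differ:

from typing import get_args

def get_item_class(page_cls):
    # Look for a parameterized base such as ItemPage[Product] and
    # return its type argument (Product).
    for base in getattr(page_cls, "__orig_bases__", ()):
        args = get_args(base)
        if args:
            return args[0]
    return None

# handle_urls("example.com") could then record, roughly:
# ApplyRule(for_patterns=..., use=ExamplePage, to_return=get_item_class(ExamplePage))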

When implementing it, we should make sure that priorities work as intended. For example, it should be possible to define and configure a Page Object which provides Product instances using the AutoExtract API, and have it behave in a sensible way (enabled for all domains, with custom Page Objects taking priority over this default one).

Approach 2: define standard generic Page Object classes

In this case, "override rules" stay override rules. There is a registry of {item_type: generic_page_object_class} defined somewhere, e.g. {Product: ProductPage}, or maybe {Product: [ProductPage, AnotherGenericProductPage]}. There should be an API to extend and modify this registry. handle_urls looks up this registry and picks the instead_of argument based on it.
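A sketch of this registry, with illustrative names (GENERIC_PAGES and register_generic_page are not real web-poet APIs; get_item_class is the hypothetical helper from the Approach 1 sketch above):

from web_poet import ItemPage
from zyte_common_items import Product

class ProductPage(ItemPage[Product]):
    """Hypothetical standard/generic page object for Product."""

# {item_type: generic_page_object_class}; could also map to a list of classes.
GENERIC_PAGES = {Product: ProductPage}

def register_generic_page(item_cls, page_cls):
    GENERIC_PAGES[item_cls] = page_cls

# Inside handle_urls, roughly:
#   item_cls = get_item_class(page_cls)                     # e.g. Product
#   instead_of = instead_of or GENERIC_PAGES.get(item_cls)  # e.g. ProductPage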

Pros/cons

Overall, I like the approach with items more, because it seems to provide a nicer API.

A risk here is that we may decide that standardized page objects are actually very important. For example, unlike items, they allow extracting only some of the fields. They may also provide methods other than to_item which can be used in overrides.

A challenge with standard base classes is how much logic to put there. On the one hand, there is a lot of useful logic which can be put in a base class and which can save developer time - for example, default implementations for some of the attributes, or helper methods. But the more powerful your base class is, the more you need to assume about the implementation. So, it might be wise to have these "standard" page objects mostly as "markers", to ensure that a wide range of page objects is compatible with each other. But if we do so, we still need to put the "powerful" logic somewhere.

If you have a separate "marker" base class used for overrides, and a feature-rich base class used in a project, usage becomes less straightforward - you'd probably need to do something like the following, and the original challenge is not solved:

# your project:
@handle_urls("example.com", instead_of=ProductPage)
class ExamplePageObject(PowerfulProductPage):
    # ...

# generic spider:
def parse_product(self, response: DummyResponse, page: ProductPage):
    # ...

The alternative is:

# your project:
@handle_urls("example.com")
class ExamplePageObject(PowerfulProductPage):
    # ...

# generic spider:
def parse_product(self, response: DummyResponse, page: PowerfulProductPage):
    # ...

But it defeats the purpose of defining a standard ProductPage which is not tied to a project-specific spider - the generic spider above doesn't support pages which are just ProductPage subclasses; it needs its own PowerfulProductPage subclasses. It also requires solving the case where a page object which uses handle_urls is not a strict subclass of PowerfulProductPage.

With the "items" approach this is not an issue - as soon as page objects return the same item class, they're considered compatible; there is no need for them to share exactly the same base class, besides having ItemPage[ItemClass] or Returns[ItemClass] somewhere in the hierarchy.
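For example (a minimal sketch, assuming Returns works as documented in web-poet; class names are illustrative):

from web_poet import ItemPage, Returns
from zyte_common_items import Product

class ExampleComPage(ItemPage[Product]):
    ...

class AnotherSitePage(ItemPage, Returns[Product]):
    ...

# These classes don't share a Product-specific base class, yet both return
# Product, so a spider that depends on Product can be served by either one.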

Not extracting all fields with Approach 1

The main con I see with Approach 1 is the case where not all item fields are required in a generic spider:

@handle_urls("example.com")
class ExamplePage(ItemPage[Product]):
    # ...

def parse(self, response: DummyResponse, product: Product):
    # only some product attributes are used, not all of them

There is no ProductPage anymore, and if we define it, it's not stated that ExamplePage can be used instead of ProductPage. There are some possible solutions to it, e.g. using typing.Annotated:

def parse(self, response: DummyResponse, product: Annotated[Product, pick_fields(["price", "name"])]):
    # ...

There could also be a "reverse" registry, {ProductPage: Product}, which scrapy-poet can use to pick the right Page Object from the registry when ProductPage is requested, though this approach has some challenges.
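A sketch of that "reverse" registry, reusing the hypothetical pick_page_object helper from the Approach 1 sketch above (all names illustrative):

# Map "standard" page object classes to the items they stand for.
PAGE_TO_ITEM = {ProductPage: Product}

def resolve_page_request(requested_cls, rules, url):
    # If the requested page object has an item equivalent, delegate to the
    # item-based rules; otherwise keep the requested class as-is.
    item_cls = PAGE_TO_ITEM.get(requested_cls)
    if item_cls is None:
        return requested_cls
    return pick_page_object(rules, item_cls, url) or requested_cls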

Using custom methods

Users might want to have a Page Object with some custom methods, and use it in a crawler:


class ProductListPage(ItemPage[ProductList]):
    def links(self):
        # example utility method
        return [self.paginationNext.url] + [product.url for product in self.products]

def parse(self, response: DummyResponse, page: ProductListPage):
    yield from response.follow_all(page.links())
    # ...

This is compatible with both Approach 1 and Approach 2, if ProductListPage uses ProductList as a dependency:

@attrs.define
class ProductListPage(ItemPage[ProductList]):
    item: ProductList

    def links(self):
        # example utility method
        return [self.item.paginationNext.url] + [product.url for product in self.item.products]

    def to_item(self):
        return self.item

In this case, one can define POs for ProductList using Approach 1, and they will be used automatically to create the item dependency. It's also possible to override ProductListPage itself, if a custom links implementation is desired. So, both pages below would be applied properly:


# for foo.com we state that "FooProductListPage returns productList" 
@handle_urls("foo.com")
class FooProductListPage(ItemPage[ProductList]):
    ...

# for bar.com we override ProductListPage itself
@handle_urls("bar.com", instead_of=ProductListPage)
class BarProductListPage(ProductListPage):
    def links(self):
        # ...

Conclusion

Approach 1 is tempting because it seems to provide a better API - one can receive items in callbacks and just use them; there is no need e.g. to use item = await ensure_deferred(page.to_item()) in every callback, and no need to use async def parse. It also gives full typing support, which can be hit or miss with page objects.

For me it's also much easier to understand - there are no "overrides" anymore, no semi-magical replacement of one class with another. We're just saying "this Page Object returns a Product item on the example.com website", and then, when a Product item is needed on the example.com website, we know how to get it.

Approach 2 looks more powerful, especially if web-poet fields are used heavily, and a spider actually needs to access only some of the fields, but not all of them.

Currently I'm +0.25 to implement, follow and recommend Approach 1, mostly because of its simplicity, even if it doesn't support some features which Approach 2 gives. But I do see advantages of Approach 2 as well, so it's not a strong opinion.

Thoughts?

BurnzZ commented 2 years ago

I'm also slightly in favor of going with Approach 1, since the API is quite direct. The idea of simply specifying the item that you want, rather than the PageObject needed to extract it, is quite attractive. ✨

I think one of the common use cases would be extracting a subset of the fields, which is a problem as you've pointed out. I'm curious as to how we may be able to solve it. Adding fields is quite straightforward, since a new ItemType can be created containing the new fields.
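For instance, a hypothetical extension of the Product item (assuming attrs-based items, as in zyte-common-items):

from typing import Optional

import attrs
from zyte_common_items import Product

@attrs.define
class ExtendedProduct(Product):
    # Hypothetical extra field, not part of the standard Product item.
    warranty: Optional[str] = None

Page objects can then declare ItemPage[ExtendedProduct], and spiders can depend on ExtendedProduct directly.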

> There could also be a "reverse" registry, {ProductPage: Product}, which scrapy-poet can use to pick the right Page Object from the registry when ProductPage is requested, though this approach has some challenges.

You mentioned the following idea above has some challenges. Could you expand on the specific issues surrounding it? I was under the impression that this "reverse registry" works almost the same as the one with Approach 2 (but the pairs are reversed).

In any case, I think using typing.Annotated should be a step in the right direction. It's only available starting in Python 3.9 but we could get it in https://github.com/python/typing_extensions for 3.7 and 3.8. Although, with some slight limitations: "get_origin and get_args lack support for Annotated in Python 3.8" (not sure for now if we need these).
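Something along these lines, assuming typing_extensions is added as a dependency for the older Pythons:

import sys

if sys.version_info >= (3, 9):
    from typing import Annotated, get_args, get_origin
else:
    # typing_extensions backports Annotated and Annotated-aware
    # get_args/get_origin to Python 3.7 and 3.8.
    from typing_extensions import Annotated, get_args, get_origin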

Gallaecio commented 2 years ago

Take my feedback with a grain of salt. I feel quite unfamiliar with the use of overrides, I am not sure of what the best approaches should be for page object organization (when to subclass, when not to), and I have had to re-read this issue a few times and glance at the overrides documentation.

My thoughts:

- If PowerfulProductPage is not meant to be a ProductPage subclass, why would that be, specially if ProductPage is a “marker” class?
- But even without it, I would rather set overrides=ProductPage and get a page than use Annotated to get a product with a subset of fields.
- It is not clear to me that there is anything in approach 2 that cannot be achieved with approach 1, assuming approach 1 includes using overrides= where needed.
- Unrelated to the goals here: I wonder if we should work towards making a callback signature without response: DummyResponse eventually possible in Scrapy.

kmike commented 2 years ago

@BurnzZ

> You mentioned the following idea above has some challenges. Could you expand on the specific issues surrounding it? I was under the impression that this "reverse registry" works almost the same as the one with Approach 2 (but the pairs are reversed).

I'm trying to recall what that was, but can't :) So, likely you're right - the same challenges as with Approach 2.

> In any case, I think using typing.Annotated should be a step in the right direction. It's only available starting in Python 3.9 but we could get it in https://github.com/python/typing_extensions for 3.7 and 3.8. Although, with some slight limitations: "get_origin and get_args lack support for Annotated in Python 3.8" (not sure for now if we need these).

get_origin and get_args are not available in 3.7, so we'd need to have some backports anyways.

kmike commented 2 years ago

@Gallaecio

> If PowerfulProductPage is not meant to be a ProductPage subclass, why would that be, specially if ProductPage is a “marker” class?

Most likely, it's a ProductPage subclass. So far, we've been using exact classes everywhere (scrapy-poet, andi), without allowing subclasses. So, while possible, I think there would be challenges and edge cases which we have avoided so far. For example, a subclass may return a different item type, and this "implicit" binding to ProductPage could be wrong. But I'm not sure.

> But even without it, I would rather set overrides=ProductPage and get a page than use Annotated to get a product with a subset of fields.

We might need something like Annotated anyways. If a Product instance is returned by an HTTP API, and the HTTP API allows picking fields, how would you specify a Page Object dependency for that?

> It is not clear to me that there is anything in approach 2 that cannot be achieved with approach 1, assuming approach 1 includes using overrides= where needed.

This "where needed" is an issue. With the common syntax, you need to use overrides=... when you define a Page Object. But you only know whether it's needed when a Page Object is used in some spider. So, if we don't know yet how a Page Object is going to be used, we'd need to always use overrides=, which is the status quo.

This is solvable though - if I'm not mistaken, it should be possible to use registry.handle_urls as a function, not as a decorator. So, users can "register" page objects as supporting override of some base class - either one by one, or using some code to manipulate the registry. But I haven't checked that it all indeed works.
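Roughly like this (a sketch, not verified; ProductPage and ExamplePageObject are the classes from the earlier example, and the argument may be overrides= or instead_of= depending on the web-poet version):

from web_poet import default_registry

# A decorator is just a callable, so it should be possible to register an
# already-defined page object after the fact:
default_registry.handle_urls("example.com", instead_of=ProductPage)(ExamplePageObject)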

> Unrelated to the goals here: I wonder if we should work towards making a callback signature without response: DummyResponse eventually possible in Scrapy.

+1.

Gallaecio commented 2 years ago

> We might need something like Annotated anyways. If a Product instance is returned by an HTTP API, and the HTTP API allows picking fields, how would you specify a Page Object dependency for that?

I was thinking you would ask for the page object instead, and use page params. But I am not sure anymore.

My lack of love for Annotated is based on not liking long typing data in signatures, which is not a strong argument by any means. And given you can define types separately, i.e. define a new, short, readable type (e.g. ProductSubset) somewhere in your code with Annotated and use that short type in your signature, the argument becomes void.
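For example, with the hypothetical pick_fields helper from above:

from typing import Annotated  # or typing_extensions on Python < 3.9

ProductSubset = Annotated[Product, pick_fields(["price", "name"])]

def parse(self, response: DummyResponse, product: ProductSubset):
    ...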

> This "where needed" is an issue. With the common syntax, you need to use overrides=... when you define a Page Object. But you only know whether it's needed when a Page Object is used in some spider. So, if we don't know yet how a Page Object is going to be used, we'd need to always use overrides=, which is the status quo.

Oh, right, the page object and the spider may be defined by separate people on separate projects, which is the whole point of web-poet to begin with.

> This is solvable though - if I'm not mistaken, it should be possible to use registry.handle_urls as a function, not as a decorator. So, users can "register" page objects as supporting override of some base class - either one by one, or using some code to manipulate the registry. But I haven't checked that it all indeed works.

I see how this kind of defeats the benefits of approach 1.

kmike commented 2 years ago

After re-reading it:

  1. I think we should implement the item-based API, regardless of the decision about handle_urls - i.e. allow depending on items directly, without page object boilerplate, e.g. as proposed in https://github.com/scrapinghub/scrapy-autoextract/issues/23. It's such a cool API, which is easy to understand and easy to use; I can't think of a simpler API. Our ecosystem of libraries would be worse if we didn't do it.
  2. It seems that the problems with Approach 1 are solvable, as explained in the pros/cons section. The solutions don't look too bad. They may not be the easiest to implement, but e.g. using Annotated for picking only some fields would be required anyway.

So, now it's +0.75 from me for Approach 1, not +0.25 :)

kmike commented 1 year ago

We have it released.