scrapinghub / scrapy-poet

Page Object pattern for Scrapy
BSD 3-Clause "New" or "Revised" License
119 stars 28 forks source link

Discussion for subdomain overrides #52

Closed BurnzZ closed 2 years ago

BurnzZ commented 3 years ago

Background

Currently, the @override_for decorator only works for top-level domains. However, there are some cases wherein a site has multiple subdomains, usually per country (where each has a slightly different layout). This is due to the current setup of get_domain() (ref).

The problem lies in the following workaround due to limitations of the current @override_for decorator interface:

@override_for("example.com", AutoExtractArticleListData)
@attr.s(auto_attribs=True)
class ExampleArticleListPage(AutoExtractItemWebPage):
    article_list_data: AutoExtractArticleListData

    def to_item(self) -> ArticleList:
        article_list = self.article_list_data.to_item()

        if "uk.example.com" in self.url:
            return self._handle_uk(article_list)

        elif "au.example.com" in self.url:
            return self._handle_uk(article_list)

        elif "fr.example.com" in self.url:
            return self._handle_fr(article_list)

        return article_list

Objectives

I'm wondering if it's worth it to update the current interface with something like this?

@override_for("example.com", AutoExtractArticleListData, subdomain="uk")
@override_for("example.com", AutoExtractArticleListData, subdomain="au")
@override_for("example.com", AutoExtractArticleListData, subdomain="fr")

Depending on the situation, this provides another option on how to structure the collection of PageObjects in the project.

ivanprado commented 3 years ago

I've proposed a change to default overrides configuration that allows configuring overrides for subdomains among other things. https://github.com/scrapinghub/scrapy-poet/pull/53 .

Once merged, it will allow:

@override_for("uk.example.com", AutoExtractArticleListData)
@override_for("au.example.com", AutoExtractArticleListData)
@override_for("fr.example.com", AutoExtractArticleListData)

It will allow other interesting combinations like:

@override_for("example.com", AutoExtractArticleListData)
@override_for("example.com/deals", AutoExtractArticleListData)

or

@override_for("example.com", AutoExtractArticleListData)
@override_for("ie.example.com", AutoExtractArticleListData)

Feel free to review the PR if you want and have the time: comments are always welcome :-)

BurnzZ commented 2 years ago

Closing this issue as obsolete since the new PR in https://github.com/scrapinghub/scrapy-poet/pull/56 will be able to handle subdomain overrides as it leverages the functionalities of https://github.com/zytedata/url-matcher.