scrapinghub / web-poet

Web scraping Page Objects core library
https://web-poet.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Proposal: Utility functions that interact with the rules #40

Open BurnzZ opened 2 years ago

BurnzZ commented 2 years ago

Background

Following the acceptance of https://github.com/scrapinghub/web-poet/pull/27, developers can now use the @handle_urls decorator to declare which Page Objects should be used for given URL patterns (reference code).

Problem

For large code bases, there might be hundreds of Page Objects, which in turn could result in hundreds of OverrideRules created via the @handle_urls annotation.

This can become unwieldy, especially when the rules are spread across many subpackages and submodules within a Page Object project. A project could also use Page Objects from external packages, making the tree of rules even deeper.

Moreover, overlapping rules (e.g. POs improving on older POs) could add another layer of complexity. It should be immediately clear which PO would be used for a given URL, based on the URL patterns and priorities, as illustrated below.
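
As a small, self-contained illustration of overlapping rules (the class names are hypothetical; the decorator keywords follow the web-poet API at the time of this issue, where a higher priority value is meant to win when several rules match the same URL):

from web_poet import ItemPage, handle_urls


class ProductPage(ItemPage):
    """Generic product Page Object that projects depend on by default."""


@handle_urls("example.com", overrides=ProductPage)
class ExampleProductPage(ItemPage):
    """Original Page Object for example.com, registered with the default priority of 500."""


@handle_urls("example.com", overrides=ProductPage, priority=600)
class ImprovedExampleProductPage(ItemPage):
    """Improves on ExampleProductPage; its higher priority means it should be
    the one used for example.com URLs."""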

Idea

There should be a collection of utility functions for interacting with the List[OverrideRule] retrieved from the registry. Suppose that we have:

from web_poet import default_registry, consume_modules

consume_modules("my_page_objects", "some_other_project", "another_project")
rules = default_registry.get_overrides()
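
For reference, each item in rules is an OverrideRule; a rough sketch of the information it carries (based on the dataclass fields at the time of writing, which may change):

for rule in rules:
    # rule.for_patterns: a url_matcher.Patterns with the include/exclude URL
    #                    patterns and the rule's priority
    # rule.use:          the Page Object class to apply
    # rule.instead_of:   the Page Object class being overridden
    # rule.meta:         extra keyword arguments passed to @handle_urls
    print(rule.for_patterns, rule.use, rule.instead_of, rule.meta)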

We could then have something like:

from web_poet import rule_match

# Explore which OverrideRules match a given URL.
rule_match.find(rules, url="https://example.com/product/electronics?id=123")
# Returns: [OverrideRule_1, OverrideRule_2, OverrideRule_3, OverrideRule_4]

# The search could also be narrowed down by the Page Object being overridden.
rule_match.find(rules, url="https://example.com/product/electronics?id=123", overridden=ProductPage)
# Returns: [OverrideRule_2, OverrideRule_4]

# Finding the rules for a given set of criteria could return multiple OverrideRules,
# e.g. POs improving on older POs, which in turn improve on other POs.

# However, what we ultimately want is the final rule, i.e. the one with the highest priority.
rule_match.final(rules, url="https://example.com/product/electronics?id=123", overridden=ProductPage)
# Returns: OverrideRule_2
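
None of this exists yet; below is a rough sketch of how find() and final() could be implemented on top of url_matcher (which web-poet already depends on), assuming the OverrideRule fields for_patterns and instead_of shown earlier:

from typing import Iterable, List, Optional, Type

from url_matcher import URLMatcher

from web_poet import OverrideRule


def find(
    rules: Iterable[OverrideRule], url: str, overridden: Optional[Type] = None
) -> List[OverrideRule]:
    """Return every rule whose URL patterns match the given URL, optionally
    keeping only the rules that override ``overridden``."""
    matches = []
    for rule in rules:
        if overridden is not None and rule.instead_of is not overridden:
            continue
        # Check this rule's patterns in isolation.
        matcher = URLMatcher()
        matcher.add_or_update(0, rule.for_patterns)
        if matcher.match(url) is not None:
            matches.append(rule)
    return matches


def final(
    rules: Iterable[OverrideRule], url: str, overridden: Optional[Type] = None
) -> Optional[OverrideRule]:
    """Return the single best matching rule, letting URLMatcher break ties by
    pattern priority and specificity."""
    candidates = find(rules, url, overridden=overridden)
    matcher = URLMatcher()
    for index, rule in enumerate(candidates):
        matcher.add_or_update(index, rule.for_patterns)
    best = matcher.match(url)
    return None if best is None else candidates[best]

Delegating the "which rule wins" decision to URLMatcher should keep the results consistent with how scrapy-poet resolves overrides at runtime, which, as far as I know, is built on the same matcher.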

This could also help when creating test suites in projects that use other Page Object projects:

assert ImprovedProductPage == rule_match.final(
    rules, "https://example.com/product/electronics?id=123", overridden=ProductPage
).use

Other Notes:

kmike commented 2 years ago

I think that's a good idea, but it would probably make sense to wait a bit until a real-world use case pops up. Then we can think about how to help solve it.

Gallaecio commented 7 months ago

For the stated issue, I wonder if an opt-in setting in scrapy-poet that enables logging a debug message indicating which page object is used for any given URL and requested output, and why, could do the trick.
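
Purely as an illustration of that suggestion (the setting name and message format below are hypothetical, not an existing scrapy-poet feature):

import logging

logger = logging.getLogger("scrapy_poet.rules")

# Hypothetical: a message scrapy-poet could emit when it picks a page object,
# guarded by an opt-in setting such as SCRAPY_POET_RULE_DEBUG = True.
logger.debug(
    "Using %s instead of %s for %s (matched pattern %r, priority %d)",
    "ImprovedProductPage",
    "ProductPage",
    "https://example.com/product/electronics?id=123",
    "example.com",
    600,
)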