scrapinghub / scrapy-poet

Page Object pattern for Scrapy
BSD 3-Clause "New" or "Revised" License

Hundreds of websites? #162

Closed · lironesamoun closed this issue 8 months ago

lironesamoun commented 1 year ago

I've gone through all the scrapy-poet documentation and it's really interesting. I understand the separation of the crawling part from the extraction part.

When there are only a few sites to scrape, I understand that you have to define them as input.

In my case, I want to create a spider that scrapes a certain type of information from many websites. Each of these sites may have a different structure, but that's okay, because I've already built something that gives me the information I want.

However, how do I handle the URL inputs if I have a hundred websites to crawl and scrape?

This is part of the doc:

import scrapy
from scrapy_poet import callback_for
from web_poet import ApplyRule

# BookListPage, BookPage and the BTS*/BP* page objects are defined elsewhere
# in the docs example.


class BooksSpider(scrapy.Spider):
    name = "books_04_overrides_02"
    # Configuring different page objects for different domains
    custom_settings = {
        "SCRAPY_POET_RULES": [
            ApplyRule("toscrape.com", use=BTSBookListPage, instead_of=BookListPage),
            ApplyRule("toscrape.com", use=BTSBookPage, instead_of=BookPage),
            ApplyRule("bookpage.com", use=BPBookListPage, instead_of=BookListPage),
            ApplyRule("bookpage.com", use=BPBookPage, instead_of=BookPage),
        ]
    }

    def start_requests(self):
        for url in ["http://books.toscrape.com/", "https://bookpage.com/reviews"]:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response, page: BookListPage):
        yield from response.follow_all(page.book_urls(), callback_for(BookPage))

I don't see myself listing my hundred websites there. Ideally, I would like to configure my spider to take the URLs from my DB.
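
Roughly, this is what I have in mind (a minimal sketch; urls_from_db() is a hypothetical helper that would read the start URLs from my database):

import scrapy


class ManySitesSpider(scrapy.Spider):
    name = "many_sites"

    def start_requests(self):
        # urls_from_db() is hypothetical: it would yield the start URL of
        # each of the ~100 websites stored in my database.
        for url in urls_from_db():
            yield scrapy.Request(url, callback=self.parse)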

I also took a look at the registry, but it's the same problem: I need to define the decorator with the URL to scrape, which I don't know in advance.

Is there anything I've missed? Do you have any insights for me?

Gallaecio commented 1 year ago

You can use an empty string (see URL patterns in the web-poet docs) to define a catch-all page object rule. See also the rule precedence docs (your scenario seems in line with the example mentioned in the 2nd paragraph).
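
For example, the custom_settings from your snippet could become something like this (a sketch; GenericBookPage is a placeholder for your own generic page object, and BookPage is the input class your callbacks request):

from web_poet import ApplyRule

custom_settings = {
    "SCRAPY_POET_RULES": [
        # An empty URL pattern matches every URL, so this acts as a catch-all rule.
        ApplyRule("", use=GenericBookPage, instead_of=BookPage),
    ]
}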

If this catch-all approach is not ideal, e.g. maybe you have 2 page objects each targeting a large number of websites, you can register the page object URLs through the add_rule method of the global rule registry instead of relying on the @handle_urls decorator.
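
Something along these lines (a sketch; rules_from_db() is a placeholder for your own code that reads (URL pattern, page object class) pairs from the database):

from web_poet import ApplyRule, default_registry

# rules_from_db() is a placeholder: it would return (URL pattern, page object
# class) pairs stored in your database. instead_of=BookPage assumes the same
# BookPage input as in your spider example.
for pattern, page_object in rules_from_db():
    default_registry.add_rule(
        ApplyRule(pattern, use=page_object, instead_of=BookPage)
    )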

I am not sure what the best place would be for the code that reads the rules from the database and loads them into the registry. Maybe the __init__ method of a custom extension?
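
A rough sketch of such an extension (load_rules() and the RULES_DB_URI setting are placeholders, not existing APIs):

from web_poet import ApplyRule, default_registry


class RulesFromDBExtension:
    """Reads (URL pattern, page object) pairs from a database at startup and
    registers them in web-poet's global rule registry."""

    def __init__(self, db_uri):
        # load_rules() is a placeholder for whatever reads your database.
        for pattern, page_object in load_rules(db_uri):
            default_registry.add_rule(ApplyRule(pattern, use=page_object))

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get("RULES_DB_URI"))

You would then enable it through Scrapy's EXTENSIONS setting, e.g. EXTENSIONS = {"myproject.extensions.RulesFromDBExtension": 100}.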

lironesamoun commented 1 year ago

Thanks for your answer! So according to you, I could instantiate rules on the fly for each item from my database? That sounds interesting. I need to explore that.