lironesamoun closed this issue 8 months ago
You can use an empty string (see URL patterns in the web-poet docs) to define a catch-all page object rule. See also the rule precedence docs (your scenario seems in line with the example mentioned in the 2nd paragraph).
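To illustrate the precedence idea in plain Python (this is a simplified stand-in, not web-poet's actual matching logic, which uses real URL patterns): when several rules match a URL, the more specific pattern wins, and the empty pattern acts as a catch-all that only applies when nothing more specific matches.

```python
# Simplified stand-in for rule precedence: an exact domain match beats
# the catch-all entry, and "" (empty pattern) matches any URL.
from urllib.parse import urlparse

def pick_rule(url, rules):
    """rules: dict mapping a domain pattern ("" = catch-all) to a page object name."""
    netloc = urlparse(url).netloc
    if netloc in rules:        # prefer a site-specific rule
        return rules[netloc]
    return rules.get("")       # fall back to the catch-all rule

rules = {
    "": "GenericPage",             # catch-all page object
    "example.com": "ExamplePage",  # site-specific page object
}
```

With these rules, a URL on `example.com` resolves to `ExamplePage`, while any other domain falls back to `GenericPage`.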
If this catch-all approach is not ideal, e.g. if you have 2 page objects each targeting a large number of websites, you can register the page object URLs through the add_rule method of the global rule registry instead of relying on the @handle_urls decorator.
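The shape of that approach, sketched with a minimal stand-in registry (the real one is web-poet's rule registry with its own `add_rule` signature; the database rows here are hypothetical):

```python
# Simplified stand-in for a rule registry with an add_rule() method,
# showing how page object rules can be registered at runtime (e.g. from
# database rows) instead of being hard-coded in @handle_urls decorators.
class RulesRegistry:
    def __init__(self):
        self.rules = []

    def add_rule(self, url_pattern, page_object):
        self.rules.append((url_pattern, page_object))

registry = RulesRegistry()

# Hypothetical rows read from a database: (url_pattern, page_object_name).
db_rows = [
    ("shop-a.example", "ProductPageA"),
    ("shop-b.example", "ProductPageB"),
]
for pattern, page_object in db_rows:
    registry.add_rule(pattern, page_object)
```

The key point is that the loop runs at startup, so the set of rules can be as large and as dynamic as the database allows.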
I am not sure what the best location would be for the code that reads from the database into the registry. Maybe the __init__ method of a custom extension?
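A rough sketch of that extension idea, assuming the Scrapy extension protocol (`from_crawler` plus settings); `load_rules_from_db`, the setting name, and the registry object are all hypothetical stand-ins, and a real extension would be enabled via the EXTENSIONS setting and call the actual registry's add_rule:

```python
# Sketch of a Scrapy-style extension whose __init__ loads URL rules from
# a database when the crawler starts. Written without importing scrapy so
# it stays self-contained; the shapes mirror Scrapy's extension protocol.
class RuleLoaderExtension:
    def __init__(self, db_uri, registry):
        self.registry = registry
        for pattern, page_object in self.load_rules_from_db(db_uri):
            # Stand-in for calling the real registry's add_rule().
            registry.append((pattern, page_object))

    @classmethod
    def from_crawler(cls, crawler, registry):
        # Scrapy builds extensions through from_crawler(); configuration
        # comes from the crawler's settings.
        return cls(crawler.settings["RULES_DB_URI"], registry)

    @staticmethod
    def load_rules_from_db(db_uri):
        # Hypothetical: replace with a real database query.
        return [("example.com", "ExamplePage")]
```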
Thanks for your answer! So, according to you, I could instantiate rules on the fly for each item from my database? That sounds interesting. I need to explore that.
I've gone through all the scrapy-poet documentation and it's really interesting. I really understand the separation of the crawling part from the extraction part.
When there are a few sites to scrape, I understand that you have to define the sites as input.
In my case, I want to create a spider that scrapes a certain type of information from each website. Each of these sites may have a different structure, but that's okay, because I've managed to write something that gives me the information I want.
However, how do I deal with URL inputs if I have a hundred websites to crawl and scrape?
This is part of the doc:
I don't see myself putting my hundred websites there. Ideally, I would like to configure my spider to take the URLs from my DB.
I also took a look at the registry, but it's the same problem: I need to define the decorator with the URLs to scrape, which I don't know in advance.
Is there anything I've missed? Do you have any insights for me?
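For the "take the URLs from my DB" part specifically, one way to sketch it (assuming a hypothetical `sites` table with a `url` column, using only the standard library):

```python
import sqlite3

def load_start_urls(db_path):
    # Hypothetical schema: a `sites` table with a `url` column.
    with sqlite3.connect(db_path) as conn:
        return [row[0] for row in conn.execute("SELECT url FROM sites")]

# In a Scrapy spider this could feed start_requests(), e.g.:
#     def start_requests(self):
#         for url in load_start_urls("sites.db"):
#             yield scrapy.Request(url)
```

This keeps the spider code free of hard-coded URLs; the database becomes the single source of truth for which sites to crawl.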