Good observation. Generally I think it's easier to start with the most basic `scrapy.Spider`, and as you get more familiar with the page and the rules you are hand-coding, look to other spider types to see if you can leverage extended functionality. I have not spent a whole lot of time on the site (thank god), but from the looks of it, the way you have it set up makes sense. We can just start at `/discover`, manually identify links to individual bounties via selectors in `collect_bounties`, and then feed those URLs to `parse_bounty`.
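To make that concrete, here's a rough sketch of that flow with a plain `scrapy.Spider`. The start URL is pulled from the discussion, but the CSS selectors are placeholders I made up, not the real page structure:

```python
import scrapy


class WeSearchrSpider(scrapy.Spider):
    name = 'wesearchr'
    start_urls = ['https://www.wesearchr.com/discover']  # assumed entry point

    def start_requests(self):
        # Route the start URL to collect_bounties instead of the default parse()
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.collect_bounties)

    def collect_bounties(self, response):
        # Hand-coded selector for the bounty links on /discover -- placeholder only
        for href in response.css('a.bounty::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_bounty)

    def parse_bounty(self, response):
        # Pull the fields for one bounty; these selectors are also placeholders
        yield {
            'url': response.url,
            'title': response.css('h1::text').extract_first(),
        }
```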
Great point, @bstarling! It could definitely be better to learn with the default spider and work towards using the extended ones if this is someone's first time with Scrapy.
For this ticket, we should be able to run the spider via `scrapy crawl <spider name> -o <file>.json`, so that we use Scrapy's native pipeline and just export a file, as Ben mentioned on the pipeline ticket.
I also like breaking this into two tasks, as those who want to work with Scrapy and learn can get a shot at it:

- `collect_bounties` on the `/discover` page, churning those out into a JSON file of bounty links
- `parse_bounty`, which will actually use the defined `Item` to collect data from the resolved links

Note: You won't have to implement an `Item` in Scrapy just for the links. See here to get an idea how to go about doing that. You essentially just return dictionaries.
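In other words, for the link-collection task you can yield plain dicts and let the `-o` feed export write them out. Something like this sketch (the selector is a placeholder):

```python
def collect_bounties(self, response):
    # No Item class needed for the links file: yield plain dicts and
    # `scrapy crawl wesearchr -o bounty_links.json` will serialize them.
    for href in response.css('a.bounty::attr(href)').extract():  # placeholder selector
        yield {'bounty_url': response.urljoin(href)}
```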
Also: in case anyone didn't see it in the README, Scrapy has a really great shell for figuring out extraction functionality.
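For example, a quick session might look like this (the URL is just an assumed example):

```python
# Started from the terminal with:
#   scrapy shell 'https://www.wesearchr.com/discover'
# Inside the shell, `response` is already loaded, so you can try selectors interactively:
response.css('a::attr(href)').extract()        # every href on the page
response.xpath('//h1/text()').extract_first()  # test an XPath the same way
```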
For this, you'll want to read up on the `CrawlSpider`. This allows you to define rules for extracting links from pages to be resolved and parsed.
My initial commit uses the base `scrapy.Spider` as the basis for the WeSearchr spider, but the `CrawlSpider` would allow us to have a more flexible solution under any changes to site page structure. The Dublin Council Crawler is a good D4D example using the `CrawlSpider` class.
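For comparison, a rough sketch of a `CrawlSpider` version could look like the following. The domain, the `/bounties/` URL pattern, and the selectors are assumptions for illustration, not the actual site structure or the code in my commit:

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class WeSearchrCrawlSpider(CrawlSpider):
    name = 'wesearchr_crawl'
    allowed_domains = ['wesearchr.com']                   # assumption
    start_urls = ['https://www.wesearchr.com/discover']   # per the discussion above

    rules = (
        # Follow anything that looks like an individual bounty page and hand
        # it to parse_bounty; the '/bounties/' pattern is a guess.
        Rule(LinkExtractor(allow=r'/bounties/'), callback='parse_bounty'),
    )

    def parse_bounty(self, response):
        # Populate whatever fields the project's Item defines; these
        # selectors are placeholders, not the real markup.
        yield {
            'url': response.url,
            'title': response.css('h1::text').extract_first(),
        }
```

The upside of the rule-based approach is that only the `LinkExtractor` pattern needs updating if the site's link layout changes, rather than the hand-coded selectors in `collect_bounties`.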