josephpd3 / wesearchr-spider

Scrapy project for crawling the alt-right bounty crowd-sourcing site, WeSearchr

Finish Spider to Crawl Bounties #1

Open josephpd3 opened 7 years ago

josephpd3 commented 7 years ago

For this, you'll want to read up on the CrawlSpider. This allows you to define rules for extracting links from pages to be resolved and parsed.

My initial commit uses the base scrapy.Spider as the basis for the WeSearchr spider, but the CrawlSpider would give us a solution that is more resilient to changes in the site's page structure.

The Dublin Council Crawler is a good D4D example using the CrawlSpider class.
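For reference, here's a rough sketch of what a CrawlSpider version could look like. The URL patterns and selectors are assumptions for illustration, not the site's confirmed structure:

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class WeSearchrCrawlSpider(CrawlSpider):
    name = 'wesearchr'
    allowed_domains = ['wesearchr.com']
    start_urls = ['https://www.wesearchr.com/discover']

    rules = (
        # Follow discovery/pagination pages without parsing them.
        Rule(LinkExtractor(allow=r'/discover')),
        # Hand anything that looks like an individual bounty to the callback.
        # The /bounty/ pattern is a guess; check the real URLs first.
        Rule(LinkExtractor(allow=r'/bounty/'), callback='parse_bounty'),
    )

    def parse_bounty(self, response):
        # Placeholder selectors; inspect the live markup in the Scrapy shell.
        yield {
            'url': response.url,
            'title': response.css('h1::text').extract_first(),
        }
```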

bstarling commented 7 years ago

Good observation. Generally I think it's easier to start with the most basic scrapy.Spider, and as you get more familiar with the page and the rules you are hand-coding, look to other spider types to see if you can leverage their extended functionality. I have not spent a whole lot of time on the site (thank god), but from the looks of it, the way you have it set up makes sense. We can just start at /discover, manually identify links to individual bounties via selectors in collect_bounties, and then feed those URLs to parse_bounty, roughly as sketched below.
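Something like this with the base scrapy.Spider (the bounty-link selector is a placeholder to verify against the live page):

```python
import scrapy


class WeSearchrSpider(scrapy.Spider):
    name = 'wesearchr'

    def start_requests(self):
        yield scrapy.Request('https://www.wesearchr.com/discover',
                             callback=self.collect_bounties)

    def collect_bounties(self, response):
        # Hand-coded selector for bounty links; the CSS class here is a
        # placeholder to be confirmed in the Scrapy shell.
        for href in response.css('a.bounty-link::attr(href)').extract():
            yield response.follow(href, callback=self.parse_bounty)

    def parse_bounty(self, response):
        # No Item class required; a plain dict works with feed exports.
        yield {'url': response.url}
```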

josephpd3 commented 7 years ago

Great point, @bstarling! It could definitely be better to learn with the default spider and work towards using the extended ones if this is someone's first time with Scrapy.

For this ticket, we should be able to run the spider via scrapy crawl <spider name> -o <file>.json so that we use Scrapy's native export pipeline and just write out a file, as Ben mentioned on the pipeline ticket.
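For example, assuming the spider is registered under the name wesearchr:

```
scrapy crawl wesearchr -o bounties.json
```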

I also like breaking this into two tasks, collecting the bounty links and parsing the individual bounties, so those who want to work with Scrapy and learn can get a shot at it.

Note: You won't have to implement an Item in Scrapy just for the links. See here to get an idea of how to go about it. You essentially just return dictionaries.

Also: if anyone didn't see it in the README, Scrapy has a really great shell for figuring out extraction logic.
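e.g., to poke at the discovery page interactively (URL assumed as above):

```
$ scrapy shell 'https://www.wesearchr.com/discover'
>>> response.css('a::attr(href)').extract()
```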