realpython / book2-exercises

Book 2 -- Exercises for the book
168 stars 203 forks

Socrata scraping and crawling examples need to be updated (p. 482 et seq.) #80

Closed stonemirror closed 7 years ago

stonemirror commented 8 years ago

The pages returned at https://opendata.socrata.com/ no longer appear to be table-based; they're now piles of nested divs. The code as it stands does nothing, and I'm experimenting with XPath expressions to come up with the right ones to pull the values I want to get here...

stonemirror commented 8 years ago

Okay, here's a scraper for p. 483 that works, as a replacement for the current opendata.py. It may be sub-optimal; feedback's appreciated. Everything comes back from extract() as unicode strings in a list, the views value is buried in a bunch of whitespace, etc. Let me know if I'm doing anything bone-headed here, but my socrata.json file now contains the results I'd have expected.

from scrapy import Spider
from scrapy.selector import Selector

from socrata.items import SocrataItem

class OpendataSpider(Spider):
    name = "opendata"
    allowed_domains = ["opendata.socrata.com"]
    start_urls = (
        'https://opendata.socrata.com/',
    )

    def parse(self, response):
        # Each dataset listing is wrapped in a div with an itemscope attribute
        titles = Selector(response).xpath('//div[@itemscope="itemscope"]')
        for title in titles:
            item = SocrataItem()
            # extract() returns a list of unicode strings: take the first
            # match, drop any non-ASCII characters, and strip whitespace
            item["text"] = title.xpath('div/div[1]/div/div/h2/a/text()').extract()[0].encode('ascii', 'ignore').strip()
            item["url"] = title.xpath('div/div[1]/div/div/h2/a/@href').extract()[0].encode('ascii', 'ignore').strip()
            item["views"] = title.xpath('div/div[4]/div[2]/div[2]/text()').extract()[0].encode('ascii', 'ignore').strip()
            yield item
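As an aside, the repeated .extract()[0].encode('ascii', 'ignore').strip() chain could be factored into a small helper (hypothetical, not part of the book's code): extract() returns a list of unicode strings, so the helper takes the first match, drops non-ASCII characters, and strips surrounding whitespace.

```python
# Hypothetical helper (not in the book's code) factoring out the
# .extract()[0].encode('ascii', 'ignore').strip() chain used above.
def clean_first(selector_list):
    """Return the first extracted value, ASCII-encoded and stripped, or None."""
    values = selector_list.extract()
    if not values:
        return None  # guard against an XPath that matches nothing
    return values[0].encode('ascii', 'ignore').strip()
```

Each assignment above would then read like item["text"] = clean_first(title.xpath('div/div[1]/div/div/h2/a/text()')), which also avoids an IndexError on pages where an XPath doesn't match.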
stonemirror commented 8 years ago

Okay, it's still running (and I guess it will be for a while, as you say), but here's the code I arrived at to update the crawler for Socrata. Note that there seems to be an error in the original code: the parse() method is being overridden on the CrawlSpider class (which the documentation says is a bad idea), while the parse_item() method is specified as the callback in the rule.

Until I figured out this discrepancy, the spider only scraped a single page and then stopped. Now, it seems to be chugging along, so I'll just let it run and add a note here when/if it finishes correctly...
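The failure mode can be illustrated without Scrapy at all; it's ordinary method shadowing (the class and method names below are illustrative stand-ins, not Scrapy code):

```python
# CrawlSpider's own parse() is what applies the rules and schedules the
# next pages; a subclass that defines parse() shadows that machinery.
class Base:                # stands in for CrawlSpider
    def parse(self):
        return "follows rules, schedules more pages"

class Broken(Base):
    def parse(self):       # shadows Base.parse -> only one page gets scraped
        return "scrapes one page, then stops"

class Working(Base):
    def parse_item(self):  # separate callback, named in the Rule
        return "scrapes items"
```

With the Working pattern, the base class's parse() is still the method that runs for each downloaded page, so the rules keep firing and the crawl continues.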

This runs from the command line with scrapy crawl opendatacrawl.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from socrata.items import SocrataItem

class OpendataSpider(CrawlSpider):
    name = "opendatacrawl"
    allowed_domains = ["opendata.socrata.com"]
    start_urls = (
        'https://opendata.socrata.com/',
    )
    rules = [
        # Follow the paginated browse links; a raw string keeps the
        # backslash escapes in the regex intact
        Rule(LinkExtractor(allow=r'browse\?utf8=%E2%9C%93&page=\d*'), callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        # Same extraction logic as the single-page spider: each dataset
        # listing is wrapped in a div with an itemscope attribute
        titles = response.xpath('//div[@itemscope="itemscope"]')
        for title in titles:
            item = SocrataItem()
            item["text"] = title.xpath('div/div[1]/div/div/h2/a/text()').extract()[0].encode('ascii', 'ignore').strip()
            item["url"] = title.xpath('div/div[1]/div/div/h2/a/@href').extract()[0].encode('ascii', 'ignore').strip()
            item["views"] = title.xpath('div/div[4]/div[2]/div[2]/text()').extract()[0].encode('ascii', 'ignore').strip()
            yield item
stonemirror commented 8 years ago

Yeah, that works. Took on the order of three hours to run, went through 826 pages, scraped 8246 entries into my database...

mjhea0 commented 8 years ago

@stonemirror Just getting up to speed. Did you have to update any of the code?

stonemirror commented 8 years ago

Yes, @mjhea0, my changes to opendata.py and opendata_crawl.py are in my comments above. The xpaths are more than likely suboptimal or over-specified, but they work. I'm not sure what your specific intent with the original was, but this all produces sensible results.

mjhea0 commented 8 years ago

Thanks, @stonemirror. The course is being updated now. Should get to that chapter in the next few days. Thanks again!!

mjhea0 commented 7 years ago

Updated in v2.0 of the courses, coming the week of 12/19/2016. https://github.com/realpython/about/blob/master/changelog.csv