The structure of the pages returned at https://opendata.socrata.com/ no longer appears to be table-based; the pages are now piles of nested divs. The code as it stands does nothing, and I'm messing around with XPaths to try to come up with the right expressions to pull the values I want to get here...
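One way to iterate on candidate expressions is scrapy shell, which hands you the live response object to poke at interactively; for instance:

$ scrapy shell 'https://opendata.socrata.com/'
>>> response.xpath('//div[@itemscope="itemscope"]')
>>> response.xpath('//div[@itemscope="itemscope"]//h2/a/text()').extract()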
Okay, here's a scraper for p. 483 that works, as a replacement for the current opendata.py. It may be sub-optimal, so feedback's appreciated: everything comes back from extract() as unicode in a list, the views value sits in the middle of a bunch of whitespace, etc. Let me know if I'm doing anything bone-headed here, but my socrata.json file now contains the results I'd have expected.
from scrapy import Spider
from scrapy.selector import Selector

from socrata.items import SocrataItem


class OpendataSpider(Spider):
    name = "opendata"
    allowed_domains = ["opendata.socrata.com"]
    start_urls = (
        'https://opendata.socrata.com/',
    )

    def parse(self, response):
        # Each dataset listing is wrapped in a div carrying an itemscope attribute
        titles = Selector(response).xpath('//div[@itemscope="itemscope"]')
        for title in titles:
            item = SocrataItem()
            # extract() returns a list of unicode strings, hence the [0];
            # encode()/strip() drop non-ASCII noise and surrounding whitespace
            item["text"] = title.xpath('div/div[1]/div/div/h2/a/text()').extract()[0].encode('ascii', 'ignore').strip()
            item["url"] = title.xpath('div/div[1]/div/div/h2/a/@href').extract()[0].encode('ascii', 'ignore').strip()
            item["views"] = title.xpath('div/div[4]/div[2]/div[2]/text()').extract()[0].encode('ascii', 'ignore').strip()
            yield item
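For reference, the SocrataItem the spider imports only needs the three fields set above; a minimal socrata/items.py would look something like this (fields inferred from the spider, so treat it as a sketch):

# socrata/items.py -- minimal sketch; fields inferred from the spider above
from scrapy import Item, Field


class SocrataItem(Item):
    text = Field()   # dataset title
    url = Field()    # link to the dataset page
    views = Field()  # view count shown in the listing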
Okay, it's still running (and I guess it will be for a while, as you say), but here's the code I arrived at to update the crawler for Socrata. Note that there seems to be an error in the original code: the parse() method is being overridden on the CrawlSpider subclass (which the documentation says is a bad idea), while parse_item() is specified as the callback in the rule. Until I figured out this discrepancy, the spider only scraped a single page and then stopped. Now it seems to be chugging along, so I'll just let it run and add a note here when/if it finishes correctly...
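To spell out the failure mode: CrawlSpider routes responses through its own parse() method to apply the rules, so shadowing parse() switches the Rule machinery off entirely. A stripped-down sketch of the anti-pattern (hypothetical spider, just to illustrate):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BrokenSpider(CrawlSpider):
    name = "broken"
    start_urls = ('https://example.com/',)
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse(self, response):
        # WRONG: this shadows CrawlSpider.parse, which is what applies the
        # rules above; no links get followed, so the spider handles the
        # start URL and then stops
        pass

    def parse_item(self, response):
        pass  # never reached while parse() is overridden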
This runs at the command line with scrapy crawl opendatacrawl.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from socrata.items import SocrataItem


class OpendataSpider(CrawlSpider):
    name = "opendatacrawl"
    allowed_domains = ["opendata.socrata.com"]
    start_urls = (
        'https://opendata.socrata.com/',
    )

    # Follow the paginated browse?...&page=N links; the callback is
    # parse_item, not parse, which CrawlSpider reserves for its own logic
    rules = [
        Rule(LinkExtractor(allow=r'browse\?utf8=%E2%9C%93&page=\d*'),
             callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        titles = response.xpath('//div[@itemscope="itemscope"]')
        for title in titles:
            item = SocrataItem()
            item["text"] = title.xpath('div/div[1]/div/div/h2/a/text()').extract()[0].encode('ascii', 'ignore').strip()
            item["url"] = title.xpath('div/div[1]/div/div/h2/a/@href').extract()[0].encode('ascii', 'ignore').strip()
            item["views"] = title.xpath('div/div[4]/div[2]/div[2]/text()').extract()[0].encode('ascii', 'ignore').strip()
            yield item
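If you want the items written out as the crawl runs, Scrapy's built-in feed export handles it from the same command line, e.g. scrapy crawl opendatacrawl -o socrata.json, which produces the same sort of socrata.json file as the first spider.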
Yeah, that works. Took on the order of three hours to run, went through 826 pages, scraped 8246 entries into my database...
@stonemirror Just getting up to speed. Did you have to update any of the code?
Yes, @mjhea0, my changes to opendata.py and opendata_crawl.py are in the comments above. The XPaths are more than likely suboptimal or over-specified, but they work. I'm not sure what your specific intent with the original was, but this all produces sensible results.
Thanks, @stonemirror. The course is being updated now. Should get to that chapter in the next few days. Thanks again!!
Updated in v 2.0 of the courses. Coming week of 12/19/2016. https://github.com/realpython/about/blob/master/changelog.csv