BrunoHu commented 9 years ago

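I've been learning the Scrapy framework recently and found your example helpful, so I wrote a spider of my own in the simplest way I could, also scraping Tencent's job listings. The spider runs fine, but it never picks up the data from the first page. I then cloned your program to try it and found that it has the same problem, so I'd like to ask what is going wrong so we can both learn from it. Here is the main part of my code. When run, it scrapes the same 2000+ records, but the first page is always missing:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    # TencenthrItem comes from the project's items module (import not shown in the original post)

    class TencentSpider(CrawlSpider):
        name = "tencenthr"

        download_delay = 1

        allowed_domains = ["tencent.com"]
        start_urls = ["http://hr.tencent.com/position.php"]

        # Follow only the "next page" link and hand each followed page to parse_item
        rules = [
            Rule(LinkExtractor(allow=(r'/position.php\?&start=\d*#a',), restrict_xpaths=('//*[@id="next"]')), follow=True, callback='parse_item')
        ]

        def parse_item(self, response):
            self.logger.info('Now is spidering in this page:   %s', response.url)
            # One table row per job posting
            base = response.xpath('//div[@id="position"]/div[1]/table/tr[@class="even" or @class="odd"]')
            # Current page number, taken from the highlighted pagination link
            pages = response.xpath('//a[@class="active"]/text()').extract()
            for sel in base:
                item = TencenthrItem()
                item['work'] = sel.xpath('td[1]/a/text()').extract()
                item['worktype'] = sel.xpath('td[2]/text()').extract()
                item['number'] = sel.xpath('td[3]/text()').extract()
                item['location'] = sel.xpath('td[4]/text()').extract()
                item['date'] = sel.xpath('td[5]/text()').extract()
                item['page'] = pages
                yield item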

kevin1101 commented 8 years ago

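When the data is scraped, the pages are not crawled in strict order, and because of the page styling the records are not stored from top to bottom either. You could scrape again and then compare the results by hand.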
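
The manual comparison suggested above can also be done with a few lines of Python. This is only a sketch: it assumes each scraped record keeps the page number under a `page` key as a list of strings (as in the spider posted earlier), and the helper name `missing_pages` and the file name in the usage note are hypothetical.

```python
def missing_pages(items, last_page):
    """Return the page numbers in 1..last_page that no scraped item mentions."""
    seen = set()
    for item in items:
        values = item.get("page", [])
        # extract() returns a list of strings; tolerate a bare string too.
        if isinstance(values, str):
            values = [values]
        for value in values:
            text = value.strip()
            if text.isdigit():
                seen.add(int(text))
    return [page for page in range(1, last_page + 1) if page not in seen]
```

For example, after loading the exported records with `json.load` from a file such as `tencent.json` (an assumed name), calling `missing_pages(items, last_page)` lists any pages that never appear in the output, regardless of the order in which they were crawled.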

tangdouer commented 7 years ago

Did your run generate the tencent.json file?

maxliaops / scrapy-itzhaopin

First page's data is not scraped; let's discuss a fix #2