scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Using request callback in pipeline does not seem to work #3185

Open fabrepe opened 6 years ago

fabrepe commented 6 years ago

I am using a custom FilesPipeline to download pdf files. The input item embeds a pdfLink attribute that points to the wrapper page of the pdf. The pdf itself is embedded as an iframe in the page given by the pdfLink attribute.

I then built the following pipeline:

import logging

import scrapy
from scrapy.pipelines.files import FilesPipeline

class PdfPipeline(FilesPipeline):
    def get_media_requests(self, item, spider):
        # Request the wrapper page, attaching a callback to it.
        yield scrapy.Request(item['pdfLink'],
                             callback=self.get_pdfurl)

    def get_pdfurl(self, response):
        logging.info('...............')
        print(response.url)
        # Follow the iframe to the real pdf.
        yield scrapy.Request(response.css('iframe::attr(src)').extract()[0])

However, the get_pdfurl callback does not seem to be triggered:

  1. neither the log message nor the print output ever appears
  2. the file that gets stored is the source code (html) of the wrapper page located at item['pdfLink']

Is it actually possible to use a Request callback in pipelines? Am I doing something wrong?

grammy-jiang commented 6 years ago

Hi, @fabrepe ,

In Scrapy, the pipeline is designed to only deal with items coming from spiders: saving the items, cleaning the items, dropping the items, etc. No further requests can be sent from a pipeline; you can refer to the architecture of Scrapy here: Architecture overview — Scrapy 1.5.0 documentation. For the deeper reasons, you could read the source code and see how differently the spider, downloader and pipelines are driven.

For your question, you should parse the URL first in the spider, then yield an item containing the URL of the pdf (see the sketch at the end of this comment). Read the relevant documentation carefully; it is very helpful!

PS, remember: the data flow between the other components and the pipelines is one-directional; items only flow into pipelines, and nothing can be fed back from them (unless you use signals, which is another huge topic).
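As a minimal sketch of that spider-side approach (the spider name, start URL and link selector are made up for illustration, keeping the field names from this thread):

import scrapy

class PdfSpider(scrapy.Spider):
    name = 'pdf'
    start_urls = ['http://example.com/documents']

    def parse(self, response):
        # Visit each wrapper page from the spider, where callbacks work.
        for href in response.css('a.pdf-link::attr(href)').extract():
            yield response.follow(href, callback=self.parse_wrapper)

    def parse_wrapper(self, response):
        # The wrapper embeds the real pdf in an iframe; resolve it here
        # and hand the final URL to the pipeline through the item.
        pdf_url = response.css('iframe::attr(src)').extract_first()
        yield {'pdfUrl': response.urljoin(pdf_url)}

With the wrapper already resolved in the spider, the FilesPipeline subclass above only needs to request item['pdfUrl'] in get_media_requests, without any callback.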

fabrepe commented 6 years ago

Hi @grammy-jiang, and thank you very much for your response and for pointing out the architecture. This led me to review, one more time :), the documentation, and especially the Item Pipeline example "Take screenshot of item".

I then found a workaround using two consecutive pipelines:

import scrapy
from scrapy.pipelines.files import FilesPipeline

class PdfWrapperPipeline(object):
    def process_item(self, item, spider):
        # Download the wrapper page through the engine and chain the
        # item onto the resulting deferred.
        request = scrapy.Request(item.get('pdfLink'))
        dfd = spider.crawler.engine.download(request, spider)
        dfd.addBoth(self.return_item, item)
        return dfd

    def return_item(self, response, item):
        if response.status != 200:
            # Error happened, return item without the pdf url.
            return item

        # Resolve the real pdf url from the iframe of the wrapper page.
        url = response.css('iframe::attr(src)').extract()[0]
        item['pdfUrl'] = url
        return item

class PdfPipeline(FilesPipeline):
    def get_media_requests(self, item, spider):
        yield scrapy.Request(item.get('pdfUrl'))

and then, in settings.py, gave the wrapper pipeline a smaller order value than the pdf pipeline, since pipelines with lower values run first:

ITEM_PIPELINES = {
    'project.pipelines.PdfWrapperPipeline': 1,
    'project.pipelines.PdfPipeline': 2,
}

grammy-jiang commented 6 years ago

Hi, @fabrepe ,

Your solution is very impressive...

whypro commented 6 years ago

Why not process the pdf link in the spider and only download the real pdf in the pipeline?
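For reference, that route needs no custom pipeline code at all; a minimal sketch with the stock FilesPipeline (the store path and callback name are placeholders) could look like this:

# settings.py
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = '/tmp/pdfs'

# in the spider callback that already parsed the wrapper page
def parse_wrapper(self, response):
    pdf_url = response.css('iframe::attr(src)').extract_first()
    # FilesPipeline downloads every url listed under file_urls and
    # records the download results under a 'files' field on the item.
    yield {'file_urls': [response.urljoin(pdf_url)]}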