scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Media Pipeline is not filtering the duplicate file requests #6314

Closed: Ehsan-U closed this issue 2 weeks ago

Ehsan-U commented 2 weeks ago

The built-in RFPDupeFilter works correctly for normal requests, but it does not filter duplicate requests generated through the media pipeline. Is this the expected behavior?

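from io import BytesIO
from typing import Dict, Optional

from itemadapter import ItemAdapter
from scrapy import Request
from scrapy.http import Response
from scrapy.http.request import NO_CALLBACK
from scrapy.pipelines.files import FilesPipeline
from scrapy.utils.misc import md5sum

# get_filepath is the poster's own helper; its definition is not shown in the issue.


class FilePipeline(FilesPipeline):
    """
    Responsible for processing files
    """

    def file_path(self, request: Request, response: Optional[Response] = None, info: Optional[Dict] = None, *, item: Optional[Dict] = None) -> str: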
        page = item['page']
        path = get_filepath(
            source=page.source, 
            current_page=page.url
        )
        return path

    def get_media_requests(self, item, info):
        media_reqs = []
        urls = ItemAdapter(item).get(self.files_urls_field, [])
        for url in urls:
            if 'www.afdb.org' in url:
                req = Request(url, callback=NO_CALLBACK, meta={"playwright": True})
            else:
                req = Request(url, callback=NO_CALLBACK)
            media_reqs.append(req)
        return media_reqs

    def file_downloaded(self, response, request, info, *, item=None):
        path = self.file_path(request, response=response, info=info, item=item)
        if path:
            buf = BytesIO(response.body)
            checksum = md5sum(buf)
            buf.seek(0)
            with open(path, "wb") as f:
                f.write(response.body)
            return checksum

wRAR commented 2 weeks ago

Yes, the dupefilter only applies to scheduled requests, not ones downloaded via ExecutionEngine.download().

Ehsan-U commented 2 weeks ago

> Yes, the dupefilter only applies to scheduled requests, not ones downloaded via ExecutionEngine.download().

What's the ideal approach to handling duplicates in this case? Can I use the RFPDupeFilter instance in the files pipeline?
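
For readers landing on this question: because media requests bypass the scheduler, one possible direction is to track already-requested URLs inside the pipeline itself. The following is only a minimal sketch under stated assumptions, not Scrapy's API and not the maintainers' recommendation. `SeenUrls` and its `unseen` method are hypothetical names, and the comparison is a plain URL match with the fragment dropped, which is much simpler than Scrapy's request fingerprinting.

```python
from typing import Iterable, Iterator, Set
from urllib.parse import urldefrag


class SeenUrls:
    """Hypothetical helper: remembers which file URLs were already requested."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def unseen(self, urls: Iterable[str]) -> Iterator[str]:
        """Yield only the URLs not seen before, recording each one as seen."""
        for url in urls:
            # Assumption: two URLs that differ only in the #fragment point to
            # the same file. This is NOT Scrapy's fingerprint algorithm.
            key = urldefrag(url).url
            if key not in self._seen:
                self._seen.add(key)
                yield url
```

In a pipeline like the one above, an instance could be created once (for example in `__init__`) and `get_media_requests` could iterate over `self.seen.unseen(urls)` instead of `urls`. Whether this is appropriate depends on the project; for instance, items whose files were skipped this way would not receive download results for those URLs.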

wRAR commented 2 weeks ago

Please ask questions about your code on suitable platforms: https://scrapy.org/community/