Hi @Ehsan-U
It's necessary to set the impersonate key in Request.meta so that ImpersonateDownloadHandler can handle the request; otherwise the request will be handled by the native Scrapy downloader. In the case of FilesPipeline/ImagesPipeline, you need to override the get_media_requests method:
from itemadapter import ItemAdapter
from scrapy import Request
from scrapy.http.request import NO_CALLBACK
from scrapy.pipelines.files import FilesPipeline


class CustomFilesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        urls = ItemAdapter(item).get(self.files_urls_field, [])
        return [Request(u, callback=NO_CALLBACK, meta={"impersonate": "chrome110"}) for u in urls]
This way, the requests now include the impersonate key and will be handled by ImpersonateDownloadHandler.
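For completeness, here is a minimal sketch of the project settings that would wire this in. The pipeline path myproject.pipelines.CustomFilesPipeline is only a placeholder for wherever the class above actually lives; the reactor and download-handler settings are the usual scrapy-impersonate setup and also appear in the spider's custom_settings further down.

# settings.py -- minimal sketch; the pipeline path is a placeholder
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

DOWNLOAD_HANDLERS = {
    "http": "scrapy_impersonate.ImpersonateDownloadHandler",
    "https": "scrapy_impersonate.ImpersonateDownloadHandler",
}

ITEM_PIPELINES = {
    "myproject.pipelines.CustomFilesPipeline": 1,
}

FILES_STORE = "./files"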
Thanks for the response, @jxlil
I've observed that requests generated from the media pipeline are downloaded directly by the Engine. So, just to clarify, are you certain that ImpersonateDownloadHandler will be used to handle the media requests?
Yes, once you set the impersonate key, all requests will be handled by scrapy-impersonate.
Note that for the site you mention (and probably most sites) a browser user-agent must also be added to prevent the default Scrapy user-agent (Scrapy/2.11.1 (+https://scrapy.org)) from being sent; otherwise you will still get a 403 status.
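Put together, a single media request would look roughly like this (a minimal sketch; the URL is a placeholder, and the full spider below shows the complete setup):

from scrapy import Request

request = Request(
    "https://example.com/some-file.pdf",  # placeholder URL
    meta={"impersonate": "chrome110"},  # routes the request through ImpersonateDownloadHandler
    headers={
        # send a browser user-agent instead of the default Scrapy one
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    },
)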
For example, the following spider downloads the files using scrapy-impersonate:
from typing import Any

from curl_cffi import CurlHttpVersion
from scrapy import Field, Item, Spider
from scrapy.http import Response
from scrapy.pipelines.files import FilesPipeline
from scrapy_impersonate import ImpersonateDownloadHandler


class FileItem(Item):
    # minimal item with the fields FilesPipeline expects
    file_urls = Field()
    files = Field()


class RequestsMiddleware(object):
    def process_request(self, request, spider):
        # the following headers are set on all requests
        request.headers.update(
            {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                "Accept-Language": "en",
                "Accept-Encoding": "gzip, deflate",
            }
        )
        # 'impersonate' key is set in Request.meta
        request.meta["impersonate"] = "chrome110"
        # http_version argument added to prevent curl errors
        request.meta["impersonate_args"] = {"http_version": CurlHttpVersion.V1_1}


class Example(Spider):
    name = "spider"
    start_urls = ["https://www.afdb.org/en"]

    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOADER_MIDDLEWARES": {RequestsMiddleware: 1000},
        "DOWNLOAD_HANDLERS": {
            "http": ImpersonateDownloadHandler,
            "https": ImpersonateDownloadHandler,
        },
        "ITEM_PIPELINES": {FilesPipeline: 1},
        "FILES_STORE": "./files",
    }

    def parse(self, response: Response, **kwargs: Any) -> Any:
        item = FileItem()
        item["file_urls"] = [
            "https://www.afdb.org/sites/default/files/2019/07/05/high_5_feed_africa.pdf",
            "https://www.afdb.org/sites/default/files/2019/07/05/high_5_light_up_africa.pdf",
            "https://www.afdb.org/sites/default/files/2019/07/05/hight_5_industrialize_africa.pdf",
            "https://www.afdb.org/sites/default/files/2019/07/05/high_5_integrate_africa.pdf",
            "https://www.afdb.org/sites/default/files/2019/07/05/high_5_improve_quality_of_life.pdf",
        ]
        return item
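If the snippet above is kept in a single standalone file (for example impersonate_spider.py; the filename is just illustrative), it can be run directly with scrapy runspider impersonate_spider.py, without needing a full Scrapy project, since all settings are provided through custom_settings.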
With the previous example I was able to download the 5 files without any problem. Note that it's no longer necessary to use CustomFilesPipeline, since the RequestsMiddleware already adds the impersonate key and the necessary headers to every request. I'm also sending the http_version argument (available from version 1.2.1), as I was getting a curl error when downloading the files:
2024-04-21 19:41:11 [scrapy.pipelines.files] WARNING: File (unknown-error): Error downloading file from <GET https://www.afdb.org/sites/default/files/2019/07/05/high_5_integrate_africa.pdf> referred in <None>: Failed to perform, curl: (55) Send failure: Broken pipe. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.
This curl error still appears for some files, though not all:
2024-05-02 04:02:25 [crawler.pipelines] WARNING: File (unknown-error): Error downloading file from <GET https://cites.org/sites/default/files/i/CITES_WWD_Brochure2014.pdf> referred in <None>: Failed to perform, curl: (55) Send failure: Connection reset by peer. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.
Let's suppose this website: afdb.org/en/
The standard requests get a 200 status, but the files get a 403 because the media pipeline is not using the impersonate handler.