Hi @Ehsan-U
It's necessary to set the impersonate key in Request.meta so that ImpersonateDownloadHandler can handle the request; otherwise the request will be handled by the native Scrapy downloader. In the case of FilesPipeline/ImagesPipeline, you need to override the get_media_requests method:
from itemadapter import ItemAdapter
from scrapy import Request
from scrapy.http.request import NO_CALLBACK
from scrapy.pipelines.files import FilesPipeline


class CustomFilesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        urls = ItemAdapter(item).get(self.files_urls_field, [])
        return [Request(u, callback=NO_CALLBACK, meta={"impersonate": "chrome110"}) for u in urls]
This way, the requests now include the impersonate key and will be handled by ImpersonateDownloadHandler.
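For completeness, here is a minimal sketch of the project settings that would wire this in. The pipeline path myproject.pipelines.CustomFilesPipeline is only a placeholder for wherever the class above actually lives; the reactor and download-handler settings are the usual scrapy-impersonate setup and also appear in the spider's custom_settings further down.

# settings.py -- minimal sketch; the pipeline path is a placeholder
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

DOWNLOAD_HANDLERS = {
    "http": "scrapy_impersonate.ImpersonateDownloadHandler",
    "https": "scrapy_impersonate.ImpersonateDownloadHandler",
}

ITEM_PIPELINES = {
    "myproject.pipelines.CustomFilesPipeline": 1,
}

FILES_STORE = "./files"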
Thanks for the response, @jxlil
I've observed that requests generated from the media pipeline are downloaded directly by the Engine. So, just to clarify, are you certain that ImpersonateDownloadHandler will be used to handle the media requests?
Yes, once you set the impersonate key, all requests will be handled by scrapy-impersonate.
Note that for the site you mention (and probably most sites) a browser user-agent must also be added to prevent the default Scrapy user-agent (Scrapy/2.11.1 (+https://scrapy.org)) from being sent; otherwise you will still get a 403 status.
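Put together, a single media request would look roughly like this (a minimal sketch; the URL is a placeholder, and the full spider below shows the complete setup):

from scrapy import Request

request = Request(
    "https://example.com/some-file.pdf",  # placeholder URL
    meta={"impersonate": "chrome110"},  # routes the request through ImpersonateDownloadHandler
    headers={
        # send a browser user-agent instead of the default Scrapy one
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    },
)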
For example, the following spider downloads the files using scrapy-impersonate:
from typing import Any

from curl_cffi import CurlHttpVersion
from scrapy import Field, Item, Spider
from scrapy.http import Response
from scrapy.pipelines.files import FilesPipeline
from scrapy_impersonate import ImpersonateDownloadHandler


class FileItem(Item):
    # minimal item with the fields FilesPipeline expects
    file_urls = Field()
    files = Field()


class RequestsMiddleware(object):
    def process_request(self, request, spider):
        # the following headers are set on all requests
        request.headers.update(
            {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                "Accept-Language": "en",
                "Accept-Encoding": "gzip, deflate",
            }
        )
        # 'impersonate' key is set in Request.meta
        request.meta["impersonate"] = "chrome110"
        # http_version argument added to prevent curl errors
        request.meta["impersonate_args"] = {"http_version": CurlHttpVersion.V1_1}


class Example(Spider):
    name = "spider"
    start_urls = ["https://www.afdb.org/en"]

    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOADER_MIDDLEWARES": {RequestsMiddleware: 1000},
        "DOWNLOAD_HANDLERS": {
            "http": ImpersonateDownloadHandler,
            "https": ImpersonateDownloadHandler,
        },
        "ITEM_PIPELINES": {FilesPipeline: 1},
        "FILES_STORE": "./files",
    }

    def parse(self, response: Response, **kwargs: Any) -> Any:
        item = FileItem()
        item["file_urls"] = [
            "https://www.afdb.org/sites/default/files/2019/07/05/high_5_feed_africa.pdf",
            "https://www.afdb.org/sites/default/files/2019/07/05/high_5_light_up_africa.pdf",
            "https://www.afdb.org/sites/default/files/2019/07/05/hight_5_industrialize_africa.pdf",
            "https://www.afdb.org/sites/default/files/2019/07/05/high_5_integrate_africa.pdf",
            "https://www.afdb.org/sites/default/files/2019/07/05/high_5_improve_quality_of_life.pdf",
        ]
        return item
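If the snippet above is kept in a single standalone file (for example impersonate_spider.py; the filename is just illustrative), it can be run directly with scrapy runspider impersonate_spider.py, without needing a full Scrapy project, since all settings are provided through custom_settings.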
With the previous example I was able to download the 5 files without any problem. Note that it's no longer necessary to use CustomFilesPipeline, since the RequestsMiddleware already adds the impersonate key and the necessary headers to every request. I'm also sending the http_version argument (available from version 1.2.1), as I was getting a curl error when downloading the files:
2024-04-21 19:41:11 [scrapy.pipelines.files] WARNING: File (unknown-error): Error downloading file from <GET https://www.afdb.org/sites/default/files/2019/07/05/high_5_integrate_africa.pdf> referred in <None>: Failed to perform, curl: (55) Send failure: Broken pipe. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.
This curl error still appears for some files, though not all:
2024-05-02 04:02:25 [crawler.pipelines] WARNING: File (unknown-error): Error downloading file from <GET https://cites.org/sites/default/files/i/CITES_WWD_Brochure2014.pdf> referred in <None>: Failed to perform, curl: (55) Send failure: Connection reset by peer. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.
Let's suppose this website: afdb.org/en/
The standard requests get a 200 status, but the files get a 403 because the media pipeline is not using the impersonate handler.