Open tommylge opened 7 months ago
Please provide a minimal, reproducible example.
```python
import scrapy


class AwesomeSpider(scrapy.Spider):
    name = "test_dl"
    handle_httpstatus_list = [403]

    def start_requests(self):
        # GET request
        yield scrapy.Request(
            "https://defret.in/assets/certificates/attestation_secnumacademie.pdf",
            meta={
                "playwright": True,
                "playwright_page_goto_kwargs": {
                    "wait_until": "networkidle",
                },
            },
            callback=self.parse,
        )

    async def parse(self, response):
        print(response.body)
```
Output:

```html
<!DOCTYPE html><html><head></head><body style="height: 100%; width: 100%; overflow: hidden; margin:0px; background-color: rgb(38, 38, 38);"><embed name="4C80DFDA2738145655DE7937BDA51A0F" style="position:absolute; left: 0; top: 0;" width="100%" height="100%" src="about:blank" type="application/pdf" internalid="4C80DFDA2738145655DE7937BDA51A0F"></body></html>
```

instead of the PDF bytes.
@elacuesta here is the minimal, reproducible example.
Sorry, I cannot reproduce with scrapy-playwright 0.0.33 (3122f9cc8a32694fc2e7cbedc8511ca12e65d6a0).
```python
import scrapy


class AwesomeSpider(scrapy.Spider):
    name = "test_dl"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        # "PLAYWRIGHT_BROWSER_TYPE": "firefox",  # same result with chromium and firefox
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://defret.in/assets/certificates/attestation_secnumacademie.pdf",
            meta={
                "playwright": True,
                "playwright_page_goto_kwargs": {
                    "wait_until": "networkidle",
                },
            },
        )

    async def parse(self, response):
        print("Response body size:", len(response.body))
        print("First bytes:")
        print(response.body[:200])
```
```
2023-11-16 13:46:09 [scrapy.core.engine] INFO: Spider opened
2023-11-16 13:46:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-11-16 13:46:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-11-16 13:46:09 [scrapy-playwright] INFO: Starting download handler
2023-11-16 13:46:14 [scrapy-playwright] INFO: Launching browser chromium
2023-11-16 13:46:14 [scrapy-playwright] INFO: Browser chromium launched
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://defret.in/assets/certificates/attestation_secnumacademie.pdf> (resource type: document)
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://defret.in/assets/certificates/attestation_secnumacademie.pdf>
2023-11-16 13:46:15 [scrapy-playwright] WARNING: Navigating to <GET https://defret.in/assets/certificates/attestation_secnumacademie.pdf> returned None, the response will have empty headers and status 200
2023-11-16 13:46:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://defret.in/assets/certificates/attestation_secnumacademie.pdf> (referer: None) ['playwright']
Response body size: 1868169
First bytes:
b"%PDF-1.3\n%\xe2\xe3\xcf\xd3\n9 0 obj\n<< /Type /Page /Parent 1 0 R /LastModified (D:20200619180943+02'00') /Resources 2 0 R /MediaBox [0.000000 0.000000 841.890000 595.276000] /CropBox [0.000000 0.000000 841.890000 "
2023-11-16 13:46:15 [scrapy.core.engine] INFO: Closing spider (finished)
```
```
$ scrapy version -v
Scrapy       : 2.11.0
lxml         : 4.9.3.0
libxml2      : 2.10.3
cssselect    : 1.2.0
parsel       : 1.8.1
w3lib        : 2.1.2
Twisted      : 22.10.0
Python       : 3.10.0 (default, Oct 8 2021, 09:55:22) [GCC 7.5.0]
pyOpenSSL    : 23.2.0 (OpenSSL 3.1.2 1 Aug 2023)
cryptography : 41.0.3
Platform     : Linux-5.15.0-79-generic-x86_64-with-glibc2.35
```
Okay, thanks for your fast answer. Pretty strange though: I tried many different versions and always get the issue. I guess I haven't debugged enough yet, so it seems the problem doesn't come from scrapy-playwright.
Could you tell us your Playwright version, please? I'll keep you up to date.
> Could you tell us your playwright version please?

```
$ playwright --version
Version 1.39.0
```
@elacuesta We were able to narrow down the problem to two settings. First, using the new headless mode of Chrome, like this:

```python
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "args": [
        "--headless=new",
    ],
    "ignore_default_args": [
        "--headless",
    ],
}
```
Removing this alone doesn't fix the problem. We also had to roll back the Scrapy setting REQUEST_FINGERPRINTER_IMPLEMENTATION to its default value, which is "2.6": https://docs.scrapy.org/en/latest/topics/request-response.html#request-fingerprinter-implementation
Setting it to "2.7", which seems to be recommended for new projects, makes the problem appear again, whether the new headless Chrome mode is enabled or not.
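For reference, here is a hedged `settings.py` sketch combining the two settings described above (values as discussed in this thread; whether both actually matter is examined further on):

```python
# settings.py sketch: the two settings involved in the rollback described above.

# Scrapy default is "2.6"; "2.7" is recommended for new projects but
# reportedly re-triggers the problem in this setup.
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.6"

# Chromium's new headless mode; removing these options falls back to the old mode.
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "args": ["--headless=new"],
    "ignore_default_args": ["--headless"],
}
```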
The REQUEST_FINGERPRINTER_IMPLEMENTATION setting is not relevant here; I tried several settings combinations and it did not change the results. The relevant part is the new Chromium headless mode, enabled as you mentioned:
```python
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "args": ["--headless=new"],
    "ignore_default_args": ["--headless"],
}
```
This looks like an upstream bug: the download event is not fired with the new headless mode. I've opened an upstream Playwright issue (https://github.com/microsoft/playwright-python/issues/2169), although I suspect this is actually a Chromium issue.
I just saw the update on your Playwright issue: do you think there is a chance you could integrate one of the posted workarounds into your plugin to handle this? There are also other workarounds in the linked issues.
I will have to take a look to see if the workaround applies in this case, as it was suggested way before the introduction of the new Chromium headless mode.
Thanks for your help. For now, we try to detect the PDF viewer code when using Chromium and we redirect the download to a non-Playwright spider. We basically compare the content type declared in the response headers with the real content type obtained by analyzing `response.body`. If the headers say `application/pdf` but the body says `text/html`, we redirect.
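That detection step can be sketched as a pair of small helpers. This is a hedged sketch (the function names are ours, not scrapy-playwright API) that sniffs the body's magic bytes and compares the result with the declared Content-Type:

```python
def sniff_content_type(body: bytes) -> str:
    """Best-effort MIME sniffing from the first bytes of a response body."""
    head = body.lstrip()[:256].lower()
    if head.startswith(b"%pdf-"):
        return "application/pdf"
    if head.startswith((b"<!doctype html", b"<html")):
        return "text/html"
    return "application/octet-stream"


def is_pdf_viewer_page(declared_content_type: str, body: bytes) -> bool:
    """True when the headers claim a PDF but the body is Chromium's HTML viewer."""
    declared = declared_content_type.split(";")[0].strip().lower()
    return declared == "application/pdf" and sniff_content_type(body) == "text/html"
```

A spider middleware or callback could then re-schedule any request for which `is_pdf_viewer_page` returns `True` without the `playwright` meta key.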
I'm a bit hesitant to include the mentioned workaround in the main package for now, but I realized it's possible to implement it with the existing API through the `playwright_page_init_callback` meta key. Hope that helps.
```python
import re

import scrapy


async def init_page(page, request):
    async def handle_pdf(route):
        response = await page.context.request.get(route.request)
        await route.fulfill(
            response=response,
            headers={**response.headers, "Content-Disposition": "attachment"},
        )

    await page.route(re.compile(r".*\.pdf"), handle_pdf)


class PdfSpider(scrapy.Spider):
    name = "pdf"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "args": ["--headless=new"],
            "ignore_default_args": ["--headless"],
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://defret.in/assets/certificates/attestation_secnumacademie.pdf",
            meta={
                "playwright": True,
                "playwright_page_init_callback": init_page,
            },
        )

    async def parse(self, response):
        print("Response body size:", len(response.body))
        print("First bytes:")
        print(response.body[:200])
```
Thanks for the code snippet! Unfortunately, it will not work for URLs that don't end with ".pdf", such as ones ending in "?download=true" etc. We will try to figure something out and keep you updated.
Yes, that's exactly why I don't want to add the workaround to the main package :pensive:
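For what it's worth, one way around the URL-pattern limitation might be to route every request and decide from the response's Content-Type instead. A hedged sketch, assuming Playwright's `Route.fetch`/`Route.fulfill` API and reusing the `playwright_page_init_callback` approach above (`force_attachment` is a hypothetical helper of ours, not part of either library):

```python
def force_attachment(content_type: str) -> bool:
    """Decide from the declared Content-Type, so URLs like '?download=true' are covered."""
    return content_type.split(";")[0].strip().lower() == "application/pdf"


async def init_page(page, request):
    async def handle_route(route):
        # Perform the request ourselves so the response headers can be inspected
        response = await route.fetch()
        if force_attachment(response.headers.get("content-type", "")):
            # A "Content-Disposition: attachment" header sidesteps the inline PDF viewer
            await route.fulfill(
                response=response,
                headers={**response.headers, "Content-Disposition": "attachment"},
            )
        else:
            await route.fulfill(response=response)

    # Match every request instead of only URLs ending in ".pdf"
    await page.route("**/*", handle_route)
```

Routing everything through `Route.fetch` has a cost (each request is re-issued from the API request context), so this is a sketch to evaluate, not a drop-in fix.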
I am facing an issue when using Chromium to download a PDF file: `response.body` is the viewer plugin's HTML, not the PDF bytes.
There's already a fix concerning this here: https://github.com/scrapy-plugins/scrapy-playwright/commit/0140b90381a0da92194661a0d13b7436661e80a0
It worked for a month, but not anymore; I'm still getting the issue :/
My code hasn't changed since your fix that worked.
The related issue: https://github.com/scrapy-plugins/scrapy-playwright/issues/184