scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

Cannot download binary file (PDF) with Chromium headless=new mode #243

Open tommylge opened 7 months ago

tommylge commented 7 months ago

I am facing an issue when using Chromium to download a PDF file: response.body contains the viewer plugin's HTML, not the PDF bytes.

There is already a fix related to this here: https://github.com/scrapy-plugins/scrapy-playwright/commit/0140b90381a0da92194661a0d13b7436661e80a0

It worked for a month, but not anymore; I'm still getting the issue :/

My code hasn't changed since your fix that worked.

The related issue: https://github.com/scrapy-plugins/scrapy-playwright/issues/184

elacuesta commented 7 months ago

Please provide a minimal, reproducible example.

tommylge commented 7 months ago
import scrapy

class AwesomeSpider(scrapy.Spider):
    name = "test_dl"
    handle_httpstatus_list = [403]

    def start_requests(self):
        # GET request
        yield scrapy.Request(
            "https://defret.in/assets/certificates/attestation_secnumacademie.pdf",
            meta={
                "playwright": True,
                "playwright_page_goto_kwargs": {
                    "wait_until": "networkidle",
                },
            },
            callback=self.parse,
        )

    async def parse(self, response):
        print(response.body)

output:

<!DOCTYPE html><html><head></head><body style="height: 100%; width: 100%; overflow: hidden; margin:0px; background-color: rgb(38, 38, 38);"><embed name="4C80DFDA2738145655DE7937BDA51A0F" style="position:absolute; left: 0; top: 0;" width="100%" height="100%" src="about:blank" type="application/pdf" internalid="4C80DFDA2738145655DE7937BDA51A0F"></body></html>

instead of bytes

@elacuesta here is the minimal, reproducible example.

elacuesta commented 7 months ago

Sorry, I cannot reproduce with scrapy-playwright 0.0.33 (3122f9cc8a32694fc2e7cbedc8511ca12e65d6a0).

import scrapy

class AwesomeSpider(scrapy.Spider):
    name = "test_dl"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        # "PLAYWRIGHT_BROWSER_TYPE": "firefox",  # same result with chromium and firefox
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://defret.in/assets/certificates/attestation_secnumacademie.pdf",
            meta={
                "playwright": True,
                "playwright_page_goto_kwargs": {
                    "wait_until": "networkidle",
                },
            },
        )

    async def parse(self, response):
        print("Response body size:", len(response.body))
        print("First bytes:")
        print(response.body[:200])
2023-11-16 13:46:09 [scrapy.core.engine] INFO: Spider opened
2023-11-16 13:46:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-11-16 13:46:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-11-16 13:46:09 [scrapy-playwright] INFO: Starting download handler
2023-11-16 13:46:14 [scrapy-playwright] INFO: Launching browser chromium
2023-11-16 13:46:14 [scrapy-playwright] INFO: Browser chromium launched
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://defret.in/assets/certificates/attestation_secnumacademie.pdf> (resource type: document)
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://defret.in/assets/certificates/attestation_secnumacademie.pdf>
2023-11-16 13:46:15 [scrapy-playwright] WARNING: Navigating to <GET https://defret.in/assets/certificates/attestation_secnumacademie.pdf> returned None, the response will have empty headers and status 200
2023-11-16 13:46:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://defret.in/assets/certificates/attestation_secnumacademie.pdf> (referer: None) ['playwright']
Response body size: 1868169
First bytes:
b"%PDF-1.3\n%\xe2\xe3\xcf\xd3\n9 0 obj\n<< /Type /Page /Parent 1 0 R /LastModified (D:20200619180943+02'00') /Resources 2 0 R /MediaBox [0.000000 0.000000 841.890000 595.276000] /CropBox [0.000000 0.000000 841.890000 "
2023-11-16 13:46:15 [scrapy.core.engine] INFO: Closing spider (finished)
$ scrapy version -v
Scrapy       : 2.11.0
lxml         : 4.9.3.0
libxml2      : 2.10.3
cssselect    : 1.2.0
parsel       : 1.8.1
w3lib        : 2.1.2
Twisted      : 22.10.0
Python       : 3.10.0 (default, Oct  8 2021, 09:55:22) [GCC 7.5.0]
pyOpenSSL    : 23.2.0 (OpenSSL 3.1.2 1 Aug 2023)
cryptography : 41.0.3
Platform     : Linux-5.15.0-79-generic-x86_64-with-glibc2.35
tommylge commented 7 months ago

Okay, thanks for your fast answer. Pretty strange though: I tried with many different versions and always get the issue. I guess I haven't debugged enough yet, so it seems like it doesn't come from scrapy-playwright.

Could you tell us your playwright version please? I'll keep you up to date.

elacuesta commented 7 months ago

Could you tell us your playwright version please?

$ playwright --version               
Version 1.39.0
kinoute commented 7 months ago

@elacuesta We were able to narrow down the problem to two settings. First, using the new headless mode of Chrome, like this:

PLAYWRIGHT_LAUNCH_OPTIONS = {
    'args': [
        '--headless=new',
    ],
    'ignore_default_args': [
        '--headless',
    ],
}

Removing this alone doesn't fix the problem, though. We also had to roll back to the default value of the Scrapy setting REQUEST_FINGERPRINTER_IMPLEMENTATION, which is 2.6: https://docs.scrapy.org/en/latest/topics/request-response.html#request-fingerprinter-implementation

Setting it to 2.7, which seems to be recommended for new projects, makes the problem appear again, whether the new headless Chrome mode is enabled or not.

elacuesta commented 7 months ago

The REQUEST_FINGERPRINTER_IMPLEMENTATION setting is not relevant here, I tried several settings combinations and it did not change the results. The relevant part is the new Chromium headless mode, enabled as you mentioned:

PLAYWRIGHT_LAUNCH_OPTIONS = {
    'args': ['--headless=new'],
    'ignore_default_args': ['--headless'],
}

This looks like an upstream bug, the download event is not being fired with the new headless mode. I've opened an upstream Playwright issue (https://github.com/microsoft/playwright-python/issues/2169), although I suspect this is actually a Chromium issue.

kinoute commented 7 months ago

I just saw the update on your Playwright issue: do you think there is a chance you could integrate one of the posted workarounds into your plugin to handle this? There are also other workarounds in the linked issues.

elacuesta commented 7 months ago

I will have to take a look to see whether the workaround applies in this case, as it was suggested well before the introduction of the new Chromium headless mode.

kinoute commented 7 months ago

Thanks for your help. For now, we try to detect the PDF viewer code when using Chromium and we redirect the download to a non-Playwright spider.

We basically compare the content type declared in the response headers with the actual content type inferred from response.body. If the headers say application/pdf but the body is text/html, we redirect.
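The detection described above can be sketched as a small helper. This is an illustrative sketch, not code from the spider in question: `body_looks_like_pdf` and `is_broken_pdf_response` are hypothetical names, and real PDF bytes are identified by the `%PDF-` magic number (visible in the successful output earlier in this thread).

```python
def body_looks_like_pdf(body: bytes) -> bool:
    """Real PDF bytes start with the %PDF- magic number."""
    return body.lstrip()[:5] == b"%PDF-"


def is_broken_pdf_response(headers: dict, body: bytes) -> bool:
    """True when the headers claim a PDF but the body is the HTML viewer page.

    `headers` is a plain dict with lowercase str keys here for simplicity;
    a Scrapy response stores headers as bytes, so adapt accordingly.
    """
    declared = headers.get("content-type", "").lower()
    return "application/pdf" in declared and not body_looks_like_pdf(body)
```

In a Scrapy callback, one might re-yield the request without the `playwright` meta key whenever `is_broken_pdf_response` returns True.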

elacuesta commented 7 months ago

I'm a bit hesitant to include the mentioned workaround in the main package for now, but I realized it's possible to implement it with the existing API through the playwright_page_init_callback meta key. Hope that helps.

import re
import scrapy

async def init_page(page, request):
    async def handle_pdf(route):
        response = await page.context.request.get(route.request)
        await route.fulfill(
            response=response,
            headers={**response.headers, "Content-Disposition": "attachment"},
        )

    await page.route(re.compile(r".*\.pdf"), lambda route: handle_pdf(route))

class PdfSpider(scrapy.Spider):
    name = "pdf"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "args": ["--headless=new"],
            "ignore_default_args": ["--headless"],
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://defret.in/assets/certificates/attestation_secnumacademie.pdf",
            meta={
                "playwright": True,
                "playwright_page_init_callback": init_page,
            },
        )

    async def parse(self, response):
        print("Response body size:", len(response.body))
        print("First bytes:")
        print(response.body[:200])
kinoute commented 7 months ago

Thanks for the code snippet! Unfortunately, it will not work for URLs that don't end with ".pdf", such as ones using "?download=true" etc. We will try to figure something out and keep you updated.

elacuesta commented 7 months ago

Yes, that's exactly why I don't want to add the workaround to the main package :pensive: