scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

Cannot download binary file (PDF) any browser #275

Closed: tommylge closed this issue 2 weeks ago

tommylge commented 2 weeks ago
[2024-06-19 13:46:07,560][ERROR] [Url-Check] Unkown <Type: <class 'TypeError'>> exception: {}
[2024-06-19 13:46:07,665][ERROR] [Base Spider] error: <twisted.python.failure.Failure builtins.TypeError: expected string or bytes-like object, got 'NoneType'>
[2024-06-19 13:46:07,666][ERROR] Traceback (most recent call last):
[2024-06-19 13:46:07,666][ERROR]   File "/opt/venv/lib/python3.11/site-packages/twisted/internet/defer.py", line 916, in errback
[2024-06-19 13:46:07,667][ERROR]     self._startRunCallbacks(fail)
[2024-06-19 13:46:07,667][ERROR]   File "/opt/venv/lib/python3.11/site-packages/twisted/internet/defer.py", line 984, in _startRunCallbacks
[2024-06-19 13:46:07,667][ERROR]     self._runCallbacks()
[2024-06-19 13:46:07,668][ERROR]   File "/opt/venv/lib/python3.11/site-packages/twisted/internet/defer.py", line 1078, in _runCallbacks
[2024-06-19 13:46:07,668][ERROR]     current.result = callback(  # type: ignore[misc]
[2024-06-19 13:46:07,668][ERROR]   File "/opt/venv/lib/python3.11/site-packages/twisted/internet/defer.py", line 1949, in _gotResultInlineCallbacks
[2024-06-19 13:46:07,669][ERROR]     _inlineCallbacks(r, gen, status, context)
[2024-06-19 13:46:07,669][ERROR] --- <exception caught here> ---
[2024-06-19 13:46:07,669][ERROR]   File "/opt/venv/lib/python3.11/site-packages/twisted/internet/defer.py", line 1999, in _inlineCallbacks
[2024-06-19 13:46:07,670][ERROR]     result = context.run(
[2024-06-19 13:46:07,670][ERROR]   File "/opt/venv/lib/python3.11/site-packages/twisted/python/failure.py", line 519, in throwExceptionIntoGenerator
[2024-06-19 13:46:07,671][ERROR]     return g.throw(self.value.with_traceback(self.tb))
[2024-06-19 13:46:07,671][ERROR]   File "/opt/venv/lib/python3.11/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
[2024-06-19 13:46:07,672][ERROR]     return (yield download_func(request=request, spider=spider))
[2024-06-19 13:46:07,672][ERROR]   File "/opt/venv/lib/python3.11/site-packages/twisted/internet/defer.py", line 1251, in adapt
[2024-06-19 13:46:07,672][ERROR]     extracted: _SelfResultT | Failure = result.result()
[2024-06-19 13:46:07,673][ERROR]   File "/opt/venv/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 340, in _download_request
[2024-06-19 13:46:07,673][ERROR]     return await self._download_request_with_page(request, page, spider)
[2024-06-19 13:46:07,673][ERROR]   File "/opt/venv/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 406, in _download_request_with_page
[2024-06-19 13:46:07,673][ERROR]     raise download["exception"]
[2024-06-19 13:46:07,673][ERROR]   File "/opt/venv/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 446, in _handle_download
[2024-06-19 13:46:07,674][ERROR]     if failure := await dwnld.failure():
[2024-06-19 13:46:07,674][ERROR]   File "/opt/venv/lib/python3.11/site-packages/playwright/async_api/_generated.py", line 6843, in failure
[2024-06-19 13:46:07,674][ERROR]     return mapping.from_maybe_impl(await self._impl_obj.failure())
[2024-06-19 13:46:07,674][ERROR]   File "/opt/venv/lib/python3.11/site-packages/playwright/_impl/_download.py", line 55, in failure
[2024-06-19 13:46:07,674][ERROR]     return await self._artifact.failure()
[2024-06-19 13:46:07,674][ERROR]   File "/opt/venv/lib/python3.11/site-packages/playwright/_impl/_artifact.py", line 45, in failure
[2024-06-19 13:46:07,675][ERROR]     return patch_error_message(await self._channel.send("failure"))
[2024-06-19 13:46:07,675][ERROR]   File "/opt/venv/lib/python3.11/site-packages/playwright/_impl/_helper.py", line 228, in patch_error_message
[2024-06-19 13:46:07,675][ERROR]     match = re.match(r"(\w+)(: expected .*)", message)
[2024-06-19 13:46:07,676][ERROR]   File "/usr/local/lib/python3.11/re/__init__.py", line 166, in match
[2024-06-19 13:46:07,676][ERROR]     return _compile(pattern, flags).match(string)
[2024-06-19 13:46:07,677][ERROR] builtins.TypeError: expected string or bytes-like object, got 'NoneType'

I get this error with webkit, firefox, and chromium, all with headless: True.
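The last frames of the traceback show Playwright's patch_error_message helper passing the download's failure message straight into re.match. A minimal sketch of that call site (pattern copied from the traceback; the function name here is illustrative) shows why a download with no failure message, i.e. None, blows up:

```python
import re

# Pattern as it appears in playwright/_impl/_helper.py in the traceback.
FAILURE_PATTERN = r"(\w+)(: expected .*)"


def match_failure(message):
    # re.match only accepts str or bytes. A download that completed
    # without error has no failure message (None), which raises the
    # TypeError seen in the traceback instead of returning no match.
    return re.match(FAILURE_PATTERN, message)


try:
    match_failure(None)
except TypeError as exc:
    print(exc)  # e.g. "expected string or bytes-like object, got 'NoneType'"
```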

Code to reproduce:

import scrapy


class AwesomeSpider(scrapy.Spider):
    name = "test_dl"

    def start_requests(self):
        # GET request
        yield scrapy.Request(
            "https://defret.in/assets/certificates/attestation_secnumacademie.pdf",
            meta={
                "playwright": True,
            },
            callback=self.parse,
        )

    async def parse(self, response):
        print(response.body)
elacuesta commented 2 weeks ago

Cannot reproduce. Printing the first 100 chars with print(response.body[:100]):

2024-06-19 11:32:50 [scrapy.core.engine] INFO: Spider opened
2024-06-19 11:32:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-06-19 11:32:50 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-06-19 11:32:50 [scrapy-playwright] INFO: Starting download handler
2024-06-19 11:32:50 [scrapy-playwright] INFO: Starting download handler
2024-06-19 11:32:55 [scrapy-playwright] INFO: Launching browser chromium
2024-06-19 11:32:56 [scrapy-playwright] INFO: Browser chromium launched
2024-06-19 11:32:56 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-06-19 11:32:56 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-06-19 11:32:56 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://....pdf> (resource type: document)
2024-06-19 11:32:56 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://....pdf>
2024-06-19 11:32:56 [scrapy-playwright] WARNING: Navigating to <GET https://....pdf> returned None, the response will have empty headers and status 200
2024-06-19 11:32:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://....pdf> (referer: None) ['playwright']
b"%PDF-1.3\n%\xe2\xe3\xcf\xd3\n9 0 obj\n<< /Type /Page /Parent 1 0 R /LastModified (D:20200619180943+02'00') /Resourc"
2024-06-19 11:32:56 [scrapy.core.engine] INFO: Closing spider (finished)
2024-06-19 11:32:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
$ python -c "import scrapy_playwright; print(scrapy_playwright.__version__)"
0.0.35

$ scrapy version -v
Scrapy       : 2.11.2
lxml         : 4.9.3.0
libxml2      : 2.10.3
cssselect    : 1.2.0
parsel       : 1.8.1
w3lib        : 2.1.2
Twisted      : 24.3.0
Python       : 3.10.10 (main, Feb 16 2023, 02:58:25) [Clang 14.0.0 (clang-1400.0.29.202)]
pyOpenSSL    : 23.2.0 (OpenSSL 3.1.2 1 Aug 2023)
cryptography : 41.0.3
Platform     : macOS-14.4.1-x86_64-i386-64bit
tommylge commented 2 weeks ago

Found this issue on the playwright-python repo: https://github.com/microsoft/playwright-python/issues/2408

It appears to come from there. A fix has been released; it seems I didn't have the right playwright version. Sorry about that, and thanks for your answer :)
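For reference, a None-safe version of the helper can be sketched like this. The function name and regex are taken from the traceback; the body is an illustrative guard, not the exact upstream patch:

```python
import re
from typing import Optional


def patch_error_message(message: Optional[str]) -> Optional[str]:
    # Illustrative guard: a download that completed successfully has no
    # failure message, so return early instead of letting re.match
    # raise TypeError on None.
    if message is None:
        return None
    match = re.match(r"(\w+)(: expected .*)", message)
    if match is None:
        return message
    # ... upstream rewrites matched driver error messages here ...
    return message
```

With this guard, await dwnld.failure() in scrapy-playwright's _handle_download simply sees None for a successful download instead of crashing.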