microsoft / playwright

Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.
https://playwright.dev
Apache License 2.0

[BUG] PDF directly access does not emit download event #20771

Closed wizpresso-steve-cy-fan closed 1 year ago

wizpresso-steve-cy-fan commented 1 year ago

Context:

Code Snippet

import asyncio
import aiofiles
from playwright.async_api import async_playwright
import json

preference = {
    "plugins": {
        "always_open_pdf_externally": True,
        "open_pdf_in_system_reader": True
    },

}

async def handle(route):
    response = await route.fetch()
    if 'content-type' in response.headers and response.headers['content-type'] == 'application/pdf':
        response.headers['Content-Disposition'] = 'attachment'
    await route.fulfill(response=response, headers=response.headers)

async def main():
    async with aiofiles.tempfile.NamedTemporaryFile('w', suffix='.json') as f:
        await f.write(json.dumps(preference))
        await f.flush()
        await f.close()
        print(f.name)
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True, channel="chrome", ignore_default_args=["--initial-preferences-file"], args=[fr'--initial-preferences-file="{f.name}"'])
            context = await browser.new_context(accept_downloads=True)
            try:
                await context.route("*", handle)
                page = await context.new_page()
                async with page.expect_download() as download_info:
                    await page.goto("https://www1.hkexnews.hk/listedco/listconews/gem/2023/0209/2023020900150_c.pdf")
                download = await download_info.value
                path = await download.path()
                print(path)
            finally:
                await context.close()

asyncio.run(main())

Describe the bug

I want to capture a PDF download, so I tested it by navigating directly to the PDF URL. It does not work as expected: page.goto fails with net::ERR_ABORTED and the download event is never emitted.

C:\Users\SteveFan\AppData\Local\Temp\tmp8wgnr4sg.json
Traceback (most recent call last):
  File "c:\Users\SteveFan\python\playwright-scrape\playwright_scrape\__main__.py", line 54, in <module>
    asyncio.run(main())
  File "C:\tools\Anaconda3\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\tools\Anaconda3\lib\asyncio\base_events.py", line 647, in run_until_complete
    return future.result()
  File "c:\Users\SteveFan\python\playwright-scrape\playwright_scrape\__main__.py", line 40, in main
    await page.goto("https://www1.hkexnews.hk/listedco/listconews/gem/2023/0209/2023020900150_c.pdf", timeout=0)
  File "C:\Users\SteveFan\python\playwright-scrape\.venv\lib\site-packages\playwright\async_api\_generated.py", line 9135, in goto
    await self._impl_obj.goto(
  File "C:\Users\SteveFan\python\playwright-scrape\.venv\lib\site-packages\playwright\_impl\_page.py", line 491, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "C:\Users\SteveFan\python\playwright-scrape\.venv\lib\site-packages\playwright\_impl\_frame.py", line 147, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "C:\Users\SteveFan\python\playwright-scrape\.venv\lib\site-packages\playwright\_impl\_connection.py", line 44, in send
    return await self._connection.wrap_api_call(
  File "C:\Users\SteveFan\python\playwright-scrape\.venv\lib\site-packages\playwright\_impl\_connection.py", line 419, in wrap_api_call
    return await cb()
  File "C:\Users\SteveFan\python\playwright-scrape\.venv\lib\site-packages\playwright\_impl\_connection.py", line 79, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: net::ERR_ABORTED at https://www1.hkexnews.hk/listedco/listconews/gem/2023/0209/2023020900150_c.pdf
=========================== logs ===========================
navigating to "https://www1.hkexnews.hk/listedco/listconews/gem/2023/0209/2023020900150_c.pdf", waiting until "load"
============================================================

Also, I cannot get --initial-preferences-file to apply my config in non-headless mode. The built-in PDF viewer still opens.

wizpresso-steve-cy-fan commented 1 year ago

The preference args seem to be a separate issue. This is the command line Chrome is launched with:

"C:\Program Files\Google\Chrome\Application\chrome.exe" --disable-field-trial-config --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-back-forward-cache --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --no-default-browser-check --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=ImprovedCookieControls,LazyFrameLoading,GlobalMediaControls,DestroyProfileOnBrowserClose,MediaRouter,DialMediaRouteProvider,AcceptCHFrame,AutoExpandDetailsElement,CertificateTransparencyComponentUpdater,AvoidUnnecessaryBeforeUnloadCheckSync,Translate --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --disable-sync --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --no-sandbox "--initial-preferences-file=\"C:\Users\SteveFan\AppData\Local\Temp\tmpd3j4weyc.json\"" --user-data-dir=C:\Users\SteveFan\AppData\Local\Temp\playwright_chromiumdev_profile-Aw5szN --remote-debugging-pipe --no-startup-window

mxschmitt commented 1 year ago

Why don't you just do the following?

response = await page.request.get("https://www1.hkexnews.hk/listedco/listconews/gem/2023/0209/2023020900150_c.pdf")
print(response.status)
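
A minimal sketch of how the suggestion above could be extended to keep the PDF bytes rather than only print the status. The helper name fetch_pdf_bytes is hypothetical and not part of Playwright; it assumes page is a Playwright Page whose request context returns a response exposing ok, status and body().

```python
async def fetch_pdf_bytes(page, url):
    """Fetch url through the page's request context and return the raw body.

    fetch_pdf_bytes is a hypothetical helper for illustration; `page` is
    assumed to be a Playwright Page (or anything with the same shape).
    """
    response = await page.request.get(url)
    if not response.ok:
        # Surface HTTP errors instead of silently returning an error page.
        raise RuntimeError(f"unexpected status {response.status} for {url}")
    return await response.body()
```

Inside an async Playwright session this would be called as `data = await fetch_pdf_bytes(page, url)`. Because the request bypasses page navigation, no download event or PDF viewer is involved.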

wizpresso-steve-cy-fan commented 1 year ago

@mxschmitt This is just a simplified example; later I will need to handle a download triggered by a button click. As far as I understand, clicking a button to download the file also uses goto behind the scenes, so I think both cases should behave the same.

So far, this seems to be working:

import asyncio
from playwright.async_api import async_playwright
import json
from anyio import Path
from aiofiles.tempfile import TemporaryDirectory

preference = {
    "plugins": {
        "always_open_pdf_externally": True,
    },
}

async def handle(route):
    response = await route.fetch()
    if 'content-type' in response.headers and response.headers['content-type'] == 'application/pdf':
        response.headers['Content-Disposition'] = 'attachment'
    await route.fulfill(response=response, headers=response.headers)

async def main():
    async with TemporaryDirectory() as d:
        preference_dir = Path(d) / "Default"
        await preference_dir.mkdir(0o777, parents=True, exist_ok=True)  # mode must be octal
        await (preference_dir / "Preferences").write_text(json.dumps(preference))

        async with async_playwright() as p:
            context = await p.chromium.launch_persistent_context(d, headless=False, accept_downloads=True)
            try:
                await context.route("*", handle)
                page = await context.new_page()
                async with page.expect_download() as download_info:
                    try:
                        await page.goto("https://www1.hkexnews.hk/listedco/listconews/gem/2023/0209/2023020900150_c.pdf")
                    except Exception:  # goto is aborted (net::ERR_ABORTED) once the download starts
                        download = await download_info.value
                        print(await download.path())
            finally:
                await context.close()

asyncio.run(main())

This combines the tricks from https://github.com/microsoft/playwright/issues/3509#issuecomment-675441299 and https://stackoverflow.com/a/75201448/3289081
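
One caveat with the handle function above: it compares the content-type header for exact equality, so a value that carries parameters, such as application/pdf; charset=binary, would not match. A sketch of a more tolerant check follows; is_pdf is a hypothetical helper, and it assumes headers is a plain dict of header names to values.

```python
def is_pdf(headers):
    """Return True if the Content-Type header names the PDF media type.

    is_pdf is a hypothetical helper for illustration. Header names are
    matched case-insensitively and parameters after ';' are ignored.
    """
    for name, value in headers.items():
        if name.lower() == "content-type":
            media_type = value.split(";", 1)[0].strip().lower()
            return media_type == "application/pdf"
    return False
```

In the route handler this could replace the two-part condition, for example `if is_pdf(response.headers):`.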

My end goal is to capture the PDF download and send the file stream into stdout/remote pipe.
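
A rough sketch of the last step, assuming the download has already been saved to disk (for example at the path returned by download.path()). The helper name stream_file and the 64 KiB chunk size are my own choices for illustration, not anything from Playwright.

```python
import sys


def stream_file(path, out, chunk_size=64 * 1024):
    """Copy the file at path to the binary stream out in fixed-size chunks.

    Returns the number of bytes written. stream_file is a hypothetical
    helper; chunking avoids loading the whole PDF into memory.
    """
    total = 0
    with open(path, "rb") as src:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            total += len(chunk)
    out.flush()
    return total
```

For stdout, the binary buffer has to be used, e.g. `stream_file(path, sys.stdout.buffer)`, since writing bytes to the text-mode sys.stdout raises a TypeError. Any other object with write and flush methods, such as a pipe or socket file, should work the same way.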

N.B. In headless mode I can trigger the PDF download without a persistent context, but that approach apparently does not behave well in non-headless mode, so the suggestion at https://github.com/microsoft/playwright/issues/3509#issuecomment-1369299639 does not work for me.

wizpresso-steve-cy-fan commented 1 year ago

Closing this as it seems like posting on the Python repo would be better.