kaliiiiiiiiii / Selenium-Driverless

undetected Selenium without usage of chromedriver
https://kaliiiiiiiiii.github.io/Selenium-Driverless/
Other
487 stars 61 forks source link

driver.get throws TimeoutError on file downloads #140

Closed milahu closed 6 months ago

milahu commented 7 months ago

when i open some url with driver.get(url) and the server returns a file download (Content-Disposition: attachment) then driver.get hangs and throws a TimeoutError

as a workaround, im checking the ~/Downloads folder and wait until the file was written

the response can be either a file download, or an html error page, so driver.get(url, wait_load=False) would be wrong

CDP has the experimental Browser.downloadWillBegin

Fired when page is about to start a download.

but download_will_begin is never called

    # never called?
    def download_will_begin(*a, **k):
        print(f"download_will_begin: a={a} k={k}")

    await driver.add_cdp_listener("Browser.downloadWillBegin", download_will_begin)
    # wait for add_cdp_listener?!
    await asyncio.sleep(3)

    await driver.get(url, referrer=referrer, wait_load=False)
    await asyncio.sleep(99999999)

    await driver.remove_cdp_listener("Browser.downloadWillBegin", download_will_begin)

another possible solution: use a different downloads folder for each download

    await driver.get("chrome://settings/downloads")
    # click on #changeDownloadsPath
    # enter new downloads folder
    # click OK
    await driver.get(url, wait_load=False)
    # wait for the file to be created in the downloads folder
    # or check driver.page_source for html error page

another possible solution: run chromium in verbose mode and parse stdout log messages

another possible solution: monitor the chrome://downloads/ page which has source urls and destination filepaths the page is based on chrome://downloads/downloads.js see also How best to test file download links using Selenium see also Selenium Webdriver with Shadow DOM

kaliiiiiiiiii commented 7 months ago

but download_will_begin is never called

thanks for testing

await driver.get("chrome://settings/downloads")
# click on #changeDownloadsPath
# enter new downloads folder
# click OK

No need fo hackish stuff - there's Browser.setDownloadBehavior to set the path. Would however prefer a static download path.

another possible solution: run chromium in verbose mode and parse stdout log messages

That causes issues as subprocess.PIPE can make chrome hang when getting a lot of messages. see https://github.com/electron/electron/issues/35399#issuecomment-1751523513 And yeah - ofc we'd run a thread to processs the pipe stdout. However, I'm not sure about timing etc. as well.

chrome://downloads/

uh yeah, likely would even work with passing JS events. However, I'm not sure if that's gonna work with contexts (incognito) and ofc, we preferably don't want a open tab all the time.

My conclusion

all of those approaches seem a bit "hackish" to me. More convincing to me are here passive request events as discussed at https://github.com/kaliiiiiiiiii/Selenium-Driverless/issues/148#issuecomment-1889808446 as well.

Also: Possibly, the 20s timeout at https://github.com/kaliiiiiiiiii/Selenium-Driverless/blob/24a3513305f833fac600ec8e31bcd5e9df955162/src/selenium_driverless/types/target.py#L228 should not be hardcoded. (assumption: CMD returns when network request is done, relevant for large file downloads)

milahu commented 7 months ago

download_will_begin is never called

1. we must enable download events with Browser.setDownloadBehavior

eventsEnabled - boolean - Whether to emit download events (defaults to false).

2. we must send these CDP commands to driver.base_target not to driver and not to driver.current_target

then the Browser.downloadWillBegin event is fired

passive request events

file downloads are not available to Network.getResponseBody - it just hangs forever i could parse the original filename from the content-disposition header of the Network.responseReceived event, but i dont get the actual downloaded file path

in my aiohttp_chromium im using behavior = "allowAndName" to use the download's guid as basename in the downloads folder to avoid filename collisions, and to get stable filepaths

        base_target = await driver.base_target
        args = dict(
            behavior = "allowAndName", # use guid as basename
            downloadPath = self._downloads_path,
            eventsEnabled = True, # enable download events
        )
        await base_target.execute_cdp_cmd("Browser.setDownloadBehavior", args)
        await base_target.add_cdp_listener("Browser.downloadWillBegin", downloadWillBegin)
        await base_target.add_cdp_listener("Browser.downloadProgress", downloadProgress)

in the downloadWillBegin event, i get the filename as args["suggestedFilename"]

so this is not too complex, but requires setting up some event listeners which could interfere with listeners added by the user so maybe this is out of scope for Selenium-Driverless

kaliiiiiiiiii commented 7 months ago
  1. we must enable download events with Browser.setDownloadBehavior

Ah yep that's good to know, thanks.

so this is not too complex, but requires setting up some event listeners which could interfere with listeners added by the user so maybe this is out of scope for Selenium-Driverless

Actually, here I thought about having 2 sockets for each target. One for users and one for internals. This won't resolve all interfaces (such as network interception), however, event subscriptions are isolated to me knowledge.

milahu commented 7 months ago

i suggest to close this as "wontfix"

i assume that the original selenium has the same limitation and selenium_driverless should be a drop-in replacement for selenium (plus async)

and this is too complex to solve in selenium instead, consumers should wrap the driver.get call parse response headers, handle streams, handle CDP events, etc (this is why i started aiohttp_chromium)

kaliiiiiiiiii commented 7 months ago

and selenium_driverless should be a drop-in replacement for selenium (plus async)

Actually, I plan to move away from selenium. Selenium it's syntax, not allowing simultanous targets, contexts, missing & unconventional structure just is a pain. That's why I'm currently working on addin some documentation (finally).

and this is too complex to solve in selenium instead, consumers should wrap the driver.get call parse response headers, handle streams, handle CDP events, etc

Actually, I was thinking about fixing:

Possibly, the 20s timeout at

Selenium-Driverless/src/selenium_driverless/types/target.py

Line 228 in 24a3513 in get

asyncio.create_task(self.execute_cdp_cmd("Page.navigate", args, timeout=20)) 

should not be hardcoded. (assumption: CMD returns when network request is done, relevant for large file downloads)

for sure, then adding an option at ChromeOptions for the download behaviour and wait for first_completed at Browser.downloadWillBegin or loadEventFired

So maybe let's keep this issue up for now such that I don't forget about:)

kaliiiiiiiiii commented 6 months ago

This one's gonna be fixed by https://github.com/kaliiiiiiiiii/Selenium-Driverless/commit/51028b42dba02c9aec7b2b3d91a813556f0f2212#diff-6ddf02e9f090683f5175795f0bba0dcc9bc08e4e663bd0c3ab594eefc79b86eaR214-R252

kaliiiiiiiiii commented 6 months ago

This one's gonna be fixed by 51028b4#diff-6ddf02e9f090683f5175795f0bba0dcc9bc08e4e663bd0c3ab594eefc79b86eaR214-R252

resloved now. https://kaliiiiiiiiii.github.io/Selenium-Driverless/classes/ChromeOptions/#selenium_driverless.types.options.Options.downloads_dir

milahu commented 6 months ago

related:

before calling Network.getResponseBody we should wait for the Network.loadingFinished event otherwise Network.getResponseBody can return only the first 65536 bytes (64KiB) of the response

see also https://github.com/ChromeDevTools/devtools-protocol/issues/44 https://github.com/milahu/aiohttp_chromium/commit/23d47d9142528df90b9b8e5cb8fb2470ea8e82f4

this is the problem with CDP events: the user may want to handle CDP events and then handling these events in selenium_driverless is "too much"

the user will want to handle both events: Network.loadingFinished for inline content Browser.downloadWillBegin for file downloads

...or do you want to handle both events in selenium_driverless?

kaliiiiiiiiii commented 6 months ago

related:

before calling Network.getResponseBody we should wait for the Network.loadingFinished event otherwise Network.getResponseBody can return only the first 65536 bytes (64KiB) of the response

see also ChromeDevTools/devtools-protocol#44 milahu/aiohttp_chromium@0fbd29c

this is the problem with CDP events: the user may want to handle CDP events and then handling these events in selenium_driverless is "too much"

=> multiple WS connections to a target considerable.

the user will want to handle both events: Network.loadingFinished for inline content Browser.downloadWillBegin for file downloads

...or do you want to handle both events in selenium_driverless?

hm doesn't Page.navigate return after the full response has been received? At least for big file downloads, I noticed that it takes longer

milahu commented 6 months ago

doesn't Page.navigate return after the full response has been received?

aah, probably yes

i observed the "65536 bytes issue" when fetching assets (child requests)