kaliiiiiiiiii / Selenium-Driverless

undetected Selenium without usage of chromedriver
https://kaliiiiiiiiii.github.io/Selenium-Driverless/

driver.page_source fails on non-html pages: CDPError: Could not find node with given id #148

Closed milahu closed 6 months ago

milahu commented 6 months ago

driver.page_source fails on non-html pages like http://httpbin.org/get

$ curl -I -s http://httpbin.org/get | grep -i ^Content-Type:
Content-Type: application/json

cdp_socket.exceptions.CDPError: {'code': -32000, 'message': 'Could not find node with given id'}

class WebElement(JSRemoteObj):
    # ...
    @property
    async def source(self):
        args = self._args_builder
        res = await self.__target__.execute_cdp_cmd("DOM.getOuterHTML", args)
        return res["outerHTML"]

obviously, DOM.getOuterHTML works only on html pages

different from #127

javascript to the rescue...

possible solution: Page.FrameResource should give the mime type

possible solution: document.body.innerText gives the text of plain text pages, which can be detected with document.body.firstChild.tagName == "PRE"

different mimetypes will need different solutions:

- image/jpeg, image/png, image/*: Javascript: how to get image as bytes from a page (without redownloading)
- application/pdf: ?
- video/mp4, video/*: ?
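A minimal sketch of the dispatch idea above, assuming the MIME type is already known (for example from Page.FrameResource). The function name `pick_source_strategy` and the three strategy labels are hypothetical, and which types Chrome actually wraps in a `<pre>` element is an assumption here, not something verified against the browser:

```python
def pick_source_strategy(mime_type: str) -> str:
    """Map a Content-Type value to a (hypothetical) strategy label."""
    # drop parameters such as "; charset=utf-8" and normalise the case
    base = mime_type.split(";", 1)[0].strip().lower()
    if base == "text/html":
        # regular pages: DOM.getOuterHTML is expected to work
        return "outer_html"
    if base in ("text/plain", "application/json") or base.endswith("+json"):
        # assumption: Chrome renders these as text inside a <pre> element
        return "body_text"
    # images, PDFs, videos, ...: only the original response bytes make sense
    return "raw_bytes"
```

Anything that is neither HTML nor known text falls through to `raw_bytes`, so unknown types are never forced through the DOM.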

possible solution: capture all responses by default (see also #123). then driver.response_bytes could return the original response bytes

kaliiiiiiiiii commented 6 months ago

hmm, don't bare images, for example, get embedded/parsed automatically into HTML? If I remember correctly, inspecting a page such as http://httpbin.org/get shows some embedded text within HTML.

possible solution: capture all responses by default, see also #123 then driver.response_bytes could return the original response bytes

Yep, thought about that as well. However, using Fetch for that would interfere with users trying to implement request interception themselves. Options I'd see here are:

  1. provide some abstract class for Requests (however, I don't really have the time :/)
  2. use Network.enable internally (passive) and advise users to use Fetch.enable. However, I'm not sure if they could still interfere. And also, it's deprecated :/
  3. use Fetch without interception (=> passive) on a new websocket connection. However, I don't know if Fetch.enable is actually per websocket, and not global per TargetId
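
A rough sketch of what the passive bookkeeping in option 2 could look like. It assumes events shaped like the CDP Network.responseReceived payload (`requestId`, `response.url`, `response.mimeType`, `response.status`). The class `ResponseLog` is hypothetical, and how the events get wired up inside the driver is deliberately left out:

```python
from dataclasses import dataclass, field


@dataclass
class ResponseLog:
    """Keeps the latest response metadata per URL without touching the traffic."""

    by_url: dict = field(default_factory=dict)

    def on_response_received(self, params: dict) -> None:
        # `params` is assumed to be the payload of a Network.responseReceived event
        response = params["response"]
        self.by_url[response["url"]] = {
            "request_id": params["requestId"],
            "mime_type": response.get("mimeType", ""),
            "status": response.get("status"),
        }

    def request_id_for(self, url: str):
        # returns None when no response for this URL has been seen yet
        entry = self.by_url.get(url)
        return entry["request_id"] if entry else None
```

The stored request id could then be passed to Network.getResponseBody to fetch the body on demand, so nothing is intercepted or paused along the way.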
kaliiiiiiiiii commented 6 months ago

@milahu Also, wouldn't Page.getResourceContent be worth considering here as well? Why not use this one?

milahu commented 6 months ago

Page.getResourceContent

yes : )

# $ python3 -m asyncio

from selenium_driverless import webdriver
from selenium_driverless.types.by import By
driver = await webdriver.Chrome()
url = "http://httpbin.org/get"
await driver.get(url)

target = await driver.current_target
# for a page target, the main frame id is expected to match the target id
frame_id = target.id
args = {"frameId": frame_id, "url": url}
res = await target.execute_cdp_cmd("Page.getResourceContent", args)
res["content"]
# '{\n  "args": {}, \n  ........ \n  "url": "http://httpbin.org/get"\n}\n'
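
Page.getResourceContent also returns a `base64Encoded` flag next to `content`, which matters for binary resources such as images. A small sketch for turning the result into bytes. The helper name `resource_bytes` is hypothetical, and encoding text content as UTF-8 is an assumption (the original response may have used a different charset):

```python
import base64


def resource_bytes(res: dict) -> bytes:
    """Convert a Page.getResourceContent result dict into bytes."""
    content = res["content"]
    if res.get("base64Encoded"):
        # binary resources arrive base64-encoded
        return base64.b64decode(content)
    # text resources arrive as a plain string; UTF-8 is assumed here
    return content.encode("utf-8")
```

With the REPL session above, `resource_bytes(res)` would give the JSON body as bytes.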

Could not find node with given id

not sure where that error came from. in repl, it just works

# $ python3 -m asyncio

from selenium_driverless import webdriver
from selenium_driverless.types.by import By
driver = await webdriver.Chrome()
url = "http://httpbin.org/get"
await driver.get(url)

await driver.page_source
# '<html><head><meta name="color-scheme" content="light dark"></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{\n  "args": {}, \n  ........ \n  "url": "http://httpbin.org/get"\n}\n</pre></body></html>'

elem = await driver.find_element(By.XPATH, "/html/body/pre")
await elem.text
# '{\n  "args": {}, \n  ........ \n  "url": "http://httpbin.org/get"\n}\n'
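
If only the `page_source` string is at hand, the text inside the `<pre>` wrapper can also be pulled out offline with the standard library. This is just a sketch: the helper `unwrap_pre` is hypothetical, and it assumes the wrapper looks like the output above (it simply concatenates the text of every `<pre>` it finds):

```python
from html.parser import HTMLParser


class _PreText(HTMLParser):
    """Collects the text found inside <pre> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._depth = 0
        self.found = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1
            self.found = True

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        # only keep text while we are inside a <pre>
        if self._depth:
            self.chunks.append(data)


def unwrap_pre(page_source: str):
    """Return the text inside <pre>, or None if the page has no <pre>."""
    parser = _PreText()
    parser.feed(page_source)
    parser.close()
    return "".join(parser.chunks) if parser.found else None
```

For a real HTML page that happens to contain `<pre>` blocks this would return their text too, so it only makes sense once the page is known to be a plain text response.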

Fetch.enable

yes, see https://github.com/kaliiiiiiiiii/Selenium-Driverless/issues/123#issuecomment-1890393341