kaliiiiiiiiii / Selenium-Driverless

undetected Selenium without usage of chromedriver
https://kaliiiiiiiiii.github.io/Selenium-Driverless/

network interception with `Fetch.enable` breaks cloudflare #123

Closed milahu closed 2 months ago

milahu commented 7 months ago

im trying to capture all responses as described in readme#use-events

cloudflare says

Please unblock challenges.cloudflare.com to proceed.

chrome shows a warning in the address bar

your connection to this site is not secure

fixed by adding options.add_argument("--disable-web-security") so that chrome doesn't enforce the same-origin policy

test_selenium_driverless.py

```py
#!/usr/bin/env python3

import asyncio
import base64
import sys
import time
import traceback

from cdp_socket.exceptions import CDPError
from selenium_driverless import webdriver


async def on_request(params, global_conn):
    url = params["request"]["url"]
    _params = {"requestId": params['requestId']}
    if params.get('responseStatusCode') in [301, 302, 303, 307, 308]:
        # redirected request
        return await global_conn.execute_cdp_cmd("Fetch.continueResponse", _params)
    else:
        try:
            body = await global_conn.execute_cdp_cmd("Fetch.getResponseBody", _params, timeout=1)
        except CDPError as e:
            if e.code == -32000 and e.message == 'Can only get response body on requests captured after headers received.':
                print(params, "\n", file=sys.stderr)
                traceback.print_exc()
                await global_conn.execute_cdp_cmd("Fetch.continueResponse", _params)
            else:
                raise e
        else:
            start = time.monotonic()
            body_decoded = base64.b64decode(body['body'])

            # modify body here

            body_modified = base64.b64encode(body_decoded).decode("ascii")
            fulfill_params = {"responseCode": 200, "body": body_modified}
            fulfill_params.update(_params)
            _time = time.monotonic() - start
            if _time > 0.01:
                print(f"decoding took long: {_time} s")
            await global_conn.execute_cdp_cmd("Fetch.fulfillRequest", fulfill_params)
            print("Mocked response", url)


async def main():
    options = webdriver.ChromeOptions()
    options.add_argument("--window-size=500,900")

    # fix: please unblock challenges.cloudflare.com to proceed
    # Don't enforce the same-origin policy
    options.add_argument("--disable-web-security")

    async with webdriver.Chrome(options=options, max_ws_size=2 ** 30) as driver:
        driver.base_target.socket.on_closed.append(lambda code, reason: print(f"chrome exited"))

        global_conn = driver.base_target
        await driver.get("about:blank")
        await global_conn.execute_cdp_cmd("Fetch.enable", cmd_args={"patterns": [{"requestStage": "Response", "urlPattern":"*"}]})
        await global_conn.add_cdp_listener("Fetch.requestPaused", lambda data: on_request(data, global_conn))
        await driver.get(
            #'https://wikipedia.org',
            "https://nowsecure.nl/#relax",  # test cloudflare
            timeout=60, wait_load=False)
        while True:
            #time.sleep(10)  # no. cloudflare would hang
            await asyncio.sleep(10)


asyncio.run(main())
```

kaliiiiiiiiii commented 7 months ago

I can confirm this. However, I suspect this to be a timing leak, and Cloudflare therefore sends a 403 back => not really a way to fix.

@milahu or any other thoughts//ideas on that?

juhacz commented 7 months ago

The problem is that one of Cloudflare's engineers is watching this repository... :)

kaliiiiiiiiii commented 7 months ago

The problem is that one of Cloudflare's engineers is watching this repository... :)

@juhacz Likely, yes.

Soo in case some @cloudfare staff is reading this:

Why not hire me directly instead of needing someone to analyse & understand the code on here ? :)

juhacz commented 7 months ago

@kaliiiiiiiiii Because we need people like you more :) I Suggest creating a profile at https://www.buymeacoffee.com/ I think people will confirm my words :)

milahu commented 7 months ago

I suspect this to be a timing leak

you mean the python response handler is too slow?

or maybe the continueResponse/fulfillRequest logic has a bug (note: continueResponse is experimental)

but yeah, it seems to be a new problem. for the error message "Please unblock challenges.cloudflare.com to proceed." i only find a tapatalk.com thread from 2023-10-30 with no solution

any other thoughts//ideas on that?

so far i used the "export HAR" function of chrome devtools network but that is slower than capturing the live traffic

the exported HAR file does not include the bodies of binary responses, which is actually good for large binaries: i dont want to store a 1GB response body in RAM, but let chrome write it to the filesystem
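
For reference, a HAR export can be post-processed with the standard library alone. The sketch below assumes the usual HAR 1.2 layout (`log.entries[].response.content`), where `text` may be missing when the body was not stored and `encoding` is `base64` for base64-encoded bodies. The function name `iter_har_bodies` is made up for this example.

```python
import base64
import json


def iter_har_bodies(har_text):
    """Yield (url, body_bytes_or_None) for every entry of a HAR document."""
    har = json.loads(har_text)
    for entry in har["log"]["entries"]:
        url = entry["request"]["url"]
        content = entry["response"].get("content", {})
        text = content.get("text")
        if text is None:
            # Body was not stored in the export (e.g. a binary response).
            yield url, None
        elif content.get("encoding") == "base64":
            yield url, base64.b64decode(text)
        else:
            yield url, text.encode("utf-8")
```

Entries that yield `None` are the ones whose bodies would have to be fetched some other way.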

chromium is open source, so it should be easy to find how the "record network log" command works

an alternative would be a local http proxy i guess Fetch.enable also works with a http proxy inside of chrome and maybe that proxy is visible to cloudflare

in the long term, they will replace captchas with government ID logins and to bypass that, we will need p2p scraping tools...

kaliiiiiiiiii commented 7 months ago

@kaliiiiiiiiii Because we need people like you more :) I Suggest creating a profile at https://www.buymeacoffee.com/ I think people will confirm my words :)

@juhacz added:) https://github.com/kaliiiiiiiiii#support-me

milahu commented 7 months ago

chromium is open source, so it should be easy to find how the "record network log" command works

chromium devtools sources

[chromium/src/third_party/devtools-frontend/src/front_end/panels/network/network-meta.ts](https://source.chromium.org/chromium/chromium/src/+/main:third_party/devtools-frontend/src/front_end/panels/network/network-meta.ts;l=199)

UIStrings.recordNetworkLog

```ts
UI.ActionRegistration.registerActionExtension({
  actionId: 'network.toggle-recording',
  category: UI.ActionRegistration.ActionCategory.NETWORK,
  iconClass: UI.ActionRegistration.IconClass.START_RECORDING,
  toggleable: true,
  toggledIconClass: UI.ActionRegistration.IconClass.STOP_RECORDING,
  toggleWithRedColor: true,
  contextTypes() {
    return maybeRetrieveContextTypes(Network => [Network.NetworkPanel.NetworkPanel]);
  },
  async loadActionDelegate() {
    const Network = await loadNetworkModule();
    return new Network.NetworkPanel.ActionDelegate();
  },
  options: [
    {
      value: true,
      title: i18nLazyString(UIStrings.recordNetworkLog),
    },
    {
      value: false,
      title: i18nLazyString(UIStrings.stopRecordingNetworkLog),
    },
  ],
```

[chromium/src/third_party/devtools-frontend/src/front_end/panels/network/NetworkPanel.ts](https://source.chromium.org/chromium/chromium/src/+/main:third_party/devtools-frontend/src/front_end/panels/network/NetworkPanel.ts;l=933)

network.toggle-recording

```ts
export class ActionDelegate implements UI.ActionRegistration.ActionDelegate {
  handleAction(context: UI.Context.Context, actionId: string): boolean {
    const panel = context.flavor(NetworkPanel);
    if (panel === null) {
      return false;
    }
    switch (actionId) {
      case 'network.toggle-recording': {
        panel.toggleRecord(!panel.recordLogSetting.get());
        return true;
      }
```

panel.toggleRecord

```ts
toggleRecord(toggled: boolean): void {
  this.toggleRecordAction.setToggled(toggled);
  if (this.recordLogSetting.get() !== toggled) {
    this.recordLogSetting.set(toggled);
  }
  this.networkLogView.setRecording(toggled);
  if (!toggled && this.filmStripRecorder) {
    this.filmStripRecorder.stopRecording(this.filmStripAvailable.bind(this));
  }
}
```

this.filmStripRecorder

```ts
private willReloadPage(): void {
  if (this.pendingStopTimer) {
    clearTimeout(this.pendingStopTimer);
    delete this.pendingStopTimer;
  }
  if (this.isShowing() && this.filmStripRecorder) {
    this.filmStripRecorder.startRecording();
  }
}
```

this.filmStripRecorder

```ts
this.filmStripRecorder = new FilmStripRecorder(this.networkLogView.timeCalculator(), this.filmStripView);
```

FilmStripRecorder

```ts
export class FilmStripRecorder implements TraceEngine.TracingManager.TracingManagerClient {
  // ...
  startRecording(): void {
    // ...
    const tracingManager = SDK.TargetManager.TargetManager.instance().scopeTarget()?.model(TraceEngine.TracingManager.TracingManager);
    // ...
    this.tracingManager = tracingManager;
    this.resourceTreeModel = this.tracingManager.target().model(SDK.ResourceTreeModel.ResourceTreeModel);
    this.tracingModel = new TraceEngine.Legacy.TracingModel();
    void this.tracingManager.start(this, '-*,disabled-by-default-devtools.screenshot', '');
    // ...
  }
  // ...
  stopRecording(callback: (filmStrip: TraceEngine.Extras.FilmStrip.Data) => void): void {
    // ...
    this.tracingManager.stop();
    // ...
  }
}
```

→ `FilmStripRecorder implements TraceEngine.TracingManager.TracingManagerClient`

SDK.TargetManager.TargetManager.instance

```ts
import * as SDK from '../../core/sdk/sdk.js';
```

[chromium/src/third_party/devtools-frontend/src/front_end/core/sdk/sdk.ts](https://source.chromium.org/chromium/chromium/src/+/main:third_party/devtools-frontend/src/front_end/core/sdk/sdk.ts)

```ts
import * as TargetManager from './TargetManager.js';
```

[chromium/src/third_party/devtools-frontend/src/front_end/core/sdk/TargetManager.ts](https://source.chromium.org/chromium/chromium/src/+/main:third_party/devtools-frontend/src/front_end/core/sdk/TargetManager.ts)

TraceEngine.TracingManager.TracingManager

[chromium/src/third_party/devtools-frontend/src/front_end/models/trace/TracingManager.ts](https://source.chromium.org/chromium/chromium/src/+/main:third_party/devtools-frontend/src/front_end/models/trace/TracingManager.ts)

```ts
export class TracingManager extends SDK.SDKModel.SDKModel {
  readonly #tracingAgent: ProtocolProxyApi.TracingApi;
  // ...
  async start(client: TracingManagerClient, categoryFilter: string, options: string): Promise {
    // ...
    const args = {
      bufferUsageReportingInterval: bufferUsageReportingIntervalMs,
      categories: categoryFilter,
      options: options,
      transferMode: Protocol.Tracing.StartRequestTransferMode.ReportEvents,
    };
    const response = await this.#tracingAgent.invoke_start(args);
    // ...
  }
```

[chromium/src/third_party/devtools-frontend/src/front_end/generated/protocol-proxy-api.d.ts](https://source.chromium.org/chromium/chromium/src/+/main:third_party/devtools-frontend/src/front_end/generated/protocol-proxy-api.d.ts;l=3393)

```ts
/**
 * API generated from Protocol commands and events.
 */
declare namespace ProtocolProxyApi {
  // ...
  export interface TracingApi {
    // ...
    invoke_start(params: Protocol.Tracing.StartRequest): Promise;
```

bufferUsageReportingInterval

chromium sources

[chromium/src/out/Debug/gen/content/browser/devtools/protocol/tracing.cc](https://source.chromium.org/chromium/chromium/src/+/main:out/Debug/gen/content/browser/devtools/protocol/tracing.cc;l=380)

bufferUsageReportingInterval

```cc
struct startParams : public crdtp::DeserializableProtocolObject {
  Maybe categories;
  Maybe options;
  Maybe bufferUsageReportingInterval;
  Maybe transferMode;
  Maybe streamFormat;
  Maybe streamCompression;
  Maybe traceConfig;
  Maybe perfettoConfig;
  Maybe tracingBackend;
  DECLARE_DESERIALIZATION_SUPPORT();
};
```

startParams

```cc
void DomainDispatcherImpl::start(const crdtp::Dispatchable& dispatchable) {
  // Prepare input parameters.
  auto deserializer = crdtp::DeferredMessage::FromSpan(dispatchable.Params())->MakeDeserializer();
  startParams params;
  if (!startParams::Deserialize(&deserializer, &params)) {
    ReportInvalidParams(dispatchable, deserializer);
    return;
  }
  m_backend->Start(
      std::move(params.categories),
      std::move(params.options),
      std::move(params.bufferUsageReportingInterval),
      std::move(params.transferMode),
      std::move(params.streamFormat),
      std::move(params.streamCompression),
      std::move(params.traceConfig),
      std::move(params.perfettoConfig),
      std::move(params.tracingBackend),
      std::make_unique(weakPtr(), dispatchable.CallId(), dispatchable.Serialized()));
}
```

or simply: Tracing.start

kaliiiiiiiiii commented 7 months ago

@milahu

you mean the python response handler is too slow?

yep, or maybe even the interception in Chromium's C++ code is too slow over a single websocket.

  1. Long-term workaround here would be using smth like selenium-wire, this however requires some development to fix the SSL pinning.

or maybe the continueResponse/fulfillRequest logic has a bug (note: continueResponse is experimental)

Yep there for sure are some bugs. What I as well could think of is that maybe some iframes don't get intercepted correctly, and therefore have a detectable difference to the main frame.

so far i used the "export HAR" function of chrome devtools network but that is slower than capturing the live traffic

Yep that works as well of course, however more a workaround:)

an alternative would be a local http proxy i guess Fetch.enable also works with a http proxy inside of chrome and maybe that proxy is visible to cloudflare

See 1. I assumed chrome intercepts directly between frames | boringssl and doesn't tunnel it through a proxy after boringssl. Maybe we can find some source-code on that?

another thing to try is

  1. Network.setRequestInterception (deprecated tho).

Soo feel free to share a POC & status if you try that

kaliiiiiiiiii commented 7 months ago

Yep there for sure are some bugs. What I as well could think of is that maybe some iframes don't get intercepted correctly, and therefore have a detectable difference to the main frame.

That would then explain why disabling site isolation works

milahu commented 7 months ago

interception

for my use case, i dont need any active interception of requests/responses i just need a passive live-stream of http traffic

so i will use Tracing.start

edit: no. the Tracing.dataCollected events are only sent after Tracing.end and the Tracing.dataCollected events dont contain http traffic 0__o

i still dont understand how devtools network log gets the live network traffic. the network log uses Tracing.start only to get the trace categories "-*,disabled-by-default-devtools.screenshot"

milahu commented 7 months ago

an alternative would be a local http proxy

selenium-wire uses a patched version of mitmproxy as http proxy

this also allows for active network interception without chromium --disable-web-security because we can tell chromium to trust the proxy's certificate

kaliiiiiiiiii commented 7 months ago

an alternative would be a local http proxy

selenium-wire uses a patched version of mitmproxy as http proxy

this also allows for active network interception without chromium --disable-web-security because we can tell chromium to trust the proxy's certificate

still pretty sure the SSL/TLS fingerprint doesn't match to chrome as it doesn't use boringssl tho. see https://github.com/wkeeling/selenium-wire/issues/215#issuecomment-794362654

kaliiiiiiiiii commented 6 months ago

Interesting note here that:

from cdp_socket.utils.utils import launch_chrome, random_port
from cdp_socket.socket import CDPSocket
import os
import asyncio

global sock1

async def on_resumed(params):
    global sock1
    await sock1.exec("Fetch.continueRequest", {"requestId": params['requestId']})
    print(params["request"]["url"])

async def main():
    global sock1
    PORT = random_port()
    process = launch_chrome(PORT)

    async with CDPSocket(PORT) as base_socket:
        targets = await base_socket.targets
        target = targets[0]
        sock1 = await base_socket.get_socket(target)
        await sock1.exec("Network.clearBrowserCookies")
        await sock1.exec("Fetch.enable")
        sock1.add_listener("Fetch.requestPaused", on_resumed)
        await sock1.exec("Page.navigate", {"url": "https://nowsecure.nl#relax"})
        await asyncio.sleep(5)

    os.kill(process.pid, 15)

asyncio.run(main())

works just fine

milahu commented 6 months ago

works just fine

this works for requests, but not for responses because Fetch.getResponseBody always throws CDPError -32000

test.py

```py
#!/usr/bin/env python3

# https://github.com/kaliiiiiiiiii/Selenium-Driverless/issues/123#issuecomment-1858803756

from cdp_socket.utils.utils import launch_chrome, random_port
from cdp_socket.socket import CDPSocket
from cdp_socket.exceptions import CDPError
import os
import asyncio
import json
import base64
import sys
import time
import traceback

global sock1


async def on_request_paused(params):
    global sock1
    url = params["request"]["url"]
    url_clean = url.split("?")[0]
    if len(url_clean) > 60:
        url_clean = url_clean[:60] + "..."
    _params = {"requestId": params['requestId']}
    #if params.get('responseStatusCode') in [301, 302, 303, 307, 308]:
    #    # redirected request
    #    return await sock1.exec("Fetch.continueResponse", _params)
    try:
        #print("Fetch.getResponseBody ...", url_clean)
        body = await sock1.exec("Fetch.getResponseBody", _params, timeout=30)
    except CDPError as e:
        #print("Fetch.getResponseBody CDPError", url_clean)
        if e.code == -32000:
            # Can only get response body on HeadersReceived pattern matched requests.
            print("Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse", url_clean)
            #print("Fetch.continueResponse ...", url_clean)
            res = await sock1.exec("Fetch.continueResponse", _params, timeout=30)
            #print("Fetch.continueResponse done", url_clean)
            return res
        else:
            print("Fetch.getResponseBody CDPError raise", url_clean)
            raise e
    else:
        print("Fetch.getResponseBody done", url_clean)
        start = time.monotonic()
        body_decoded = base64.b64decode(body['body'])

        # modify body here

        body_modified = base64.b64encode(body_decoded).decode("ascii")
        fulfill_params = {"responseCode": 200, "body": body_modified}
        fulfill_params.update(_params)
        _time = time.monotonic() - start
        if _time > 0.01:
            print(f"decoding took long: {_time} s")
        print("Fetch.fulfillRequest ...")
        res = await sock1.exec("Fetch.fulfillRequest", fulfill_params, timeout=30)
        print("Fetch.fulfillRequest done", url_clean)
        print("Mocked response", url_clean)
        return res


async def main():
    global sock1
    PORT = random_port()
    process = launch_chrome(PORT)

    async with CDPSocket(PORT) as base_socket:
        targets = await base_socket.targets
        target = targets[0]
        sock1 = await base_socket.get_socket(target)
        await sock1.exec("Network.clearBrowserCookies")
        await sock1.exec("Fetch.enable")
        sock1.add_listener("Fetch.requestPaused", on_request_paused)
        # timeout: fix: asyncio.exceptions.TimeoutError
        await sock1.exec("Page.navigate", {"url": "https://nowsecure.nl#relax"}, timeout=30)
        print("waiting after Page.navigate")
        await asyncio.sleep(5)

    os.kill(process.pid, 30)


asyncio.run(main())
```
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://nowsecure.nl/
waiting after Page.navigate
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://nowsecure.nl/cdn-cgi/styles/challenges.css
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://nowsecure.nl/cdn-cgi/challenge-platform/h/g/orchestr...
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://challenges.cloudflare.com/turnstile/v0/g/74bd6362/ap...
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://nowsecure.nl/favicon.ico
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://nowsecure.nl/cdn-cgi/challenge-platform/h/g/flow/ov1...
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://nowsecure.nl/favicon.ico
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://challenges.cloudflare.com/cdn-cgi/challenge-platform...

similar... https://github.com/cloud-browser/scrapy-cloud-browser/blob/main/scrapy_cloud_browser/scenarist/page.py
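
A possible reason for the -32000 errors in the log above: the script calls `Fetch.enable` without `patterns`, and per the CDP docs a request pattern's `requestStage` defaults to `Request`, while `Fetch.getResponseBody` only works on requests paused at the `Response` stage. The sketch below has not been tested against Cloudflare. The helper names are made up for illustration, and the code only builds the parameters and classifies a `Fetch.requestPaused` event (CDP documents that a request is at the response stage if `responseStatusCode` or `responseErrorReason` is present).

```python
def fetch_enable_params(url_pattern="*"):
    """Build Fetch.enable params that pause requests at the Response stage."""
    return {"patterns": [{"urlPattern": url_pattern, "requestStage": "Response"}]}


def paused_stage(event):
    """Classify a Fetch.requestPaused event as 'Request' or 'Response'."""
    if "responseStatusCode" in event or "responseErrorReason" in event:
        return "Response"
    return "Request"
```

With cdp_socket this would presumably be passed as `await sock1.exec("Fetch.enable", fetch_enable_params())`, matching the `exec(method, params)` calls in the script above. A handler could then call `Fetch.getResponseBody` only when `paused_stage(params)` returns `"Response"`.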

milahu commented 6 months ago

chrome://net-export/ could be useful for passive capturing of traffic

Click the button to start logging future network activity to a file on disk. The log includes details of network activity from all of Chrome, including incognito and non-incognito tabs, visited URLs, and information about the network configuration

via chrome://net-internals/

kaliiiiiiiiii commented 6 months ago

Looks like Network.setRequestInterception has the same issues. Wonder tho why it's flagged as "Insecure", even though the request is over HTTPS

```python
import asyncio
import base64
import sys
import time
import traceback

from cdp_socket.exceptions import CDPError
from selenium_driverless import webdriver


async def on_request(params, global_conn):
    url = params["request"]["url"]
    _params = {"interceptionId": params['interceptionId']}
    if params.get('responseStatusCode') in [301, 302, 303, 307, 308]:
        # redirected request
        return await global_conn.execute_cdp_cmd("Network.continueInterceptedRequest", _params)
    else:
        try:
            body = await global_conn.execute_cdp_cmd("Network.getResponseBodyForInterception", _params, timeout=1)
        except CDPError as e:
            if e.code == -32000 and e.message == 'Can only get response body on requests captured after headers received.':
                print(params, "\n", file=sys.stderr)
                traceback.print_exc()
                await global_conn.execute_cdp_cmd("Fetch.continueResponse", _params)
            else:
                raise e
        else:
            start = time.monotonic()
            body_encoded = base64.b64decode(body['body'])

            # modify body here

            body_modified = base64.b64encode(body_encoded).decode()
            fulfill_params = {"rawResponse": body_modified}
            fulfill_params.update(_params)
            _time = time.monotonic() - start
            if _time > 0.01:
                print(f"decoding took long: {_time} s")
            await global_conn.execute_cdp_cmd("Network.continueInterceptedRequest", fulfill_params)
            print("Mocked response", url)


async def main():
    options = webdriver.ChromeOptions()
    async with webdriver.Chrome(options=options, max_ws_size=2 ** 30) as driver:
        driver.base_target.socket.on_closed.append(lambda code, reason: print(f"chrome exited"))

        global_conn = driver.current_target
        await driver.get("about:blank")
        await global_conn.execute_cdp_cmd("Network.enable", {"maxTotalBufferSize": 1_000_000,  # ~1 MB (sizes are in bytes)
                                                             "maxResourceBufferSize": 1_000_000,
                                                             "maxPostDataSize": 1_000_000
                                                             })
        await global_conn.execute_cdp_cmd("Network.setRequestInterception", {"patterns": [{"urlPattern": "*", "interceptionStage": "HeadersReceived"}]})
        await global_conn.add_cdp_listener("Network.requestIntercepted", lambda data: on_request(data, global_conn))
        await driver.get(
            'https://nowsecure.nl',
            timeout=60, wait_load=False)
        while True:
            await asyncio.sleep(10)


asyncio.run(main())
```

milahu commented 6 months ago

wonder tho why it's flaged as "Insecure", eventho the request is over HTTPS

i guess it uses a local https proxy with a self-signed certificate without adding that certificate as "trusted cert" to ~/.pki/nssdb/

but still, this fails to bypass cloudflare

Please unblock challenges.cloudflare.com to proceed.

kaliiiiiiiiii commented 6 months ago

Also interesting here, that local overrides with the chrome devtools just work fine.

i guess it uses a local https proxy with a self-signed certificate without adding that certificate as "trusted cert" to ~/.pki/nssdb/

ahh yep, that makes sense

but still, this fails to bypass cloudflare

maybe there's a way to detect self-signed certificate usage? If no, it's probably timing or SSL//TLS fingerprinting I guess

I see 2 possible approaches here:

  1. check if we can access that over a chrome extension (check if existing ones work) @milahu feel free to lmk if you find a working one. Getting the source-code & analysing shouldn't be that hard.
  2. What if we, instead of modifying the body binary, point the url to a local webserver?
milahu commented 6 months ago

for now i gave up on intercepting requests... chrome seems to make it really hard, also to provide security against MITM attacks

probably i would try the frida route as i described in https://github.com/wkeeling/selenium-wire/issues/656#issuecomment-1848393185

kaliiiiiiiiii commented 6 months ago

for now i gave up on intercepting requests... chrome seems to make it really hard, also to provide security against MITM attacks

probably i would try the frida route as i described in wkeeling/selenium-wire#656 (comment)

Well yeah, even though I assume that the memory manipulation / DLL hooking solutions are specific to:

kaliiiiiiiiii commented 6 months ago

At

chrome://net-export/ could be useful for passive capturing of traffic

Click the button to start logging future network activity to a file on disk. The log includes details of network activity from all of Chrome, including incognito and non-incognito tabs, visited URLs, and information about the network configuration

via chrome://net-internals/

Uhh I think passive capturing works as well with Fetch.enable or Network.setRequestInterception as long as you don't modify the body btw

kaliiiiiiiiii commented 6 months ago

Even changing request headers works just fine


import asyncio
import base64
import sys
import time
import traceback

from cdp_socket.exceptions import CDPError

from selenium_driverless import webdriver

async def on_request(params, global_conn):
    url = params["request"]["url"]
    _params = {"interceptionId": params['interceptionId']}
    if params.get('responseStatusCode') in [301, 302, 303, 307, 308]:
        # redirected request
        return await global_conn.execute_cdp_cmd("Network.continueInterceptedRequest", _params)
    else:

        fulfill_params = {"headers":params["request"]["headers"]}
        fulfill_params["headers"]["test"] = "Hello World!"
        fulfill_params.update(_params)
        await global_conn.execute_cdp_cmd("Network.continueInterceptedRequest", fulfill_params)
        print(url)

async def main():
    options = webdriver.ChromeOptions()
    async with webdriver.Chrome(options=options, max_ws_size=2 ** 30) as driver:
        driver.base_target.socket.on_closed.append(lambda code, reason: print(f"chrome exited"))
        global_conn = driver.current_target
        await driver.get("about:blank")
        await global_conn.execute_cdp_cmd("Network.enable", {"maxTotalBufferSize": 1_000_000,  # ~1 MB (sizes are in bytes)
                                                             "maxResourceBufferSize": 1_000_000,
                                                             "maxPostDataSize": 1_000_000
                                                             })
        await global_conn.execute_cdp_cmd("Network.setRequestInterception",
                                          {"patterns": [{"urlPattern": "*",
                                                         # "interceptionStage": "HeadersReceived"
                                                         }]})
        await global_conn.add_cdp_listener("Network.requestIntercepted", lambda data: on_request(data, global_conn))
        await driver.get(
            'https://nowsecure.nl',
            timeout=60, wait_load=False)
        while True:
            await asyncio.sleep(10)

asyncio.run(main())
milahu commented 6 months ago

print(url)

and where is the response body?

milahu commented 5 months ago

and where is the response body?

Network.getResponseBody

```py
#!/usr/bin/env python3

import asyncio
from selenium_driverless import webdriver
from selenium_driverless.types.by import By
import base64


async def main():

    driver = await webdriver.Chrome()
    #await asyncio.sleep(1)

    target = None

    async def requestWillBeSent(args):
        #print("requestWillBeSent", args)
        print("requestWillBeSent", args["request"]["url"])

    async def requestWillBeSentExtraInfo(args):
        print("requestWillBeSentExtraInfo", args)

    async def responseReceived(args):
        # TODO better. get target of this response
        nonlocal target
        #print("responseReceived", args)
        status = args["response"]["status"]
        url = args["response"]["url"]
        _type = args["response"]["headers"]["Content-Type"]
        # TODO better. detect when response data is ready
        # fix: No data found for resource with given identifier
        await asyncio.sleep(1)
        args = {
            "requestId": args["requestId"],
        }
        body = await target.execute_cdp_cmd("Network.getResponseBody", args)
        body = base64.b64decode(body["body"]) if body["base64Encoded"] else body["body"]
        print("responseReceived", status, url, _type, repr(body[:20]) + "...")

    async def responseReceivedExtraInfo(args):
        print("responseReceivedExtraInfo", args)

    async def targetCreated(args):
        print("targetCreated", args)

    async def targetInfoChanged(args):
        #print("targetInfoChanged", args)
        print("targetInfoChanged")

    target = await driver.current_target
    #print("target.id", target.id)

    # enable Target events
    args = {
        "discover": True,
        #"filter": ...
    }
    await target.execute_cdp_cmd("Target.setDiscoverTargets", args)
    await target.add_cdp_listener("Target.targetCreated", targetCreated)
    await target.add_cdp_listener("Target.targetInfoChanged", targetInfoChanged)

    #print("driver.targets", await driver.targets)

    # enable Network events
    args = {
        "maxTotalBufferSize": 1_000_000,  # ~1 MB (sizes are in bytes)
        "maxResourceBufferSize": 1_000_000,
        "maxPostDataSize": 1_000_000
    }
    await target.execute_cdp_cmd("Network.enable", args)
    await target.add_cdp_listener("Network.requestWillBeSent", requestWillBeSent)
    #await target.add_cdp_listener("Network.requestWillBeSentExtraInfo", requestWillBeSentExtraInfo)
    await target.add_cdp_listener("Network.responseReceived", responseReceived)
    #await target.add_cdp_listener("Network.responseReceivedExtraInfo", responseReceivedExtraInfo)

    #await asyncio.sleep(1)

    url = "http://httpbin.org/get"
    print("driver.get", url)
    await driver.get(url)

    await asyncio.sleep(3)

    #print("driver.targets", await driver.targets)

    """
    print("hit enter to close")
    input()
    """

    await driver.close()


asyncio.run(main())
```

example output

```
driver.get http://httpbin.org/get
requestWillBeSent http://httpbin.org/get
targetInfoChanged
requestWillBeSent http://httpbin.org/favicon.ico
responseReceived 200 http://httpbin.org/get application/json '{\n "args": {}, \n "'...
responseReceived 404 http://httpbin.org/favicon.ico text/html '
```

milahu commented 5 months ago

Please unblock challenges.cloudflare.com to proceed.

this error appears when Fetch.fulfillRequest has no response headers

fix:

    async def requestPaused(args):
        # ...
        body = base64.b64encode(body).decode("ascii")
        _args = {
            "requestId": args["requestId"],
            "responseCode": args["responseStatusCode"],
            # fix: Please unblock challenges.cloudflare.com to proceed.
            "responseHeaders": args["responseHeaders"],
            "body": body,
        }
        if args["responseStatusText"] != "":
            # empty string throws "Invalid http status code or phrase"
            _args["responsePhrase"] = args["responseStatusText"]
        await target.execute_cdp_cmd("Fetch.fulfillRequest", _args)
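
The same fix as a standalone helper, which makes it easy to check without a browser. This is only a sketch: `build_fulfill_args` is a hypothetical name, and the event fields are the ones used in the snippet above (`requestId`, `responseStatusCode`, `responseHeaders`, `responseStatusText`).

```python
import base64


def build_fulfill_args(paused_event, body):
    """Build Fetch.fulfillRequest params from a response-stage requestPaused event."""
    args = {
        "requestId": paused_event["requestId"],
        "responseCode": paused_event["responseStatusCode"],
        # Forward the original headers; leaving them out triggered the Cloudflare error.
        "responseHeaders": paused_event.get("responseHeaders", []),
        # The body has to be a base64 string when sent over JSON.
        "body": base64.b64encode(body).decode("ascii"),
    }
    phrase = paused_event.get("responseStatusText", "")
    if phrase:
        # An empty phrase is rejected with "Invalid http status code or phrase".
        args["responsePhrase"] = phrase
    return args
```

The result would then be passed to `target.execute_cdp_cmd("Fetch.fulfillRequest", ...)` as in the snippet above.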

passive capturing works as well with Fetch.enable or Network.setRequestInterception as long you don't modify the body

im looking for a generic solution, based on streams, so i can handle infinite-size responses without storing the whole response in RAM, and so i can handle streams of events with low latency
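
One CDP route that might fit the streaming requirement is `Fetch.takeResponseBodyAsStream` together with `IO.read` and `IO.close`. A minimal sketch, assuming a request paused at the response stage: `exec_cdp` is a hypothetical async callable `(method, params) -> dict` standing in for whatever the driver provides (for example a thin wrapper around `target.execute_cdp_cmd`), and this has not been tested against Selenium-Driverless. Per the CDP docs, once the body is taken as a stream the request can no longer be continued as-is; it has to be cancelled or fulfilled with a body.

```python
import base64


async def iter_response_chunks(exec_cdp, request_id, chunk_size=64 * 1024):
    """Yield the response body of a paused request chunk by chunk."""
    res = await exec_cdp("Fetch.takeResponseBodyAsStream", {"requestId": request_id})
    handle = res["stream"]
    try:
        while True:
            chunk = await exec_cdp("IO.read", {"handle": handle, "size": chunk_size})
            data = chunk["data"]
            if data:
                # IO.read marks binary chunks with base64Encoded=True.
                if chunk.get("base64Encoded"):
                    yield base64.b64decode(data)
                else:
                    yield data.encode("utf-8")
            if chunk["eof"]:
                break
    finally:
        # Release the stream handle once reading stops.
        await exec_cdp("IO.close", {"handle": handle})
```

Each chunk still arrives base64-encoded inside a JSON message, so this bounds memory use but does not remove the encoding overhead.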

see also https://github.com/milahu/aiohttp_chromium/tree/main/test/stream-response

feel free to copy/paste/modify these scripts to Selenium-Driverless/examples/

kaliiiiiiiiii commented 5 months ago

see also https://github.com/milahu/aiohttp_chromium/tree/main/test/stream-response

feel free to copy/paste/modify these scripts to Selenium-Driverless/examples/

ah yep, thanks. Might be nice if you can keep it up long-term somewhere in your repo for reference

broken: Network.enable and Network.streamResourceContent and Network.dataReceived - this is broken in chromium 117, because data is always empty.

ah heck, well then Network usage should probably be avoided as it's deprecated and more stuff might break in future chrome versions

Please unblock challenges.cloudflare.com to proceed.

this error appears when Fetch.fulfillRequest has no response headers

    async def requestPaused(args):
        # ...
        body = base64.b64encode(body).decode("ascii")
        _args = {
            "requestId": args["requestId"],
            "responseCode": args["responseStatusCode"],
            # fix: Please unblock challenges.cloudflare.com to proceed.
            "responseHeaders": args["responseHeaders"],
            "body": body,
        }
        if args["responseStatusText"] != "":
            # empty string throws "Invalid http status code or phrase"
            _args["responsePhrase"] = args["responseStatusText"]
        await target.execute_cdp_cmd("Fetch.fulfillRequest", _args)

Uh nice that we've finally got it working! Great job! Wonder, is there any way to optimize base64.b64encode(body).decode("ascii") even more btw?

And also, are we sure that Fetch.enable intercepts as well:

  1. WebWorkers & service-workers
  2. cross-origin/OOPIF iframes?
  3. background scripts in extensions.

I remember there being Network.setBypassServiceWorker, however no idea if it affects Fetch.enable as well.

If some still don't get intercepted, maybe target-interception might be worth considering, see https://github.com/kaliiiiiiiiii/Selenium-Driverless/blob/4b71a5ab59a193d41eab80ed8f68a66e8ad5c230/tests/target_interception.py . I'm however not sure how reliable it is and how bad the timing leaks are.

milahu commented 5 months ago

then Network usage should probably be avoided as it's deprecated and more stuff might break in future chrome versions

Network.streamResourceContent and Network.dataReceived are not deprecated, but experimental so i expect them to work in newer versions

is there any way to optimize base64.b64encode(body).decode("ascii")

im afraid no... i also would prefer a binary protocol, no base64, no json

base64 is needed for Fetch.fulfillRequest

body: string: A response body. If absent, original response body will be used if the request is intercepted at the response stage and empty body will be used if the request is intercepted at the request stage. (Encoded as a base64 string when passed over JSON)

when i pass the body as bytes i get

TypeError: Object of type bytes is not JSON serializable
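
To illustrate why the base64 step can't be skipped: the CDP messages are JSON, and Python's `json` module refuses raw bytes. A small sketch using only the standard library (the `{"body": ...}` dict is just an illustration, not an actual driverless payload):

```python
import base64
import json

body = b"\x89PNG\r\n\x1a\n"  # arbitrary binary response body

try:
    json.dumps({"body": body})
    raw_error = None
except TypeError as e:
    # same error as quoted above
    raw_error = str(e)

# base64 turns the bytes into ASCII text that survives JSON
encoded = base64.b64encode(body).decode("ascii")
message = json.dumps({"body": encoded})

# the receiving side can restore the original bytes
decoded = base64.b64decode(json.loads(message)["body"])
```

So the encode/decode round trip is the cost of a JSON transport, roughly 4 bytes of text for every 3 bytes of body.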

per CDP docs, the only non-JSON endpoint is

WebSocket /devtools/page/{targetId} The WebSocket endpoint for the protocol.

are we sure that Fetch.enable intercepts as well

no idea, i dont need these targets

in Fetch.requestPaused.py im calling

    target = await driver.current_target
    # ...
    await target.execute_cdp_cmd("Fetch.enable", args)
    await target.add_cdp_listener("Fetch.requestPaused", requestPaused)

but this also works with

    await driver.execute_cdp_cmd("Fetch.enable", args)
    await driver.add_cdp_listener("Fetch.requestPaused", requestPaused)

then requestPaused should be called for all targets
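
Both variants can be wrapped in one function. A minimal sketch, assuming only that the object passed in has the `execute_cdp_cmd` and `add_cdp_listener` coroutines used in this thread; `enable_response_interception` is a hypothetical name, and whether the driver-level call really covers all targets is not verified here.

```python
async def enable_response_interception(conn, on_paused, url_pattern="*"):
    """Enable Fetch interception at the response stage on `conn` (a target or the driver)."""
    args = {"patterns": [{"requestStage": "Response", "urlPattern": url_pattern}]}
    # pause every matching request once the response headers are received
    await conn.execute_cdp_cmd("Fetch.enable", args)
    # call on_paused(event) for each paused request
    await conn.add_cdp_listener("Fetch.requestPaused", on_paused)
    return args
```

Usage would be `await enable_response_interception(await driver.current_target, requestPaused)` for a single tab, or `await enable_response_interception(driver, requestPaused)` for the driver-level variant.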

kaliiiiiiiiii commented 5 months ago

Also, I'm just thinking: if we can't stream the responses when intercepting the requests, there's technically a way to detect the timing (if the server responds in chunks), right?

And even if it would be possible, I suppose there could be a way to set up a server with specific chunk timing & size + detect that in JavaScript.

See http://scatter.cowchimp.com/ for a poc on scattering the chunk timing
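
To make the timing concern concrete, here is a toy sketch in plain Python (no browser involved). It assumes the interceptor buffers the whole body before calling `Fetch.fulfillRequest`, so the page receives everything at once. The timestamps are made-up illustration values, not measurements, and both function names are hypothetical.

```python
def inter_chunk_gaps(arrival_times):
    """Gaps in seconds between consecutive chunk arrivals."""
    return [round(b - a, 3) for a, b in zip(arrival_times, arrival_times[1:])]


def buffered_arrivals(arrival_times):
    """Arrival times as the page sees them when the whole body is delivered after the last chunk."""
    last = arrival_times[-1]
    return [last for _ in arrival_times]


# made-up server schedule: chunks sent at 0 s, 0.5 s, 0.6 s and 1.6 s
server_schedule = [0.0, 0.5, 0.6, 1.6]

# without interception the gap pattern chosen by the server survives
direct = inter_chunk_gaps(server_schedule)

# with buffered interception the pattern is flattened to zero gaps
intercepted = inter_chunk_gaps(buffered_arrivals(server_schedule))
```

A server that knows its own schedule could compare it with what JavaScript observes, and the flattened pattern would stand out.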

milahu commented 5 months ago

aah, now i understand your question

are we sure that Fetch.enable intercepts as well

so ideally, all targets should be intercepted to add the same latency to all requests

practically, i would avoid this premature optimization because different latencies can have legitimate reasons like different cpu loads on different cpu cores

maybe put this on a todo list / future work list / debug ideas list in case cloudflare blocking becomes more aggressive

kaliiiiiiiiii commented 5 months ago

    target = await driver.current_target
    # ...
    await target.execute_cdp_cmd("Fetch.

yeah ofc - as this executes CDP on the same target.

I'm not sure if/how driver.base_target behaves tbh. I could imagine that service-worker requests are only covered by base_target. At least for target interception, this is the case.

milahu commented 5 months ago

Network.streamResourceContent and Network.dataReceived are not deprecated, but experimental so i expect them to work in newer versions

bad news: this also fails with chromium 120

maybe this is a bug in selenium_driverless? tomorrow i will port Network.dataReceived.py to selenium. i would be surprised if this is a chromium bug

kaliiiiiiiiii commented 5 months ago

Network.streamResourceContent and Network.dataReceived are not deprecated, but experimental so i expect them to work in newer versions

bad news: this also fails with chromium 120

maybe this is a bug in selenium_driverless? tomorrow i will port Network.dataReceived.py to selenium. i would be surprised if this is a chromium bug

mhh maybe try with bare CDP-socket. Wouldn't know why driverless could break this. Unless it's some chrome flag which gets applied by default

milahu commented 5 months ago

tomorrow i will port Network.dataReceived.py to selenium

not possible, because chromedriver does not support the Network.streamResourceContent command

so there is no

    await session.execute(devtools.network.stream_resource_content(request_id))
    # or
    driver.execute("Network.streamResourceContent", {"requestId": request_id})

there is only network.take_response_body_for_interception_as_stream

    await session.execute(devtools.network.take_response_body_for_interception_as_stream(interception_id))

... but that requires an interception_id and there is still no IO.write so i cannot send the stream to chromium

see also Selenium 4: how add event listeners in CDP

CDP is broken by design?

i have the impression that this feature (reading and writing of streams) is deliberately not implemented by CDP

see also Fetch.fulfillRequest and (very) long body

Unfortunately, there's no streaming support for Fetch network interception at the moment

yeah, totally "unfortunately" and totally "at the moment"

no, i guess this is very deliberate sabotage, to prevent "abusing" chromium as a generic http client which is pretty much what we are trying to do here...

dynamic analysis

so... i really tried to avoid this part (because i have zero experience here) but i will have to use frida to insert hooks into the chromium binary

for now i gave up on intercepting requests... chrome seems to make it really hard, also to provide security against MITM attacks

probably i would try the frida route as i described in wkeeling/selenium-wire#656 (comment)

lets see what tomorrow will bring ; )

kaliiiiiiiiii commented 5 months ago

tomorrow i will port Network.dataReceived.py to selenium

not possible, because chromedriver does not support the Network.streamResourceContent command

so there is no

    await session.execute(devtools.network.stream_resource_content(request_id))
    # or
    driver.execute("Network.streamResourceContent", {"requestId": request_id})

there is only network.take_response_body_for_interception_as_stream

    await session.execute(devtools.network.take_response_body_for_interception_as_stream(interception_id))

... but that requires an interception_id and there is still no IO.write so i cannot send the stream to chromium

see also Selenium 4: how add event listeners in CDP

CDP is broken by design?

i have the impression that this feature (reading and writing of streams) is deliberately not implemented by CDP

Yeah that might indeed be the case. Also due to security reasons, such as streaming all stuff encrypted through a proxy.

see also Fetch.fulfillRequest and (very) long body

Unfortunately, there's no streaming support for Fetch network interception at the moment

yeah, totally "unfortunately" and totally "at the moment"

no, i guess this is very deliberate sabotage, to prevent "abusing" chromium as a generic http client which is pretty much what we are trying to do here...

yeah, I guess so

dynamic analysis

so... i really tried to avoid this part (because i have zero experience here) but i will have to use frida to insert hooks into the chromium binary

for now i gave up on intercepting requests... chrome seems to make it really hard, also to provide security against MITM attacks. probably i would try the frida route as i described in wkeeling/selenium-wire#656 (comment)

lets see what tomorrow will bring ; )

well, have fun haha 👀 gonna be a pain. Pretty sure Chrome has stuff implemented against that.

kaliiiiiiiiii commented 5 months ago

not resolved yet lol

milahu commented 5 months ago

well... the original issue is fixed by sending responseHeaders

currently i dont have time to implement reading and writing of streams. also i guess this is out-of-scope for selenium_driverless because this is not possible with CDP

kaliiiiiiiiii commented 5 months ago

well... the original issue is fixed by sending responseHeaders

currently i dont have time to implement reading and writing of streams. also i guess this is out-of-scope for selenium_driverless because this is not possible with CDP

Hmm does https://bugs.chromium.org/p/chromium/issues/detail?id=1138839 still apply tho? Also, I'm not that sure if all headers have the correct order tbh

Maybe using binaryResponseHeaders for continuing the request would be more safe?

kaliiiiiiiiii commented 5 months ago

Probably happens somewhere at https://source.chromium.org/chromium/chromium/src/+/main:third_party/blink/renderer/core/inspector/inspector_emulation_agent.cc;l=514;drc=d3d4ff28768842dd1ce94f408f89d1e2d31dd4fd

kaliiiiiiiiii commented 4 months ago

@milahu

probably i would try the frida route

Maybe https://github.com/tomer8007/chromium-ipc-sniffer could be worth considering 👀 The screenshot below is 4 years old, some stuff might have changed ofc.

milahu commented 4 months ago

i would be surprised if that works. the raw HTTP traffic is hidden for better security

However, this project won't see anything that doesn't go over pipes, which is mostly shared memory IPC:

  • Mojo data pipe contents (raw networking buffers, audio, etc.)

... so the raw HTTP traffic is in shared memory

the most promising method is running chromium in a debugger, either gdb or lldb, but i have to disable sandboxing to set breakpoints on BIO_read and BIO_write

radare is too slow, frida fails to hook the functions

gdb works, but parsing its output is slow, and gdb in python is kinda broken

lldb would be better for interfacing with python (or native code), but its kinda broken...

see also chromium-capture-http

but all these are workarounds and a proper fix would be to implement full http stream support to fix either Fetch.requestPaused.py or Network.dataReceived.py

effectively, this would allow inserting an http proxy with full control over request and response streams

its surprising that such a basic feature is missing

there is Fetch.takeResponseBodyAsStream and IO.read but not Fetch.giveResponseBodyAsStream and IO.write

there is Network.takeResponseBodyForInterceptionAsStream and IO.read but not Network.giveResponseBodyForInterceptionAsStream and IO.write
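
The read half that does exist can be sketched like this. It is a minimal sketch, assuming `target.execute_cdp_cmd(name, args)` returns the CDP result dict as in the snippets in this thread; `read_body_stream` is a hypothetical name, and this was not tested against a real browser. It uses the CDP commands `Fetch.takeResponseBodyAsStream`, `IO.read` and `IO.close`.

```python
import base64


async def read_body_stream(target, request_id, chunk_size=65536):
    """Yield the body of a request paused at the response stage, chunk by chunk."""
    res = await target.execute_cdp_cmd(
        "Fetch.takeResponseBodyAsStream", {"requestId": request_id}
    )
    handle = res["stream"]
    try:
        while True:
            chunk = await target.execute_cdp_cmd(
                "IO.read", {"handle": handle, "size": chunk_size}
            )
            data = chunk["data"]
            # IO.read flags binary chunks with base64Encoded
            if chunk.get("base64Encoded"):
                yield base64.b64decode(data)
            else:
                yield data.encode("utf-8")
            if chunk["eof"]:
                break
    finally:
        # always release the stream handle
        await target.execute_cdp_cmd("IO.close", {"handle": handle})
```

Because there is no write counterpart, the chunks still have to be joined and sent back in one piece with `Fetch.fulfillRequest`, so this saves nothing on RAM for the response that goes back to the page.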

currently this has zero priority for me, i just dont need it

kaliiiiiiiiii commented 2 months ago

will be fixed with https://github.com/kaliiiiiiiiii/Selenium-Driverless/blob/dev/src/selenium_driverless/scripts/network_interceptor.py

I'll close this issue when it's released & the documentation is complete

kaliiiiiiiiii commented 2 months ago

resolved with https://kaliiiiiiiiii.github.io/Selenium-Driverless/api/RequestInterception/

milahu commented 1 month ago

a proper fix would be to implement full http stream support

nothing new from google https://issues.chromium.org/issues/332570739

just another feature request which would be easy to implement, but is ignored as "low priority"