Closed milahu closed 2 months ago
I can confirm this. However, I suspect this is a timing leak, with Cloudflare sending a 403 back as a result => not really a way to fix.
@milahu or any other thoughts/ideas on that?
The problem is that one of Cloudflare's engineers is watching this repository... :)
@juhacz Likely, yes.
Why not hire me directly instead of needing someone to analyse & understand the code on here ? :)
@kaliiiiiiiiii Because we need people like you more :) I suggest creating a profile at https://www.buymeacoffee.com/ I think people will confirm my words :)
I suspect this to be a timing leak
you mean the python response handler is too slow?
or maybe the continueResponse/fulfillRequest logic has a bug (note: continueResponse is experimental)
but yeah, it seems to be a new problem with the error message "Please unblock challenges.cloudflare.com to proceed." i only find a tapatalk.com thread from 2023-10-30 with no solution
any other thoughts//ideas on that?
so far i used the "export HAR" function of chrome devtools network but that is slower than capturing the live traffic
the exported HAR file does not include the bodies of binary responses, which is actually good for large binaries: i dont want to store a 1GB response body in RAM, but let chrome write it to the filesystem
chromium is open source, so it should be easy to find how the "record network log" command works
an alternative would be a local http proxy
i guess Fetch.enable also works with a http proxy inside of chrome, and maybe that proxy is visible to cloudflare
in the long term, they will replace captchas with government ID logins and to bypass that, we will need p2p scraping tools...
@juhacz added:) https://github.com/kaliiiiiiiiii#support-me
chromium is open source, so it should be easy to find how the "record network log" command works
or simply: Tracing.start
@milahu
you mean the python response handler is too slow?
yep, or maybe even the interception in Chromium's C++ is too slow over a single websocket.
or maybe the continueResponse/fulfillRequest logic has a bug (note: continueResponse is experimental)
Yep there for sure are some bugs. What I as well could think of is that maybe some iframes don't get intercepted correctly, and therefore have a detectable difference to the main frame.
so far i used the "export HAR" function of chrome devtools network but that is slower than capturing the live traffic
Yep that works as well of course, however more a workaround:)
an alternative would be a local http proxy. i guess Fetch.enable also works with a http proxy inside of chrome and maybe that proxy is visible to cloudflare
See 1.
I assumed chrome intercepts directly between frames | boringssl and doesn't tunnel it through a proxy after boringssl.
Maybe we can find some source-code on that?
another thing to try is Network.setRequestInterception (deprecated tho). So feel free to share a POC & status if you try that
Yep there for sure are some bugs. What I as well could think of is that maybe some iframes don't get intercepted correctly, and therefore have a detectable difference to the main frame.
That would then explain why disabling site isolation works
interception
for my use case, i dont need any active interception of requests/responses i just need a passive live-stream of http traffic
so i will use Tracing.start
edit: no. the Tracing.dataCollected events are only sent after Tracing.end, and the Tracing.dataCollected events dont contain http traffic 0__o
i still dont understand how devtools network log gets the live network traffic
the network log uses Tracing.start only to get the trace categories "-*,disabled-by-default-devtools.screenshot"
an alternative would be a local http proxy
selenium-wire uses a patched version of mitmproxy as http proxy
this also allows for active network interception without chromium --disable-web-security, because we can tell chromium to trust the proxy's certificate
still pretty sure the SSL/TLS fingerprint doesn't match chrome's, as it doesn't use boringssl tho. see https://github.com/wkeeling/selenium-wire/issues/215#issuecomment-794362654
Interesting note here that:
from cdp_socket.utils.utils import launch_chrome, random_port
from cdp_socket.socket import CDPSocket
import os
import asyncio

sock1 = None  # set in main(), used by the event handler


async def on_resumed(params):
    # continue every paused request unmodified, then log its URL
    await sock1.exec("Fetch.continueRequest", {"requestId": params['requestId']})
    print(params["request"]["url"])


async def main():
    global sock1
    PORT = random_port()
    process = launch_chrome(PORT)
    async with CDPSocket(PORT) as base_socket:
        targets = await base_socket.targets
        target = targets[0]
        sock1 = await base_socket.get_socket(target)
        await sock1.exec("Network.clearBrowserCookies")
        await sock1.exec("Fetch.enable")
        sock1.add_listener("Fetch.requestPaused", on_resumed)
        await sock1.exec("Page.navigate", {"url": "https://nowsecure.nl#relax"})
        await asyncio.sleep(5)
    # SIGTERM the browser once the socket is closed
    os.kill(process.pid, 15)


asyncio.run(main())
works just fine
this works for requests, but not for responses, because Fetch.getResponseBody always throws CDPError -32000
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://nowsecure.nl/
waiting after Page.navigate
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://nowsecure.nl/cdn-cgi/styles/challenges.css
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://nowsecure.nl/cdn-cgi/challenge-platform/h/g/orchestr...
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://challenges.cloudflare.com/turnstile/v0/g/74bd6362/ap...
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://nowsecure.nl/favicon.ico
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://nowsecure.nl/cdn-cgi/challenge-platform/h/g/flow/ov1...
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://nowsecure.nl/favicon.ico
Fetch.getResponseBody CDPError -32000 -> Fetch.continueResponse https://challenges.cloudflare.com/cdn-cgi/challenge-platform...
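One possible explanation for the log above, offered as an assumption and not something confirmed at this point in the thread: the script calls Fetch.enable without patterns, so requests are paused at the request stage, while per the CDP docs Fetch.getResponseBody is only valid for requests paused at the response stage. A minimal sketch of the two pieces involved (both function names are hypothetical):

```python
def fetch_enable_params(url_pattern="*"):
    """Params for Fetch.enable that pause matching requests at the response stage."""
    return {"patterns": [{"urlPattern": url_pattern, "requestStage": "Response"}]}


def is_response_stage(event):
    """True if a Fetch.requestPaused event was paused at the response stage.

    Per the CDP docs, the response stage is indicated by the presence of
    responseStatusCode or responseErrorReason in the event.
    """
    return "responseStatusCode" in event or "responseErrorReason" in event
```

A handler could call Fetch.getResponseBody only when `is_response_stage(params)` is true and fall back to Fetch.continueRequest otherwise.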
chrome://net-export/
could be useful for passive capturing of traffic
Click the button to start logging future network activity to a file on disk. The log includes details of network activity from all of Chrome, including incognito and non-incognito tabs, visited URLs, and information about the network configuration
via chrome://net-internals/
Looks like Network.setRequestInterception has the same issues. Wonder tho why it's flagged as "Insecure", even though the request is over HTTPS
wonder tho why it's flagged as "Insecure", even though the request is over HTTPS
i guess it uses a local https proxy with a self-signed certificate, without adding that certificate as "trusted cert" to ~/.pki/nssdb/
but still, this fails to bypass cloudflare
Please unblock challenges.cloudflare.com to proceed.
Also interesting here, that local overrides with the chrome devtools just work fine:
i guess it uses a local https proxy with a self-signed certificate without adding that certificate as "trusted cert" to ~/.pki/nssdb/
ahh yep, that makes sense
but still, this fails to bypass cloudflare
maybe there's a way to detect self-signed certificate usage? If not, it's probably timing or SSL/TLS fingerprinting I guess
I see 2 possible approaches here:
for now i gave up on intercepting requests... chrome seems to make it really hard, also to provide security against MITM attacks
probably i would try the frida route as i described in https://github.com/wkeeling/selenium-wire/issues/656#issuecomment-1848393185
Well yeah, even though I assume that the memory manipulation/DLL hooking solutions are specific to:
chrome://net-export/ could be useful for passive capturing of traffic
Click the button to start logging future network activity to a file on disk. The log includes details of network activity from all of Chrome, including incognito and non-incognito tabs, visited URLs, and information about the network configuration
via chrome://net-internals/
Uhh I think passive capturing works as well with Fetch.enable or Network.setRequestInterception as long as you don't modify the body btw
Even changing request headers works just fine
import asyncio

from selenium_driverless import webdriver


async def on_request(params, global_conn):
    url = params["request"]["url"]
    _params = {"interceptionId": params['interceptionId']}
    if params.get('responseStatusCode') in [301, 302, 303, 307, 308]:
        # redirected request
        return await global_conn.execute_cdp_cmd("Network.continueInterceptedRequest", _params)
    else:
        # continue the request with one extra request header
        fulfill_params = {"headers": params["request"]["headers"]}
        fulfill_params["headers"]["test"] = "Hello World!"
        fulfill_params.update(_params)
        await global_conn.execute_cdp_cmd("Network.continueInterceptedRequest", fulfill_params)
        print(url)


async def main():
    options = webdriver.ChromeOptions()
    async with webdriver.Chrome(options=options, max_ws_size=2 ** 30) as driver:
        driver.base_target.socket.on_closed.append(lambda code, reason: print("chrome exited"))
        global_conn = driver.current_target
        await driver.get("about:blank")
        await global_conn.execute_cdp_cmd("Network.enable", {
            "maxTotalBufferSize": 1_000_000,  # bytes; 1_000_000 is 1 MB, not 1 GB
            "maxResourceBufferSize": 1_000_000,
            "maxPostDataSize": 1_000_000,
        })
        await global_conn.execute_cdp_cmd("Network.setRequestInterception", {
            "patterns": [{
                "urlPattern": "*",
                # "interceptionStage": "HeadersReceived"
            }]
        })
        await global_conn.add_cdp_listener("Network.requestIntercepted", lambda data: on_request(data, global_conn))
        await driver.get('https://nowsecure.nl', timeout=60, wait_load=False)
        # keep the browser open so the challenge can run
        while True:
            await asyncio.sleep(10)


asyncio.run(main())
print(url)
and where is the response body?
Please unblock challenges.cloudflare.com to proceed.
this error appears when Fetch.fulfillRequest has no response headers
fix:
async def requestPaused(args):
    # ...
    body = base64.b64encode(body).decode("ascii")
    _args = {
        "requestId": args["requestId"],
        "responseCode": args["responseStatusCode"],
        # fix: Please unblock challenges.cloudflare.com to proceed.
        "responseHeaders": args["responseHeaders"],
        "body": body,
    }
    if args["responseStatusText"] != "":
        # empty string throws "Invalid http status code or phrase"
        _args["responsePhrase"] = args["responseStatusText"]
    await target.execute_cdp_cmd("Fetch.fulfillRequest", _args)
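The snippet above elides how `body` is obtained. As a self-contained sketch of the same fix, the argument-building step can be pulled into a pure function that is testable without a browser. The name `build_fulfill_args` is hypothetical, and `event` is assumed to be the params dict of a response-stage Fetch.requestPaused event:

```python
import base64


def build_fulfill_args(event, body):
    """Build Fetch.fulfillRequest params from a Fetch.requestPaused event.

    `body` is the (possibly modified) response body as bytes.
    """
    args = {
        "requestId": event["requestId"],
        "responseCode": event["responseStatusCode"],
        # pass the original response headers through unchanged
        "responseHeaders": event.get("responseHeaders", []),
        # CDP expects the body base64-encoded when sent over JSON
        "body": base64.b64encode(body).decode("ascii"),
    }
    phrase = event.get("responseStatusText", "")
    if phrase:
        # only send a non-empty phrase; an empty one is rejected
        args["responsePhrase"] = phrase
    return args
```

The result would then be passed to `execute_cdp_cmd("Fetch.fulfillRequest", ...)` exactly as in the snippet above.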
passive capturing works as well with Fetch.enable or Network.setRequestInterception as long you don't modify the body
im looking for a generic solution, based on streams, so i can handle infinite-size responses without storing the whole response in RAM, and so i can handle streams of events with low latency
interceptionId
see also https://github.com/milahu/aiohttp_chromium/tree/main/test/stream-response
feel free to copy/paste/modify these scripts to Selenium-Driverless/examples/
ah yep, thanks. Might be nice if you can keep it up long-term somewhere in your repo for reference
broken: Network.enable and Network.streamResourceContent and Network.dataReceived - this is broken in chromium 117, because data is always empty.
ah heck, well then Network usage should probably be avoided, as it's deprecated and more stuff might break in future chrome versions
Please unblock challenges.cloudflare.com to proceed.
this error appears when Fetch.fulfillRequest has no response headers
Uh nice that we've finally got it working! Great job!
Wonder, is there any way to optimize base64.b64encode(body).decode("ascii") even more btw?
And also, are we sure that Fetch.enable intercepts as well:
I remember there being Network.setBypassServiceWorker, however no idea if it affects Fetch.enable as well.
If some still don't get intercepted, maybe target-interception might be considerable, see https://github.com/kaliiiiiiiiii/Selenium-Driverless/blob/4b71a5ab59a193d41eab80ed8f68a66e8ad5c230/tests/target_interception.py . I'm however not sure how reliable it is and how bad the timing leaks are.
then Network usage should probably be avoided as it's deprecated and more stuff might break in future chrome versions
Network.streamResourceContent and Network.dataReceived are not deprecated, but experimental so i expect them to work in newer versions
is there any way to optimize base64.b64encode(body).decode("ascii")
im afraid no... i also would prefer a binary protocol, no base64, no json
base64 is needed for Fetch.fulfillRequest
body: string: A response body. If absent, original response body will be used if the request is intercepted at the response stage and empty body will be used if the request is intercepted at the request stage. (Encoded as a base64 string when passed over JSON)
when i pass the body as bytes
i get
TypeError: Object of type bytes is not JSON serializable
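A small standalone reproduction of that error and of the base64 round trip, using only the standard library (the payload bytes are arbitrary and chosen for illustration):

```python
import base64
import json

body = b"\x89PNG\r\n\x1a\n"  # arbitrary binary payload

# raw bytes cannot be placed in a JSON message
try:
    json.dumps({"body": body})
    error = None
except TypeError as exc:
    error = str(exc)

# base64 turns the bytes into an ASCII string that JSON accepts
encoded = base64.b64encode(body).decode("ascii")
message = json.dumps({"body": encoded})
```

The cost is the usual base64 overhead of roughly one third more data on the wire, on top of the encode and decode time.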
per CDP docs, the only non-JSON endpoint is
WebSocket /devtools/page/{targetId}
The WebSocket endpoint for the protocol.
are we sure that Fetch.enable intercepts as well
no idea, i dont need these targets
in Fetch.requestPaused.py im calling
target = await driver.current_target
# ...
await target.execute_cdp_cmd("Fetch.enable", args)
await target.add_cdp_listener("Fetch.requestPaused", requestPaused)
but this also works with
await driver.execute_cdp_cmd("Fetch.enable", args)
await driver.add_cdp_listener("Fetch.requestPaused", requestPaused)
then requestPaused should be called for all targets
Also, I'm just thinking: if we can't stream the responses when intercepting the requests, there's technically a way to detect the timing (if the server responds in chunks), right?
And even if it were possible, I suppose there could be a way to set up a server with specific chunk timing & size + detect that in JavaScript.
See http://scatter.cowchimp.com/ for a poc on scattering the chunk timing
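To make the concern concrete, here is a toy model, not a measurement of real Chromium or Cloudflare behaviour. It assumes the interceptor buffers the whole body and hands it to the page only after the last chunk has arrived, so the page sees every chunk at the same instant. Both function names are hypothetical:

```python
def inter_chunk_gaps(arrival_times):
    """Gaps in seconds between consecutive chunk arrivals."""
    return [b - a for a, b in zip(arrival_times, arrival_times[1:])]


def buffered_arrival_times(arrival_times):
    """Arrival times as seen by the page when the full body is buffered
    and delivered in one piece after the last chunk has arrived."""
    if not arrival_times:
        return []
    return [arrival_times[-1]] * len(arrival_times)
```

With chunks sent at t = 0, 1 and 3 seconds, a streaming client observes gaps of 1 s and 2 s, while the buffered view collapses both gaps to 0. A page script that knows the server's chunk schedule could in principle compare the two.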
aah, now i understand your question
are we sure that Fetch.enable intercepts as well
so ideally, all targets should be intercepted to add the same latency to all requests
practically, i would avoid this premature optimization because different latencies can have legitimate reasons like different cpu loads on different cpu cores
maybe put this on a todo list / future work list / debug ideas list in case cloudflare blocking becomes more aggressive
target = await driver.current_target # ... await target.execute_cdp_cmd("Fetch.
yeah ofc - as this executes cdp on the same target.
I'm not sure if/how driver.base_target behaves tbh. I could imagine that service-worker requests are only covered by base_target. At least for target interception, this is the case.
Network.streamResourceContent and Network.dataReceived are not deprecated, but experimental so i expect them to work in newer versions
bad news: this also fails with chromium 120
maybe this is a bug in selenium_driverless?
tomorrow i will port Network.dataReceived.py to selenium
i would be surprised if this is a chromium bug
mhh maybe try with bare CDP-socket. Wouldn't know why driverless could break this. Unless it's some chrome flag which gets applied by default
tomorrow i will port Network.dataReceived.py to selenium
not possible, because chromedriver does not support the Network.streamResourceContent command
so there is no
await session.execute(devtools.network.stream_resource_content(request_id))
# or
driver.execute("Network.streamResourceContent", {"requestId": request_id})
there is only network.take_response_body_for_interception_as_stream
await session.execute(devtools.network.take_response_body_for_interception_as_stream(interception_id))
... but that requires an interception_id
and there is still no IO.write, so i cannot send the stream to chromium
see also Selenium 4: how add event listeners in CDP
i have the impression that this feature (reading and writing of streams) is deliberately not implemented by CDP
see also Fetch.fulfillRequest and (very) long body
Unfortunately, there's no streaming support for Fetch network interception at the moment
yeah, totally "unfortunately" and totally "at the moment"
no, i guess this is very deliberate sabotage, to prevent "abusing" chromium as a generic http client which is pretty much what we are trying to do here...
so... i really tried to avoid this part (because i have zero experience here) but i will have to use frida to insert hooks into the chromium binary
for now i gave up on intercepting requests... chrome seems to make it really hard, also to provide security against MITM attacks
probably i would try the frida route as i described in wkeeling/selenium-wire#656 (comment)
lets see what tomorrow will bring ; )
CDP is broken by design?
i have the impression that this feature (reading and writing of streams) is deliberately not implemented by CDP
Yeah that might indeed be the case. Also for security reasons, such as streaming everything encrypted through a proxy.
see also Fetch.fulfillRequest and (very) long body
Unfortunately, there's no streaming support for Fetch network interception at the moment
yeah, totally "unfortunately" and totally "at the moment"
no, i guess this is very deliberate sabotage, to prevent "abusing" chromium as a generic http client which is pretty much what we are trying to do here...
yeah, I guess so
dynamic analysis
so... i really tried to avoid this part (because i have zero experience here) but i will have to use frida to insert hooks into the chromium binary
for now i gave up on intercepting requests... chrome seems to make it really hard, also to provide security against MITM attacks
probably i would try the frida route as i described in wkeeling/selenium-wire#656 (comment)
lets see what tomorrow will bring ; )
well have fun haha👀 gonna be a pain. Pretty sure Chrome has stuff implemented against that.
not resolved yet lol
well... the original issue is fixed by sending responseHeaders
currently i dont have time to implement reading and writing of streams
also i guess this is out-of-scope for selenium_driverless
because this is not possible with CDP
Hmm does https://bugs.chromium.org/p/chromium/issues/detail?id=1138839 still apply tho? Also, I'm not that sure if all headers have the correct order tbh
Maybe using binaryResponseHeaders for continuing the request would be safer?
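For reference, the CDP docs describe binaryResponseHeaders as a \0-separated series of "name: value" pairs, base64-encoded when passed over JSON. A minimal sketch of that encoding follows. The function name is hypothetical, the latin-1 encoding is an assumption, and whether this actually preserves header order better than responseHeaders is not established in this thread:

```python
import base64


def encode_binary_response_headers(headers):
    """Encode CDP HeaderEntry dicts ({"name": ..., "value": ...}) as
    NUL-separated "name: value" pairs, base64-encoded, keeping list order."""
    raw = b"\0".join(
        f"{h['name']}: {h['value']}".encode("latin-1")  # encoding is an assumption
        for h in headers
    )
    return base64.b64encode(raw).decode("ascii")
```

The returned string would go into the binaryResponseHeaders field of Fetch.continueResponse or Fetch.fulfillRequest in place of responseHeaders.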
@milahu
probably i would try the frida route
Maybe https://github.com/tomer8007/chromium-ipc-sniffer could be worth considering👀 the screenshot below is 4 years old, some stuff might have changed ofc.
i would be surprised if that works. the raw HTTP traffic is hidden for better security
However, this project won't see anything that doesn't go over pipes, which is mostly shared memory IPC:
- Mojo data pipe contents (raw networking buffers, audio, etc.)
... so the raw HTTP traffic is in shared memory
the most promising method is running chromium in a debugger, either gdb or lldb, but i have to disable sandboxing to set breakpoints on BIO_read and BIO_write
radare is too slow, frida fails to hook the functions
gdb works, but parsing its output is slow, and gdb in python is kinda broken
lldb would be better for interfacing with python (or native code), but its kinda broken...
see also chromium-capture-http
but all these are workarounds and a proper fix would be to implement full http stream support to fix either Fetch.requestPaused.py or Network.dataReceived.py
effectively, this would allow inserting an http proxy with full control over request and response streams
its surprising that such a basic feature is missing
there is Fetch.takeResponseBodyAsStream and IO.read but not Fetch.giveResponseBodyAsStream and IO.write
there is Network.takeResponseBodyForInterceptionAsStream and IO.read but not Network.giveResponseBodyForInterceptionAsStream and IO.write
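The read half that does exist can be sketched as follows. This is an illustration under assumptions, not code from the linked repositories: `send` is a hypothetical async callable `(method, params) -> result dict` wrapping whatever CDP client is in use, and `handle` is the `stream` value returned by Fetch.takeResponseBodyAsStream.

```python
import asyncio
import base64


async def read_cdp_stream(send, handle, chunk_size=65536):
    """Yield the bytes behind a CDP stream handle chunk by chunk via IO.read."""
    try:
        while True:
            result = await send("IO.read", {"handle": handle, "size": chunk_size})
            data = result["data"]
            if result.get("base64Encoded"):
                chunk = base64.b64decode(data)
            else:
                chunk = data.encode("utf-8")
            if chunk:
                yield chunk
            if result.get("eof"):
                break
    finally:
        # release the stream on the browser side
        await send("IO.close", {"handle": handle})
```

With selenium_driverless, `send` could presumably be `lambda method, params: target.execute_cdp_cmd(method, params)`. Note that, per the CDP docs, a request whose body was taken as a stream can no longer be continued as is, and there is no matching write call to feed a stream back, which is the gap described above.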
currently this has zero priority for me, i just dont need it
will be fixed with https://github.com/kaliiiiiiiiii/Selenium-Driverless/blob/dev/src/selenium_driverless/scripts/network_interceptor.py
I'll close this issue when it's released & the documentation is complete
a proper fix would be to implement full http stream support
nothing new from google https://issues.chromium.org/issues/332570739
just another feature request which would be easy to implement, but is ignored as "low priority"
im trying to capture all responses as described in readme#use-events
cloudflare says
chrome shows a warning in the address bar
fixed by adding options.add_argument("--disable-web-security") to not enforce the same-origin policy
test_selenium_driverless.py
```py
#!/usr/bin/env python3
import asyncio
import base64
import sys
import time
import traceback

from cdp_socket.exceptions import CDPError
from selenium_driverless import webdriver


async def on_request(params, global_conn):
    url = params["request"]["url"]
    _params = {"requestId": params['requestId']}
    if params.get('responseStatusCode') in [301, 302, 303, 307, 308]:
        # redirected request
        return await global_conn.execute_cdp_cmd("Fetch.continueResponse", _params)
    else:
        try:
            body = await global_conn.execute_cdp_cmd("Fetch.getResponseBody", _params, timeout=1)
        except CDPError as e:
            if e.code == -32000 and e.message == 'Can only get response body on requests captured after headers received.':
                print(params, "\n", file=sys.stderr)
                traceback.print_exc()
                await global_conn.execute_cdp_cmd("Fetch.continueResponse", _params)
            else:
                raise e
        else:
            start = time.monotonic()
            body_decoded = base64.b64decode(body['body'])
            # modify body here
            body_modified = base64.b64encode(body_decoded).decode("ascii")
            fulfill_params = {"responseCode": 200, "body": body_modified}
            fulfill_params.update(_params)
            _time = time.monotonic() - start
            if _time > 0.01:
                print(f"decoding took long: {_time} s")
            await global_conn.execute_cdp_cmd("Fetch.fulfillRequest", fulfill_params)
            print("Mocked response", url)


async def main():
    options = webdriver.ChromeOptions()
    options.add_argument("--window-size=500,900")
    # fix: please unblock challenges.cloudflare.com to proceed
    # Don't enforce the same-origin policy
    options.add_argument("--disable-web-security")
    async with webdriver.Chrome(options=options, max_ws_size=2 ** 30) as driver:
        driver.base_target.socket.on_closed.append(lambda code, reason: print(f"chrome exited"))
        global_conn = driver.base_target
        await driver.get("about:blank")
        await global_conn.execute_cdp_cmd("Fetch.enable", cmd_args={"patterns": [{"requestStage": "Response", "urlPattern":"*"}]})
        await global_conn.add_cdp_listener("Fetch.requestPaused", lambda data: on_request(data, global_conn))
        await driver.get(
            #'https://wikipedia.org',
            "https://nowsecure.nl/#relax", # test cloudflare
            timeout=60, wait_load=False)
        while True:
            #time.sleep(10) # no. cloudflare would hang
            await asyncio.sleep(10)


asyncio.run(main())
```