Open jwwq opened 5 months ago
Hi there @jwwq
I'm also trying to solve this. I'm under the impression that calling tab.send inside the event callback causes a deadlock. Here is a snippet of how I got something working; my use case is to extract the data from all Ajax requests:
import time
import nodriver as uc
from nodriver import cdp

xhr_requests = []
last_xhr_request = None


def listenXHR(page):
    async def handler(evt):
        # get ajax requests
        if evt.type_ is cdp.network.ResourceType.XHR or evt.type_ is cdp.network.ResourceType.FETCH:
            xhr_requests.append([evt.response.url, evt.request_id])
            global last_xhr_request
            last_xhr_request = time.time()
    page.add_handler(cdp.network.ResponseReceived, handler)


async def receiveXHR(page, requests):
    responses = []
    retries = 0
    max_retries = 5
    # wait at least 2 second after the last xhr request to get some more
    while True:
        if last_xhr_request is None or retries > max_retries:
            break
        if time.time() - last_xhr_request <= 2:
            retries = retries + 1
            time.sleep(2)
            continue
        else:
            break
    await page  # this is very important
    # loop through gathered requests and get its response body
    for request in requests:
        try:
            res = await page.send(cdp.network.get_response_body(request[1]))
            if res is None:
                continue
            responses.append({
                'url': request[0],
                'body': res[0],
                'is_base64': res[1]
            })
        except Exception as e:
            print("error get body", e)
    return responses


async def crawl():
    browser = await uc.start(headless=False)
    # use main tab
    tab = browser.main_tab
    listenXHR(tab)
    # change url to something that makes ajax requests
    tab = await browser.get("https://example.com")
    time.sleep(2)
    xhr_responses = await receiveXHR(tab, xhr_requests)
    print(xhr_responses)


if __name__ == '__main__':
    uc.loop().run_until_complete(crawl())
Excuse my Python, I have been using the language for less than 10h lol
NOTE: If I call cdp.network.get_response_body on every request, I get None for all of them, so I had to pick specifically which URLs to add to the xhr_requests variable for it to work.
I hope this helps somehow. I'm looking forward to a better solution, or to examples/explanations of how to actually do this.
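The deadlock suspicion above can be illustrated with a toy model that does not use nodriver at all. Everything here (ToyConnection, its send and reader methods, the demo function) is hypothetical. It only assumes that a single reader loop both delivers events to handlers and resolves replies to pending commands; whether nodriver's internals actually work this way is an assumption, not something confirmed in this thread.

```python
import asyncio


class ToyConnection:
    """Hypothetical stand-in for a CDP connection; not nodriver's real class."""

    def __init__(self):
        self.inbox = asyncio.Queue()  # messages "from the browser"
        self.pending = {}             # command id -> Future awaiting a reply
        self.handlers = []
        self.next_id = 0

    async def send(self, command):
        # Register a future and queue the reply the browser would send back.
        self.next_id += 1
        fut = asyncio.get_running_loop().create_future()
        self.pending[self.next_id] = fut
        await self.inbox.put(("reply", self.next_id, f"result of {command}"))
        return await fut

    async def reader(self):
        # One loop handles both replies and events, awaiting handlers inline.
        while True:
            kind, key, payload = await self.inbox.get()
            if kind == "reply":
                self.pending.pop(key).set_result(payload)
            else:
                for handler in self.handlers:
                    await handler(payload)


async def demo(spawn_task):
    conn = ToyConnection()
    results = []
    tasks = []

    async def fetch_body(event):
        results.append(await conn.send(f"get_body({event})"))

    async def handler(event):
        if spawn_task:
            # Return immediately; the reader stays free to deliver the reply.
            tasks.append(asyncio.create_task(fetch_body(event)))
        else:
            # The reader is stuck here, so the reply is never processed.
            await fetch_body(event)

    conn.handlers.append(handler)
    reader = asyncio.create_task(conn.reader())
    await conn.inbox.put(("event", None, "req-1"))
    await asyncio.sleep(0.2)  # give the toy connection time to make progress
    reader.cancel()
    await asyncio.gather(reader, return_exceptions=True)
    return results


if __name__ == "__main__":
    print("awaiting inside handler:", asyncio.run(demo(spawn_task=False)))
    print("spawning a task instead:", asyncio.run(demo(spawn_task=True)))
```

In this model, awaiting send inside the handler never completes, while collecting the work and running it outside the handler does. That matches the workaround in the snippet above, which only records request IDs in the handler and fetches the bodies later.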
Hi @falmar, your code helped me a lot for my use case. I have a few suggestions to make for the code you provided:-
Here's a slightly modified version of the same code you provided:-
import asyncio
import nodriver as uc
from nodriver import cdp


class RequestMonitor:
    def __init__(self):
        self.requests = []
        self.last_request = None
        self.lock = asyncio.Lock()

    async def listen(self, page):
        async def handler(evt):
            async with self.lock:
                if evt.response.encoded_data_length > 0 and evt.type_ is cdp.network.ResourceType.XHR:
                    # print(f"EVENT PERCEIVED BY BROWSER IS:- {evt.type_}") # If unsure about event or to check behaviour of browser
                    self.requests.append([evt.response.url, evt.request_id])
                    self.last_request = time.time()
        page.add_handler(cdp.network.ResponseReceived, handler)

    async def receive(self, page):
        responses = []
        retries = 0
        max_retries = 5
        # Wait at least 2 seconds after the last XHR request to get some more
        while True:
            if self.last_request is None or retries > max_retries:
                break
            if time.time() - self.last_request <= 2:
                retries += 1
                await asyncio.sleep(2)
                continue
            else:
                break
        await page  # Waiting for page operation to complete.
        # Loop through gathered requests and get its response body
        async with self.lock:
            for request in self.requests:
                try:
                    res = await page.send(cdp.network.get_response_body(request[1]))
                    if res is None:
                        continue
                    responses.append({
                        'url': request[0],
                        'body': res[0],  # Assuming res[0] is the response body
                        'is_base64': res[1]  # Assuming res[1] indicates if response is base64 encoded
                    })
                except Exception as e:
                    print("Error getting body", e)
        return responses


async def crawl():
    browser = await uc.start(headless=False)
    monitor = RequestMonitor()
    tab = browser.main_tab
    await monitor.listenXHR(tab)
    # Change URL based on use case.
    tab = await browser.get("https://www.example.com")
    await asyncio.sleep(2)
    xhr_responses = await monitor.receiveXHR(tab)
    # Print URL and response body
    for response in xhr_responses:
        print(f"URL: {response['url']}")
        print("Response Body:")
        print(response['body'] if not response['is_base64'] else "Base64 encoded data")


if __name__ == '__main__':
    uc.loop().run_until_complete(crawl())
Apologies if I have made any mistakes and for my English too.
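The switch from time.sleep to asyncio.sleep in the version above matters because time.sleep blocks the whole event loop, so no event handler can run while it sleeps. A small self-contained sketch of the difference follows; the ticker coroutine, the count_ticks helper and the 0.05 s / 0.3 s timings are made up for illustration and stand in for handlers that need the loop.

```python
import asyncio
import time


async def ticker(ticks):
    # Background coroutine standing in for event handlers that need the loop.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.05)


async def count_ticks(blocking):
    ticks = []
    task = asyncio.create_task(ticker(ticks))
    await asyncio.sleep(0)        # let the ticker start
    before = len(ticks)
    if blocking:
        time.sleep(0.3)           # blocks the event loop; ticker cannot run
    else:
        await asyncio.sleep(0.3)  # yields to the loop; ticker keeps running
    during = len(ticks) - before
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return during


if __name__ == "__main__":
    print("ticks during time.sleep:   ", asyncio.run(count_ticks(True)))
    print("ticks during asyncio.sleep:", asyncio.run(count_ticks(False)))
```

With the blocking sleep the ticker records nothing during the wait; with asyncio.sleep it keeps ticking. By the same reasoning, response events arriving during a time.sleep(2) would presumably only be handled once the sleep returns.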
Hi @falmar, your code helped me a lot for my use case. I have a few suggestions to make for the code you provided:-
1. Checking the encoded_data_length for the evt: I ran into this issue of receiving None for the response body, so including this check inside the listenXHR function adds an extra layer of checking for the response body.
2. Using asyncio.sleep() instead of time.sleep(): This is a minor change, but I've heard that time.sleep() is blocking in nature, so it's better to use asyncio.sleep() instead.
3. Using asyncio.Lock(): Again just a minor change; an asyncio.Lock() used inside a class that encapsulates the global variables provides additional protection against data corruption and race conditions.
Here's a slightly modified version of the same code you provided:-
import asyncio
import nodriver as uc
from nodriver import cdp

class RequestMonitor:
    def __init__(self):
        self.requests = []
Apologies if I have made any mistakes and for my English too.
You're aware this code does not run as posted, correct?
Apologies, I might have made mistakes while modifying it. Could you tell me the issue that you're encountering? I ran it on my system and it was working fine for me.
I have modified the script by @utam-1 to work.
Mostly what was broken was just some names that had been changed, like calling monitor.listenXHR when the function is actually monitor.listen.
Also, since the script is printing out XHR, loading up example.com won't return anything, as it doesn't use any XHR (as pointed out in the original answer by @falmar).
I've also added typing, just cause I was gonna add that anyways, when using it for my project, and never hurts y'know.
I personally still have a suspicion that there must be a better way of achieving this, but I do think this is just a tad closer. Also, the original script posted by @falmar does seem to work for me, but I am just going off of the assumption that the changes by @utam-1 are actually improvements, as I myself don't know any better.
# Modified from https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/1832#issuecomment-2075243964, https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/1832#issuecomment-2092205033
# Tested working with Python 3.12.5, Windows 11, nodriver 0.36.
import asyncio
import nodriver as uc
from nodriver import cdp
import time
import typing


class ResponseType(typing.TypedDict):
    url: str
    body: str
    is_base64: bool


class RequestMonitor:
    def __init__(self):
        # Typed this way, as I couldn't figure out how to do Typescript-like tuples.
        self.requests: list[list[str | cdp.network.RequestId]] = []
        self.last_request: float | None = None
        self.lock = asyncio.Lock()

    async def listen(self, page: uc.Tab):
        async def handler(evt: cdp.network.ResponseReceived):
            async with self.lock:
                if evt.response.encoded_data_length > 0 and evt.type_ is cdp.network.ResourceType.XHR:
                    #print(f'EVENT PERCEIVED BY BROWSER IS:- {evt.type_}') # If unsure about event or to check behaviour of browser
                    self.requests.append([evt.response.url, evt.request_id])
                    self.last_request = time.time()
        page.add_handler(cdp.network.ResponseReceived, handler)

    async def receive(self, page: uc.Tab):
        responses: list[ResponseType] = []
        retries = 0
        max_retries = 5
        # Wait at least 2 seconds after the last XHR request to get some more
        while True:
            if self.last_request is None or retries > max_retries:
                break
            if time.time() - self.last_request <= 2:
                retries += 1
                await asyncio.sleep(2)
                continue
            else:
                break
        await page  # Waiting for page operation to complete.
        # Loop through gathered requests and get its response body
        async with self.lock:
            for request in self.requests:
                try:
                    if not isinstance(request[1], cdp.network.RequestId):
                        raise ValueError('Request ID is not of type RequestId')
                    res = await page.send(cdp.network.get_response_body(request[1]))
                    if res is None:
                        continue
                    responses.append({
                        'url': request[0],
                        'body': res[0],  # Assuming res[0] is the response body
                        'is_base64': res[1]  # Assuming res[1] indicates if response is base64 encoded
                    })
                except Exception as e:
                    print('Error getting body', e)
        return responses


async def crawl():
    browser = await uc.start(headless=False)
    monitor = RequestMonitor()
    tab = await browser.get('about:blank')
    await monitor.listen(tab)
    # Change URL based on use case.
    tab = await browser.get('https://bing.com')
    xhr_responses = await monitor.receive(tab)
    # Print URL and response body
    for response in xhr_responses:
        # Double quotes inside the f-string so this also parses before Python 3.12.
        print(f'URL: {response["url"]}')
        print('Response Body:')
        print(response['body'] if not response['is_base64'] else 'Base64 encoded data')


if __name__ == '__main__':
    uc.loop().run_until_complete(crawl())
Hope this helps the next person that comes across this issue :)
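On the typing comment in the script above: Python does have fixed-length tuples typed per position, written as tuple[str, X]. A minimal sketch follows. RequestId here is a NewType stand-in for cdp.network.RequestId so the snippet runs without nodriver installed, and the PendingRequest alias and remember helper are names invented for this example.

```python
from typing import NewType

# Stand-in for cdp.network.RequestId, so this runs without nodriver installed.
RequestId = NewType("RequestId", str)

# Fixed-shape tuple: position 0 is the URL, position 1 is the request ID.
PendingRequest = tuple[str, RequestId]


def remember(requests: list[PendingRequest], url: str, request_id: RequestId) -> None:
    # Store the pair as a tuple instead of a two-element list.
    requests.append((url, request_id))


requests: list[PendingRequest] = []
remember(requests, "https://example.com/api", RequestId("1000.1"))

# Unpacking gives each element its own static type (str and RequestId).
url, request_id = requests[0]
print(url, request_id)
```

With the requests list typed this way, a type checker already knows that the second element is a request ID, so the isinstance check in receive would likely no longer be needed to satisfy it.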
Thanks, you saved me a lot of time!!!!
Thanks for pointing out those mistakes! I was experimenting with the code a lot and made those errors while commenting - apologies to anyone who tried the broken code😅.
So is there no way at all to run res = await page.send(cdp.network.get_response_body(request[1])) inside the ResponseReceived handler?
Maybe there is, but as of now there is no known workaround for that. Apart from the approach suggested by @falmar and the subsequent improvements suggested by me and @jensmogens, I also devised another approach. The idea is to have separate coroutines handle the processing of response bodies and of finished events, connected through an asyncio queue, so the approach basically follows the producer-consumer pattern. However, I believe it would be of little use to post that method here, as we've already tried splitting the work across coroutines to achieve our goal.
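For readers curious about the queue-based idea described above, here is a rough sketch of the producer-consumer pattern with asyncio.Queue. It is not the code the commenter wrote. fake_get_body is a hypothetical placeholder for await page.send(cdp.network.get_response_body(request_id)), and the consumer, handler and run_pipeline names are invented for this example.

```python
import asyncio


async def fake_get_body(request_id):
    # Placeholder for: await page.send(cdp.network.get_response_body(request_id))
    await asyncio.sleep(0.01)
    return f"body of {request_id}"


async def consumer(queue, responses):
    # Runs outside the event handler, so it is free to await further commands.
    while True:
        item = await queue.get()
        if item is None:  # sentinel: no more requests will arrive
            break
        url, request_id = item
        responses.append({"url": url, "body": await fake_get_body(request_id)})


async def run_pipeline(events):
    queue = asyncio.Queue()
    responses = []

    async def handler(event):
        # Producer: only records the request and returns immediately.
        queue.put_nowait(event)

    worker = asyncio.create_task(consumer(queue, responses))
    for event in events:  # stands in for ResponseReceived events firing
        await handler(event)
    queue.put_nowait(None)
    await worker
    return responses


if __name__ == "__main__":
    sample = [("https://a.test/x", "1"), ("https://a.test/y", "2")]
    print(asyncio.run(run_pipeline(sample)))
```

The handler never awaits anything slow, so in this sketch the event side and the body-fetching side cannot block each other. Whether that is enough to avoid the hang with a real tab is not established in this thread.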
Good afternoon, thank you for your great work! Based on your "network_monitor.py" example, I am trying to retrieve the contents of the response. I am using the LoadingFinished handler to make sure that the file has been retrieved completely. Unfortunately, the process hangs forever when I try to send a command to CDP (see full code below).
cdp_cmd = cdp.network.get_response_body(event.request_id)
res = await global_browser.main_tab.send(cdp_cmd)
Please help!
(Other than that, there is one more question: is there any way to get the tab in the handler without global variables? It's a minor issue, though.)
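On the side question about reaching the tab without a global: one common option is to build the handler in a factory function so that it closes over the tab, which is what the nested handler functions elsewhere in this thread already do with page. A sketch with a hypothetical FakeTab class standing in for a real tab; only the add_handler(event_type, handler) shape mirrors a call used in this thread, while fire, make_handler and demo_closure are invented for the demo.

```python
import asyncio


class FakeTab:
    """Hypothetical stand-in for a nodriver tab, just enough for the demo."""

    def __init__(self, name):
        self.name = name
        self.handlers = {}

    def add_handler(self, event_type, handler):
        self.handlers.setdefault(event_type, []).append(handler)

    async def fire(self, event_type, event):
        # Deliver an event to every handler registered for this type.
        for handler in self.handlers.get(event_type, []):
            await handler(event)


def make_handler(tab, seen):
    async def handler(event):
        # `tab` comes from the enclosing scope, so no global is needed.
        seen.append((tab.name, event))
    return handler


async def demo_closure():
    seen = []
    tab = FakeTab("main")
    tab.add_handler("LoadingFinished", make_handler(tab, seen))
    await tab.fire("LoadingFinished", "req-1")
    return seen


if __name__ == "__main__":
    print(asyncio.run(demo_closure()))
```

This only addresses how the handler gets hold of the tab; it does not by itself change whether sending a command from inside the handler hangs.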