ultrafunkamsterdam / undetected-chromedriver

Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / CloudFlare IUAM)
https://github.com/UltrafunkAmsterdam/undetected-chromedriver
GNU General Public License v3.0
9.61k stars 1.14k forks

Nodriver: CDP get_response_body command not working #1832

Open jwwq opened 5 months ago

jwwq commented 5 months ago

Good afternoon, thank you for your great work! Based on your "network_monitor.py" example, I am trying to retrieve the contents of a response. I am using the LoadingFinished handler to make sure that the file has been retrieved completely. Unfortunately, the process hangs forever when I try to send a command to CDP (see the full code below).

cdp_cmd = cdp.network.get_response_body(event.request_id)
res = await global_browser.main_tab.send(cdp_cmd)

Please help!

(One more minor question: is there any way to get the tab inside a handler without global variables?)


from nodriver import start, cdp, loop

global_tab = None

async def main():
    browser = await start()
    tab = browser.main_tab
    global global_tab
    global_tab = tab
    tab.add_handler(cdp.network.RequestWillBeSent, send_handler)
    tab.add_handler(cdp.network.ResponseReceived, receive_handler)
    tab.add_handler(cdp.network.LoadingFinished, finished_handler)

    tab = await browser.get("https://www.google.com/?hl=en")

async def receive_handler(event: cdp.network.ResponseReceived):
    # print(event.response)
    return

async def send_handler(event: cdp.network.RequestWillBeSent):
    return

async def finished_handler(event: cdp.network.LoadingFinished):
    global global_tab
    print("finished:", event.request_id, ":", event.encoded_data_length)    
    if event.encoded_data_length > 0:
        cdp_cmd = cdp.network.get_response_body(event.request_id)
        print("SENDING...")
        res = await global_tab.send(cdp_cmd)
        # THE PROCESS HANGS HERE FOREVER.
        print("RESULT:", res)       

if __name__ == "__main__":
    loop().run_until_complete(main())
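
On the minor question about globals, one option is to bind the tab to the handler before registering it, for example with functools.partial (a closure works the same way). The sketch below uses a hypothetical ToyTab stand-in rather than nodriver, so it only shows the binding pattern. Whether a given nodriver version accepts a partial object in add_handler is an assumption to verify.

```python
import asyncio
from functools import partial


class ToyTab:
    """Hypothetical stand-in for a tab; it only mimics add_handler-style dispatch."""

    def __init__(self, name):
        self.name = name
        self.handlers = {}

    def add_handler(self, event_type, handler):
        self.handlers.setdefault(event_type, []).append(handler)

    async def emit(self, event_type, event):
        for handler in self.handlers.get(event_type, []):
            await handler(event)


async def finished_handler(tab, seen, event):
    # `tab` and `seen` were bound at registration time, so no global is needed.
    seen.append((tab.name, event))


async def demo():
    tab = ToyTab("main")
    seen = []
    tab.add_handler("LoadingFinished", partial(finished_handler, tab, seen))
    await tab.emit("LoadingFinished", "req-1")
    return seen


if __name__ == "__main__":
    print(asyncio.run(demo()))
```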

falmar commented 4 months ago

Hi there @jwwq

I'm also trying to solve this. I'm under the impression that calling tab.send inside the event callback causes a deadlock. Here is a snippet of how I got something working; my use case is to extract the data from all Ajax requests.

import time
import nodriver as uc
from nodriver import cdp

xhr_requests = []
last_xhr_request = None

def listenXHR(page):
    async def handler(evt):
        # get ajax requests
        if evt.type_ is cdp.network.ResourceType.XHR or evt.type_ is cdp.network.ResourceType.FETCH:
            xhr_requests.append([evt.response.url, evt.request_id])
            global last_xhr_request
            last_xhr_request = time.time()

    page.add_handler(cdp.network.ResponseReceived, handler)

async def receiveXHR(page, requests):
    responses = []
    retries = 0
    max_retries = 5

    # wait at least 2 second after the last xhr request to get some more
    while True:
        if last_xhr_request is None or retries > max_retries:
            break

        if time.time() - last_xhr_request <= 2:
            retries = retries + 1
            time.sleep(2)

            continue
        else:
            break

    await page # this is very important

    # loop through gathered requests and get its response body
    for request in requests:
        try:
            res = await page.send(cdp.network.get_response_body(request[1]))
            if res is None:
                continue

            responses.append({
                'url': request[0],
                'body': res[0],
                'is_base64': res[1]
            })
        except Exception as e:
            print("error get body", e)

    return responses

async def crawl():
    browser = await uc.start(headless=False)

    # use main tab
    tab = browser.main_tab

    listenXHR(tab)

    # change url to something that makes ajax requests
    tab = await browser.get("https://example.com")
    time.sleep(2)
    xhr_responses = await receiveXHR(tab, xhr_requests)

    print(xhr_responses)

if __name__ == '__main__':
    uc.loop().run_until_complete(crawl())

Excuse my Python, I have been using the language for less than 10 hours lol

NOTE: If I call cdp.network.get_response_body on every request, I get None for all of them, so I had to pick specifically which URLs to add to the xhr_requests variable for it to work.

I hope this helps somehow, and I'm looking forward to a better solution or examples/explanation of how to actually do this.
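
One way such a hang can arise is sketched below with a toy model that uses only asyncio. It is not nodriver's implementation, and whether nodriver's listener really behaves like the "inline" mode here is an assumption. In the model a single reader loop delivers both events and command replies. If the reader awaits the handler itself, a handler that sends a command waits for a reply that only the stalled reader could deliver. ToyConnection, run and the message tuples are hypothetical names.

```python
import asyncio


class ToyConnection:
    """Toy model: a single reader loop delivers both events and command replies."""

    def __init__(self, inline):
        self.inline = inline          # True: the reader awaits the handler itself
        self.inbox = asyncio.Queue()  # messages arriving "from the browser"
        self.pending = {}             # command id -> future waiting for its reply
        self.tasks = set()            # keeps handler tasks referenced
        self.handler = None
        self.next_id = 0

    async def send(self, command):
        self.next_id += 1
        reply = asyncio.get_running_loop().create_future()
        self.pending[self.next_id] = reply
        # The fake browser answers at once, but only the reader can deliver it.
        self.inbox.put_nowait(("reply", self.next_id, f"body for {command}"))
        return await reply

    async def reader(self):
        while True:
            kind, key, payload = await self.inbox.get()
            if kind == "reply":
                self.pending.pop(key).set_result(payload)
            elif self.inline:
                # The reader stalls here until the handler returns.
                await self.handler(self, payload)
            else:
                # The handler runs as its own task; the reader keeps reading.
                task = asyncio.create_task(self.handler(self, payload))
                self.tasks.add(task)
                task.add_done_callback(self.tasks.discard)


async def run(inline):
    conn = ToyConnection(inline)
    result = asyncio.get_running_loop().create_future()

    async def handler(connection, request_id):
        result.set_result(await connection.send(request_id))

    conn.handler = handler
    reader = asyncio.create_task(conn.reader())
    conn.inbox.put_nowait(("event", None, "req-1"))
    try:
        return await asyncio.wait_for(result, timeout=0.5)
    except asyncio.TimeoutError:
        return "timed out"
    finally:
        reader.cancel()
        await asyncio.gather(reader, return_exceptions=True)


if __name__ == "__main__":
    print("handler awaited by reader:", asyncio.run(run(inline=True)))
    print("handler run as a task:", asyncio.run(run(inline=False)))
```

In this model, deferring the command until after the handler has returned, as the snippet above does by collecting request IDs first, avoids the stall for the same reason the task-based mode does.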

utam-1 commented 4 months ago

Hi @falmar, your code helped me a lot for my use case. I have a few suggestions for the code you provided:

  1. Checking encoded_data_length for the evt: I ran into the issue of receiving None for the response body, so including this check inside the listenXHR function adds an extra safeguard before the body is requested.
  2. Using asyncio.sleep() instead of time.sleep(): This is a minor change, but time.sleep() blocks the event loop, so asyncio.sleep() is the better choice inside a coroutine.
  3. Using asyncio.Lock(): Again a minor change; an asyncio.Lock() inside a class that encapsulates the former global variables adds protection against data corruption and race conditions.
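
To illustrate the second point, here is a small standard-library sketch, independent of nodriver. It counts how often a background task gets to run during each kind of pause. The names ticker, ticks_during, blocking_pause and yielding_pause are made up for this illustration.

```python
import asyncio
import time


async def ticker(counter):
    # Background task that should tick roughly every 10 ms.
    while True:
        counter["ticks"] += 1
        await asyncio.sleep(0.01)


async def ticks_during(pause):
    """Return how many times the ticker ran while `pause` was executing."""
    counter = {"ticks": 0}
    task = asyncio.create_task(ticker(counter))
    await asyncio.sleep(0)  # let the ticker run once before measuring
    before = counter["ticks"]
    await pause()
    after = counter["ticks"]
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return after - before


async def blocking_pause():
    time.sleep(0.2)  # holds the event loop; no other task can run


async def yielding_pause():
    await asyncio.sleep(0.2)  # suspends only this coroutine


async def compare():
    blocked = await ticks_during(blocking_pause)
    yielded = await ticks_during(yielding_pause)
    return blocked, yielded


if __name__ == "__main__":
    blocked, yielded = asyncio.run(compare())
    print("ticks during time.sleep:", blocked)
    print("ticks during asyncio.sleep:", yielded)
```

During time.sleep() the ticker does not advance at all, which suggests that event handlers registered on a tab could be starved in the same way while the crawler sleeps.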

Here's a slightly modified version of the same code you provided:

import asyncio
import nodriver as uc
from nodriver import cdp

class RequestMonitor:
    def __init__(self):
        self.requests = []
        self.last_request = None
        self.lock = asyncio.Lock()

    async def listen(self, page):
        async def handler(evt):
            async with self.lock:
                if evt.response.encoded_data_length > 0 and evt.type_ is cdp.network.ResourceType.XHR:
                    # print(f"EVENT PERCEIVED BY BROWSER IS:- {evt.type_}") # If unsure about event or to check behaviour of browser
                    self.requests.append([evt.response.url, evt.request_id])
                    self.last_request = time.time()

        page.add_handler(cdp.network.ResponseReceived, handler)

    async def receive(self, page):
        responses = []
        retries = 0
        max_retries = 5

        # Wait at least 2 seconds after the last XHR request to get some more
        while True:
            if self.last_request is None or retries > max_retries:
                break

            if time.time() - self.last_request <= 2:
                retries += 1
                await asyncio.sleep(2)
                continue
            else:
                break

        await page  # Waiting for page operation to complete.

        # Loop through gathered requests and get its response body
        async with self.lock:
            for request in self.requests:
                try:
                    res = await page.send(cdp.network.get_response_body(request[1]))
                    if res is None:
                        continue
                    responses.append({
                        'url': request[0],
                        'body': res[0],  # Assuming res[0] is the response body
                        'is_base64': res[1]  # Assuming res[1] indicates if response is base64 encoded
                    })
                except Exception as e:
                    print("Error getting body", e)

        return responses

async def crawl():
    browser = await uc.start(headless=False)
    monitor = RequestMonitor()
    tab = browser.main_tab

    await monitor.listenXHR(tab)

    # Change URL based on use case.
    tab = await browser.get("https://www.example.com")

    await asyncio.sleep(2)

    xhr_responses = await monitor.receiveXHR(tab)

    # Print URL and response body
    for response in xhr_responses:
        print(f"URL: {response['url']}")
        print("Response Body:")
        print(response['body'] if not response['is_base64'] else "Base64 encoded data")

if __name__ == '__main__':
    uc.loop().run_until_complete(crawl())

Apologies if I have made any mistakes and for my English too.

RzNmKX commented 3 months ago

You're aware this code does not run as posted, correct?

utam-1 commented 3 months ago

Apologies, I might have made mistakes while modifying it. Could you tell me what issue you're encountering? I ran it on my system and it was working fine for me.

jensmogens commented 2 weeks ago

I have modified the script by @utam-1 to work. Mostly what was broken, was just some variables that had their name changed, like calling monitor.listenXHR when the function is actually monitor.listen. Also, since the script is printing out XHR, loading up example.com won't return anything, as it doesn't use any XHR (as pointed out in the original answer by @falmar). I've also added typing, just cause I was gonna add that anyways, when using it for my project, and never hurts y'know. I personally still have a suspicion that there must be a better way of achieving this, but I do think this is just a tad closer. Also, the original script posted by @falmar does seem to work for me, but I am just going off of the assumption that the changes by @utam-1 are actually improvements, as I myself don't know any better.

# Modified from https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/1832#issuecomment-2075243964, https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/1832#issuecomment-2092205033
# Tested working with Python 3.12.5, Windows 11, nodriver 0.36.

import asyncio
import nodriver as uc
from nodriver import cdp
import time
import typing

class ResponseType(typing.TypedDict):
    url: str
    body: str
    is_base64: bool

class RequestMonitor:
    def __init__(self):
        # Typed this way, as I couldn't figure out how to do Typescript-like tuples.
        self.requests: list[list[str | cdp.network.RequestId]] = []
        self.last_request: float | None = None
        self.lock = asyncio.Lock()

    async def listen(self, page: uc.Tab):
        async def handler(evt: cdp.network.ResponseReceived):
            async with self.lock:
                if evt.response.encoded_data_length > 0 and evt.type_ is cdp.network.ResourceType.XHR:
                    #print(f'EVENT PERCEIVED BY BROWSER IS:- {evt.type_}') # If unsure about event or to check behaviour of browser
                    self.requests.append([evt.response.url, evt.request_id])
                    self.last_request = time.time()

        page.add_handler(cdp.network.ResponseReceived, handler)

    async def receive(self, page: uc.Tab):
        responses: list[ResponseType] = []
        retries = 0
        max_retries = 5

        # Wait at least 2 seconds after the last XHR request to get some more
        while True:
            if self.last_request is None or retries > max_retries:
                break

            if time.time() - self.last_request <= 2:
                retries += 1
                await asyncio.sleep(2)
                continue
            else:
                break

        await page  # Waiting for page operation to complete.

        # Loop through gathered requests and get its response body
        async with self.lock:
            for request in self.requests:
                try:
                    if not isinstance(request[1], cdp.network.RequestId):
                        raise ValueError('Request ID is not of type RequestId')

                    res = await page.send(cdp.network.get_response_body(request[1]))
                    if res is None:
                        continue
                    responses.append({
                        'url': request[0],
                        'body': res[0],  # Assuming res[0] is the response body
                        'is_base64': res[1]  # Assuming res[1] indicates if response is base64 encoded
                    })
                except Exception as e:
                    print('Error getting body', e)

        return responses

async def crawl():
    browser = await uc.start(headless=False)
    monitor = RequestMonitor()

    tab = await browser.get('about:blank')

    await monitor.listen(tab)

    # Change URL based on use case.
    tab = await browser.get('https://bing.com')

    xhr_responses = await monitor.receive(tab)

    # Print URL and response body
    for response in xhr_responses:
        print(f"URL: {response['url']}")  # double quotes outside so this also parses before Python 3.12
        print('Response Body:')
        print(response['body'] if not response['is_base64'] else 'Base64 encoded data')

if __name__ == '__main__':
    uc.loop().run_until_complete(crawl())

Hope this helps the next person that comes across this issue :)

makovez commented 2 days ago

Thanks you saved me lot of time!!!!

utam-1 commented 2 days ago

Thanks for pointing out those mistakes! I was experimenting with the code a lot and made those errors while commenting - apologies to anyone who tried the broken code😅.

makovez commented 2 days ago

So is there no way at all to run res = await page.send(cdp.network.get_response_body(request[1])) inside the ResponseReceived handler?

utam-1 commented 11 hours ago

Maybe there is, but as of now I don't know of a workaround for that. Apart from the approach suggested by @falmar and the subsequent improvements suggested by me and @jensmogens, I also devised another approach. The idea is to have separate coroutines handle the finished events and the processing of response bodies, connected by an asyncio.Queue, so it basically follows the producer-consumer pattern. However, I believe there would be little use in posting that method here, as we have already tried splitting the coroutines to achieve our goal.
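
For readers unfamiliar with the pattern, a minimal producer-consumer sketch with asyncio.Queue could look like the following. This is not the code referred to above, and it does not talk to a browser: fake_get_response_body is a hypothetical stand-in for page.send(cdp.network.get_response_body(...)), and the handler is called by hand instead of being registered with add_handler.

```python
import asyncio


async def fake_get_response_body(request_id):
    """Hypothetical stand-in for the real CDP call; returns (body, is_base64)."""
    await asyncio.sleep(0)
    return f"body of {request_id}", False


async def on_loading_finished(queue, request_id):
    # Producer: do no CDP work inside the handler, only hand the ID over.
    queue.put_nowait(request_id)


async def body_worker(queue, results):
    # Consumer: runs as its own task, outside any event handler.
    while True:
        request_id = await queue.get()
        try:
            body, _is_base64 = await fake_get_response_body(request_id)
            results[request_id] = body
        finally:
            queue.task_done()


async def demo():
    queue = asyncio.Queue()
    results = {}
    worker = asyncio.create_task(body_worker(queue, results))
    for request_id in ("req-1", "req-2", "req-3"):
        await on_loading_finished(queue, request_id)
    await queue.join()  # wait until every queued ID has been processed
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The handler returns immediately after queueing the ID, so it never waits on a command reply itself; whether that is enough to avoid the hang with a real tab is not something this sketch can show.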