seleniumbase / SeleniumBase

📊 Python's all-in-one framework for web crawling, scraping, testing, and reporting. Supports pytest. UC Mode provides stealth. Includes many tools.
https://seleniumbase.io
MIT License

CDP events partially resolved in Undetected mode #2797

Closed bjornkarlsson closed 1 month ago

bjornkarlsson commented 1 month ago

I need to retrieve the 'Network.responseReceived' event for the URL being requested, but this is not possible when using UC Mode.

Example code highlighting the issue:

import json
import time
from seleniumbase import Driver

def _responses(messages):
    responses = {}
    for m in messages:
        message = m['message']['message']
        if message['method'] == 'Network.responseReceived':
            params = message['params']
            responses[params['response']['url']] = params
    return responses

def main():
    url = 'https://www.metalreviews.com/reviews/album/10355'

    with Driver(uc=True,
                log_cdp=True,
                ) as driver:
        driver.open(url)

        # Read the performance logs the same way as with the standard Selenium driver
        logs = [dict(m, message=json.loads(m['message'])) for m in driver.get_log('performance')]
        responses = _responses(logs)

        print(responses[url])  # OK

    with Driver(uc=True,
                log_cdp=True,
                ) as driver:
        driver.uc_open_with_reconnect(url)

        logs = [dict(m, message=json.loads(m['message'])) for m in driver.get_log('performance')]
        assert logs  # Either empty or containing only a limited set of messages compared to standard mode
        responses = _responses(logs)

        try:
            print(responses[url])  # Key Error
        except KeyError:
            pass

    # Same problem using a cdp_listener
    logs = []

    def add_log(m):
        m = dict(m, message=json.loads(m['message']))
        logs.append(m)

    with Driver(uc=True,
                log_cdp=True,
                ) as driver:
        driver.add_cdp_listener("Network.responseReceived", add_log)
        driver.uc_open_with_reconnect(url)
        time.sleep(2)
        responses = _responses(logs)

        try:
            print(responses[url])  # Key Error
        except KeyError:
            pass

if __name__ == '__main__':
    main()

Using undetected-chromedriver directly, I have the exact same issue:

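    import json
    import time

    import undetected_chromedriver as uc

    options = uc.ChromeOptions()
    # Enable Chrome's performance log so that CDP network events are recorded
    options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})

    driver = uc.Chrome(executable_path='/opt/homebrew/bin/chromedriver',
                       options=options)

    url = 'https://www.metalreviews.com/reviews/album/10355'
    with driver:
        driver.get(url)
        time.sleep(2)
        # Decode each entry's JSON payload, as _responses() (defined above) expects
        logs = [dict(m, message=json.loads(m['message'])) for m in driver.get_log('performance')]

        responses = _responses(logs)

        print(responses[url])  # OK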

It's unclear to me whether SeleniumBase is a fork or continuation of undetected-chromedriver, which has been discontinued. If this is a real issue (and not me using the API incorrectly), could it be fixed in this codebase, or will this feature remain broken?

I also tried enabling the uc_cdp flag, but it yields the same result.

Thanks for your support!

mdmintz commented 1 month ago

Looks like a duplicate of https://github.com/seleniumbase/SeleniumBase/issues/2162#issuecomment-1851998232.

You will need to call refresh() to get those logs. Otherwise, some logs are lost during UC Mode's disconnect/reconnect process, while the driver is disconnected from the browser.
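
A minimal sketch of that workaround, assuming the same Driver(uc=True, log_cdp=True) setup as in the original report: open the page with uc_open_with_reconnect(), call refresh() while the driver is connected, then read the performance log. The helper names responses_by_url and fetch_responses_after_refresh are illustrative and not part of SeleniumBase.

```python
import json


def responses_by_url(entries):
    """Map response URL -> params for each Network.responseReceived entry.

    `entries` are raw items from driver.get_log("performance"), where
    entry["message"] is a JSON string wrapping the CDP message.
    """
    responses = {}
    for entry in entries:
        message = json.loads(entry["message"])["message"]
        if message.get("method") == "Network.responseReceived":
            params = message["params"]
            responses[params["response"]["url"]] = params
    return responses


def fetch_responses_after_refresh(url):
    # Imported lazily so the parsing helper above works without a browser.
    from seleniumbase import Driver

    driver = Driver(uc=True, log_cdp=True)
    try:
        driver.uc_open_with_reconnect(url)
        # The driver was disconnected during the initial load, so reload
        # the page while connected to capture the network events.
        driver.refresh()
        return responses_by_url(driver.get_log("performance"))
    finally:
        driver.quit()
```

Calling fetch_responses_after_refresh(url).get(url) should then return the params for the main document, at the cost of one extra request for the refresh.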

bjornkarlsson commented 1 month ago

Tested, and it seems fine. Would the refresh hit the browser cache with the default options, or are there any options to enforce that?

The goal is mainly to halve the number of requests made to a rate-limited site within a given timespan.
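
One way to check this empirically rather than assume it: in the Chrome DevTools Protocol, the response object inside each Network.responseReceived event may carry an optional fromDiskCache flag. The sketch below splits URLs by that flag, taking the responses mapping built by _responses() in the script above as input. The function name split_by_cache is hypothetical, and the flag only covers the disk cache.

```python
def split_by_cache(responses):
    """Split response URLs into (from_disk_cache, from_network) lists.

    `responses` maps URL -> Network.responseReceived params, as built
    by the _responses() helper in the original script.
    """
    cached, network = [], []
    for url, params in responses.items():
        # fromDiskCache is optional in CDP, so treat a missing flag as False.
        if params["response"].get("fromDiskCache", False):
            cached.append(url)
        else:
            network.append(url)
    return cached, network
```

If the main document's URL lands in the second list after a refresh, the refresh went to the network rather than the disk cache.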

mdmintz commented 1 month ago

Refreshing the page will keep the options that were already set when you launched the web browser, plus any new ones that were added or changed via driver.execute_cdp_cmd(), such as for changing the GeoLocation. There's a good example of that GeoLocation changing here: SeleniumBase/examples/test_geolocation.py. I would experiment to learn more. Be sure to try out the various examples in the SeleniumBase/examples folder.
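
For illustration, a sketch of changing browser state through driver.execute_cdp_cmd() before a refresh. Emulation.setGeolocationOverride and Network.setCacheDisabled are Chrome DevTools Protocol commands, but whether they produce the caching behavior asked about above in UC Mode is an assumption to verify by experiment. The helper apply_cdp_settings and its arguments are made up for this example and are not part of SeleniumBase.

```python
def apply_cdp_settings(driver, latitude, longitude, cache_disabled=False):
    """Send CDP commands whose effects should carry over to later refreshes.

    `driver` is any object exposing Selenium's execute_cdp_cmd(cmd, args),
    for example a SeleniumBase Driver.
    """
    # Override the location reported to the page (accuracy is in meters).
    driver.execute_cdp_cmd(
        "Emulation.setGeolocationOverride",
        {"latitude": latitude, "longitude": longitude, "accuracy": 100},
    )
    # Explicitly set whether the browser cache is bypassed for requests.
    driver.execute_cdp_cmd(
        "Network.setCacheDisabled", {"cacheDisabled": cache_disabled}
    )
```

A call such as apply_cdp_settings(driver, 48.87645, 2.26340) followed by driver.refresh() would reload the page with those settings in place.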