seleniumbase / SeleniumBase

📊 Python's all-in-one framework for web crawling, scraping, testing, and reporting. Supports pytest. UC Mode provides stealth. Includes many tools.
https://seleniumbase.io
MIT License
4.46k stars 909 forks

Turn on ad privacy feature appears after repeated usage #2799

Closed bjornkarlsson closed 1 month ago

bjornkarlsson commented 1 month ago
image

This appears after loading a few URLs from the same domain (about 5 calls before the pop-up shows up). Each call is followed by these instructions to reset the session and cookies:

        driver.execute_script('window.localStorage.clear(); window.sessionStorage.clear();')
        driver.delete_all_cookies()
        driver.get('about:blank')

This is required because even a few requests within the same session (at intervals of 30 seconds) will get my IP banned. Pretty aggressive blocking.
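
The three reset steps above can be wrapped in a small helper. This is only a sketch: `reset_session` and its `blank_url` parameter are hypothetical names, not part of SeleniumBase, and the helper assumes nothing more than a driver object exposing `execute_script`, `delete_all_cookies`, and `get`, as in the snippet above.

```python
def reset_session(driver, blank_url="about:blank"):
    """Clear web storage and cookies, then park the tab on a blank page.

    Hypothetical helper: `driver` is any object with the three Selenium
    methods used below.
    """
    # Storage and cookies are cleared while the site's page is still loaded,
    # because both calls act on the document currently open in the tab.
    driver.execute_script(
        "window.localStorage.clear(); window.sessionStorage.clear();"
    )
    driver.delete_all_cookies()
    # Only then navigate away from the site.
    driver.get(blank_url)
```

The order is deliberate: navigating to the blank page first would leave the site's storage untouched.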

I am not setting any user-data directory, as I have seen that this issue was resolved:

https://github.com/seleniumbase/SeleniumBase/issues/2201

Is there any way to pass ChromeOptions directly to the Driver instantiation?

mdmintz commented 1 month ago

The --disable-features=PrivacySandboxSettings4 flag is already set to prevent that message. I'm not sure what you're doing exactly to reach it. I would need a full example that reproduces the issue.

As for setting ChromeOptions, that's a duplicate of https://github.com/seleniumbase/SeleniumBase/discussions/2482#discussioncomment-8434762.

bjornkarlsson commented 1 month ago

Apologies, this is an example that will replicate the issue, usually at the third iteration.

import json
import time

from seleniumbase import Driver

def main():
    url = 'https://rateyourmusic.com/list/novocaine69/3_500-albums-you-gotta-listen-to-ere-pushing-up-daisies-6th-edition/'

    with Driver(uc=True,
                log_cdp=True,
                ) as driver:
        for i in range(1, 20):
            driver.get(f'{url}{i}/')
            time.sleep(10)
            # Reset the session and force a new one to start, to avoid being blocked by the site
            driver.execute_script('window.localStorage.clear(); window.sessionStorage.clear();')
            driver.delete_all_cookies()
            driver.get('about:blank')


if __name__ == '__main__':
    main()

I just checked: it only happens with sites that have ads.

mdmintz commented 1 month ago

Here's a better script that does what you want:

from rich.pretty import pprint
from seleniumbase import SB

url = "https://rateyourmusic.com/list/novocaine69/3_500-albums-you-gotta-listen-to-ere-pushing-up-daisies-6th-edition/"
for i in range(5):
    with SB(uc=True, log_cdp=True, ad_block_on=True) as sb:
        sb.driver.uc_open_with_reconnect(url, 2)
        sb.sleep(3)
        pprint(sb.driver.get_log("performance"))

Be sure to go through the examples in the SeleniumBase/examples folder for more optimal UC Mode strategies.
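
The script above pretty-prints the raw performance log. As a rough sketch of how those entries could be filtered, the helper below pulls out the URLs that the browser requested. It assumes the usual Selenium performance-log shape (each entry is a dict whose "message" field is a JSON string wrapping a DevTools event); `requested_urls` is a hypothetical name, not a SeleniumBase function.

```python
import json


def requested_urls(entries):
    """Return the request URLs found in Selenium performance-log entries.

    Assumption: each entry looks like {"message": "<JSON string>", ...},
    and the decoded JSON has a "message" object with "method" and "params".
    """
    urls = []
    for entry in entries:
        event = json.loads(entry["message"])["message"]
        # Keep only the DevTools events emitted when a request is sent.
        if event.get("method") == "Network.requestWillBeSent":
            urls.append(event["params"]["request"]["url"])
    return urls
```

With the script above, this could be called as `requested_urls(sb.driver.get_log("performance"))`, for example to check which ad-related requests still go out.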

bjornkarlsson commented 1 month ago

Thanks, problem arises when keeping the driver open:

from seleniumbase import SB

url = "https://rateyourmusic.com/list/novocaine69/3_500-albums-you-gotta-listen-to-ere-pushing-up-daisies-6th-edition/"
with SB(uc=True, log_cdp=True, ad_block_on=True) as sb:
    for i in range(1, 21):
        pagination_url = f'{url}{i}/'
        sb.driver.uc_open_with_reconnect(pagination_url, 2)
        sb.sleep(3)
        # Reset the session and force a new one to start, to avoid being blocked by the site
        sb.driver.execute_script('window.localStorage.clear(); window.sessionStorage.clear();')
        sb.driver.delete_all_cookies()
        sb.driver.get('about:blank')

(Bear in mind that the URLs are paginated with an integer index at the end.)

This leads to the ad privacy feature being shown after a couple of iterations.

The example that you posted should have the same semantics: no user-data dir has been set, so when reopening the browser I'd expect a fresh new session. Great, that should also work; I don't mind reopening the browser at each request.

Is it still worth investigating, though, why the ad privacy window shows up when the driver is kept open? Cheers

mdmintz commented 1 month ago

With pagination:

from seleniumbase import SB

url = "https://rateyourmusic.com/list/novocaine69/3_500-albums-you-gotta-listen-to-ere-pushing-up-daisies-6th-edition/"
for i in range(10):
    with SB(uc=True, log_cdp=True, ad_block_on=True) as sb:
        sb.driver.uc_open_with_reconnect("%s%s" % (url, i))
        sb.sleep(3)
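
The loops in this thread build the page URLs in two slightly different ways: the reporter's script counts from 1 and appends a trailing slash, while the script above counts from 0 without one. A minimal sketch of the 1-based variant follows. `page_urls` is a hypothetical helper, and the thread does not say whether the site treats both forms the same.

```python
def page_urls(base_url, first=1, last=20):
    """Build paginated URLs of the form <base>/<index>/ (inclusive range)."""
    # Normalize the base so there is exactly one slash before the index.
    base = base_url.rstrip("/")
    return [f"{base}/{i}/" for i in range(first, last + 1)]
```

The resulting list can be iterated in place of `range(10)`, passing each URL to `sb.driver.uc_open_with_reconnect`.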

bjornkarlsson commented 1 month ago

OK. I will switch to the approach above, but for general information:

    with SB(uc=True, log_cdp=True, ad_block_on=True) as sb:
        for i in range(10):
            sb.driver.uc_open_with_reconnect("%s%s" % (url, i))
            sb.sleep(3)
            sb.driver.get('about:blank')

When a single session stays open and navigates to the blank page at each iteration, the ad privacy window appears. This did not happen with standard Selenium, but it feels like an extreme corner case, so I am not sure if it's worth investigating further.

mdmintz commented 1 month ago

Use data:, instead of about:blank:

from seleniumbase import SB

url = "https://rateyourmusic.com/list/novocaine69/3_500-albums-you-gotta-listen-to-ere-pushing-up-daisies-6th-edition/"
with SB(uc=True, log_cdp=True, ad_block_on=True) as sb:
    for i in range(5):
        sb.driver.uc_open_with_reconnect("%s%s" % (url, i))
        sb.sleep(2)
        sb.driver.get("data:,")

mdmintz commented 1 month ago

Also, adding --disable-background-networking or --disable-features=PrivacySandboxSettings4 to disable the Ad Privacy popup adds this new pop-up shown below, which shows up everywhere and not just under special conditions.

Screenshot 2024-05-24 at 12 11 53 PM

Therefore, the best option is probably using sb.driver.get("data:,") instead of sb.driver.get("about:blank") to load an empty page.
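
To make that recommendation harder to undo by accident, the blank-page URL can live in one constant. The sketch below is an assumption-laden illustration rather than a SeleniumBase API: `BLANK_PAGE`, `visit_all`, and the `pause` parameter are hypothetical names, and `driver` is expected to be a UC Mode driver exposing `uc_open_with_reconnect` and `get`, as used earlier in this thread.

```python
import time

# Empty data URL, used in place of about:blank as suggested above.
BLANK_PAGE = "data:,"


def visit_all(driver, urls, reconnect_time=2, pause=2.0, blank_url=BLANK_PAGE):
    """Open each URL in one long-lived session, parking on a blank page between visits."""
    for url in urls:
        driver.uc_open_with_reconnect(url, reconnect_time)
        # Give the page time to load before leaving it.
        time.sleep(pause)
        driver.get(blank_url)
```

Inside a `with SB(uc=True, ad_block_on=True) as sb:` block this would be called as `visit_all(sb.driver, urls)`.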

mdmintz commented 1 month ago

See https://github.com/GoogleChrome/chrome-launcher/blob/main/docs/chrome-flags-for-tools.md for all the Chromium command-line switches, which can be passed in via the SeleniumBase chromium_arg option (--chromium-arg on the command line) if not already specified.
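
As a hedged illustration of passing extra switches: the sketch below assumes that SeleniumBase's `chromium_arg` takes a single comma-separated string, which is how its documentation describes the option as far as I can tell. `join_chromium_args` and `open_driver` are hypothetical helpers, and the two switches in the usage note are picked from the linked list purely as examples.

```python
def join_chromium_args(flags):
    """Join Chromium switches into one comma-separated string."""
    cleaned = [flag.strip() for flag in flags if flag.strip()]
    for flag in cleaned:
        # A comma inside a switch value would be ambiguous once joined.
        if "," in flag:
            raise ValueError(f"flag contains a comma and cannot be joined: {flag!r}")
    return ",".join(cleaned)


def open_driver(flags):
    """Start a UC Mode driver with the given extra Chromium switches."""
    # Imported lazily so join_chromium_args works without SeleniumBase installed.
    from seleniumbase import Driver

    return Driver(uc=True, chromium_arg=join_chromium_args(flags))
```

For example, `open_driver(["--mute-audio", "--no-first-run"])` would pass both switches, provided SeleniumBase does not already set them.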