scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

Scrapy Playwright load chrome extensions and configure them #310

Open milan-cp-dev opened 3 months ago

milan-cp-dev commented 3 months ago

Do you have a ready-to-go method to initialize a captcha service's Chrome extension and configure it before visiting the page and obtaining the page context?

elacuesta commented 3 months ago

playwright_page_init_callback might be useful. According to these upstream docs you need to access the context, which you can do with the following page init callback:

async def init_page(page, request):
    context = page.context
    if len(context.background_pages) == 0:
        background_page = await context.wait_for_event('backgroundpage')
    else:
        background_page = context.background_pages[0]

Otherwise you'll need to elaborate on your use case.
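A minimal sketch of how such a callback could be attached to a request, assuming a context named "persistent" is defined in PLAYWRIGHT_CONTEXTS. The helper name extension_request_meta and the context name are hypothetical; only the meta keys playwright, playwright_context and playwright_page_init_callback come from scrapy-playwright.

```python
import logging

logger = logging.getLogger(__name__)


async def init_page(page, request):
    """Page init callback: runs after the page is created, before navigation."""
    context = page.context
    if len(context.background_pages) == 0:
        # Block until the extension's background page shows up.
        background_page = await context.wait_for_event("backgroundpage")
    else:
        background_page = context.background_pages[0]
    logger.info("Extension background page: %s", background_page)


def extension_request_meta(context_name="persistent"):
    """Build the Request.meta dict (hypothetical helper, not part of the library)."""
    return {
        "playwright": True,
        "playwright_context": context_name,
        "playwright_page_init_callback": init_page,
    }
```

In a spider this would be used as `yield scrapy.Request(url, meta=extension_request_meta())`. If the extension never opens a background page, the wait ends with a Playwright timeout error instead of returning.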

milan-cp-dev commented 3 months ago

Thanks! Will look into it.

milan-cp-dev commented 3 months ago

Hello,

The goal is to load Chrome extensions. I have a minimal reproducible example, but I still can't figure out how to load extensions. An example that loads any extension would be greatly appreciated.

My code uses scrapy-playwright to make a request with a persistent context and attempts to load a Chrome extension.
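For illustration, a sketch of spider settings that might load an unpacked extension through a persistent context. This is not the attached CaptchaSpider.py. It assumes that scrapy-playwright forwards the keyword arguments of a context containing user_data_dir to BrowserType.launch_persistent_context, so the extension flags are placed in the context definition rather than in PLAYWRIGHT_LAUNCH_OPTIONS. The directory names and the context name "persistent" are hypothetical.

```python
from pathlib import Path

# Hypothetical locations; adjust to the actual project layout.
EXTENSION_DIR = Path("anticaptcha-plugin_v0.67").resolve()
USER_DATA_DIR = Path("playwright-user-data").resolve()

custom_settings = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    "PLAYWRIGHT_BROWSER_TYPE": "chromium",
    "PLAYWRIGHT_CONTEXTS": {
        "persistent": {
            # The presence of user_data_dir makes this a persistent context.
            "user_data_dir": str(USER_DATA_DIR),
            # Headed mode, matching the xvfb-run invocation used for the crawl.
            "headless": False,
            "args": [
                f"--disable-extensions-except={EXTENSION_DIR}",
                f"--load-extension={EXTENSION_DIR}",
            ],
        },
    },
}
```

Requests then need `"playwright_context": "persistent"` in their meta so they run in this context instead of the default one.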

The Chrome extension was obtained from:
https://antcpt.com/eng/home.html
https://anti-captcha.com/
https://github.com/anti-captcha-plugin/anti-captcha-plugin?tab=readme-ov-file

The API key was updated in the config_ac_api_key.js file inside the js folder of anticaptcha-plugin_v0.67.zip.
Attachment: anticaptcha-plugin_v0.67.zip

The following commands were executed:

scrapy startproject playwrightextensions
cd playwrightextensions
(CaptchaSpider.py added to the spiders folder; attachment: CaptchaSpider.py.zip)
xvfb-run -a scrapy crawl CaptchaSpider

Expectation: the extension loads and an attempt to solve the captcha is recorded.
Reality: the extension doesn't load.
[Screenshot: CaptchaSpider]

Test done with plain Playwright: playwrightextensions.py (attachment: playwrightextensions.py.zip)

Tests done with the same anticaptcha-plugin_v0.67 folder in plain Playwright as well as in a regular Chrome browser: the extension loads and an attempt to solve the captcha is recorded.
[Screenshot: playwrightextensions]
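For comparison, a plain Playwright baseline along the lines of the upstream Chrome extensions docs could look roughly like the sketch below. It is not the attached playwrightextensions.py; the function names and paths are hypothetical, and Playwright is imported lazily so that the argument helper can be used on its own.

```python
from pathlib import Path


def extension_args(extension_dir):
    """Chromium flags that load a single unpacked extension."""
    path = str(Path(extension_dir).resolve())
    return [
        f"--disable-extensions-except={path}",
        f"--load-extension={path}",
    ]


def run(extension_dir, user_data_dir, url):
    """Open url in a persistent Chromium context with the extension loaded."""
    from playwright.sync_api import sync_playwright  # imported lazily

    with sync_playwright() as p:
        # Extensions require a persistent context and a headed browser.
        context = p.chromium.launch_persistent_context(
            str(user_data_dir),
            headless=False,
            args=extension_args(extension_dir),
        )
        if len(context.background_pages) == 0:
            context.wait_for_event("backgroundpage")
        page = context.new_page()
        page.goto(url)
        context.close()
```

A call such as `run("anticaptcha-plugin_v0.67", "/tmp/playwright-user-data", "https://example.com")` would exercise it, typically under xvfb-run on a machine without a display.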

Versions:

playwright --version
Version 1.39.0

python -c "import scrapy_playwright; print(scrapy_playwright.__version__)"
0.0.36

scrapy version -v
INFO:scrapy.utils.log:Scrapy 2.11.2 started (bot: playwrightextensions)
INFO:scrapy.utils.log:Versions: lxml 4.9.2.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.3.0, Python 3.11.5 (main, Jun 26 2024, 21:00:36) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)], pyOpenSSL 24.1.0 (OpenSSL 3.2.2 4 Jun 2024), cryptography 42.0.8, Platform Linux-5.14.0-284.25.1.el9_2.x86_64-x86_64-with-glibc2.34
Scrapy : 2.11.2
lxml : 4.9.2.0
libxml2 : 2.9.14
cssselect : 1.2.0
parsel : 1.9.1
w3lib : 2.2.1
Twisted : 24.3.0
Python : 3.11.5 (main, Jun 26 2024, 21:00:36) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
pyOpenSSL : 24.1.0 (OpenSSL 3.2.2 4 Jun 2024)
cryptography : 42.0.8
Platform : Linux-5.14.0-284.25.1.el9_2.x86_64-x86_64-with-glibc2.34