Open rasert opened 1 month ago
Found this in the logs:
2024-05-16 16:17:16 [scrapy-playwright] DEBUG: [Context=default] Overridden method for Playwright request to ...: original=GET new=POST
This seems like a bug with the method override (https://github.com/scrapy-plugins/scrapy-playwright/pull/177); I got the expected behavior by commenting out these two lines.
This happens because the site makes a POST request to the same URL after each radio button click, and that triggers this logic. It's tricky to recognize which Playwright request corresponds to the Scrapy request. I've attempted a few ways, and at this point I'm not sure exactly how to solve it once and for all (other than having some meta key like no_method_override, which I don't really like).
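To make the ambiguity concrete, here is a minimal sketch of URL-only matching. It is not the handler's actual code: the Req class, the would_override function and the example.org URL are all hypothetical, used only to show why a post-back to the same URL looks like the original Scrapy request.

```python
from dataclasses import dataclass


@dataclass
class Req:
    """Hypothetical stand-in for a request; not a Scrapy or Playwright class."""
    url: str
    method: str


def would_override(scrapy_req: Req, playwright_req: Req) -> bool:
    """Illustrative URL-only matching: any browser request to the Scrapy
    request's URL is assumed to be that request, so a differing method
    would be replaced with the Scrapy request's method."""
    same_url = playwright_req.url.rstrip("/") == scrapy_req.url.rstrip("/")
    different_method = playwright_req.method.upper() != scrapy_req.method.upper()
    return same_url and different_method


scrapy_get = Req("https://example.org/form", "GET")
initial_navigation = Req("https://example.org/form", "GET")
postback_after_click = Req("https://example.org/form", "POST")

print(would_override(scrapy_get, initial_navigation))    # False
print(would_override(scrapy_get, postback_after_click))  # True
```

With only the URL to go on, the post-back triggered by the click cannot be told apart from the initial navigation, which is why the POST ends up being rewritten.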
However, there's a workaround: if you make the first request a POST, the methods match and there is no override.
You also need PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None in your settings, otherwise the page does not reload (it probably doesn't recognize the headers as correct).
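A rough sketch of the workaround, assuming a spider that passes these values to scrapy.Request and to its settings. The names WORKAROUND_SETTINGS and first_request_kwargs and the example.org URL are made up for illustration; only the setting name, the method and the "playwright" meta key come from this thread.

```python
# Settings fragment: goes in settings.py or a spider's custom_settings.
WORKAROUND_SETTINGS = {
    # None disables the handler's header processing, so the browser
    # sends its own headers.
    "PLAYWRIGHT_PROCESS_REQUEST_HEADERS": None,
}


def first_request_kwargs(url: str) -> dict:
    """Build keyword arguments for the first request (hypothetical helper).

    Using POST here means the Scrapy method already matches the site's
    post-back, so there is nothing to override.
    """
    return {"url": url, "method": "POST", "meta": {"playwright": True}}


kwargs = first_request_kwargs("https://example.org/form")
# In a spider's start_requests: yield scrapy.Request(**kwargs)
print(kwargs["method"])  # POST
```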
@elacuesta I've had the same problem and spent a lot of hours debugging :D
The site was malfunctioning because of the headers being sent; PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None fixed the problem, huge thanks!
@elacuesta I found this issue after debugging for hours and hours. Setting PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None does something good (all JS ends up loading and __doPostBack is available), but then the app still does not work; I suspect a cookie is not being set.
Any other workarounds?
I suspect a cookie is not being set.
This shouldn't be the case with PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None: the handler will not override headers in this case.
Any other workarounds?
Hard to know without seeing what you're trying to do.
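One way to check the cookie suspicion is to dump what the browser context holds right after the click. The sketch below assumes the cookie list comes from Playwright's `await page.context.cookies()`, which returns a list of dicts; the helper name cookie_names_by_domain and the sample data are hypothetical.

```python
from typing import Dict, List


def cookie_names_by_domain(cookies: List[dict]) -> Dict[str, List[str]]:
    """Group cookie names by domain so two runs can be compared by eye."""
    grouped: Dict[str, List[str]] = {}
    for cookie in cookies:
        grouped.setdefault(cookie.get("domain", ""), []).append(cookie.get("name", ""))
    return {domain: sorted(names) for domain, names in grouped.items()}


# In a scrapy-playwright callback or a plain Playwright script:
#     cookies = await page.context.cookies()
# Made-up sample data standing in for that call:
sample_cookies = [
    {"name": "session", "value": "abc", "domain": "example.org", "path": "/"},
    {"name": "accepted", "value": "1", "domain": "example.org", "path": "/"},
]

print(cookie_names_by_domain(sample_cookies))
# {'example.org': ['accepted', 'session']}
```

Running the same helper in the working and the non-working script should show quickly whether a cookie is missing.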
@elacuesta thanks, I'll debug this further with PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None and report back.
If I cannot find it, I'll post a working script that uses Playwright directly and another, non-working one that uses scrapy-playwright.
So, these are my findings:
Without PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None: not all JS is loaded (see HAR scrapy-playwright.har).
With PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None: all JavaScript is loaded, but when the button is clicked it requests the same URL (see HAR scrapy-playwright-NO_PROCESS_HEADERS.har).
I've uploaded the HAR files to: https://drive.google.com/drive/folders/1sxvG_Suh-XYg6DGHB-761DCW1cpnHaLS?usp=sharing
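The HAR files can also be compared programmatically rather than by eye. This is a small sketch, assuming plain JSON HAR files with the standard log.entries[].request.method and .url fields; the function names and the in-memory sample data are illustrative only.

```python
import json
from collections import Counter


def load_har(path: str) -> dict:
    """Load a HAR file (plain JSON) from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def request_counts(har: dict) -> Counter:
    """Count (method, url) pairs across all entries of a HAR document."""
    return Counter(
        (entry["request"]["method"], entry["request"]["url"])
        for entry in har["log"]["entries"]
    )


def diff_requests(har_a: dict, har_b: dict):
    """Return the requests seen only in A and the requests seen only in B."""
    a, b = request_counts(har_a), request_counts(har_b)
    return sorted((a - b).elements()), sorted((b - a).elements())


def _entry(method: str, url: str) -> dict:
    return {"request": {"method": method, "url": url}}


# Made-up miniature HARs; real ones would come from load_har("/tmp/playwright.har") etc.
working = {"log": {"entries": [_entry("GET", "https://example.org/"),
                               _entry("POST", "https://example.org/")]}}
broken = {"log": {"entries": [_entry("GET", "https://example.org/"),
                              _entry("GET", "https://example.org/")]}}

only_working, only_broken = diff_requests(working, broken)
print(only_working)  # [('POST', 'https://example.org/')]
print(only_broken)   # [('GET', 'https://example.org/')]
```

A POST that shows up only in the plain Playwright HAR, matched by an extra GET in the scrapy-playwright one, would point at the method override.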
Code for the scrapy-playwright:
import scrapy


class TestSpider(scrapy.Spider):
    name = "test"
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "PLAYWRIGHT_PROCESS_REQUEST_HEADERS": None,
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "headless": False,
            "proxy": {"server": "http://127.0.0.1:8888"},
            "timeout": 120 * 1000,  # 120 seconds, in milliseconds
        },
    }

    def start_requests(self):
        # GET request
        yield scrapy.Request(
            "https://ccms.clerk.org/",
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_context_kwargs": {
                    "record_har_mode": "full",
                    "record_har_path": "/tmp/scrapy-playwright-NO_PROCESS_HEADERS.har",
                },
            },
        )

    async def parse(self, response, **kwargs):
        page = response.meta["playwright_page"]
        print("click")
        await page.locator("#Content1_button_accept").click()
        print(await page.title())
        await page.wait_for_selector("#Content1_CaseNum")
        return {"url": response.url}
Code for the playwright only test script:
import asyncio

import playwright
from playwright.async_api import (
    async_playwright,
    TimeoutError as PlaywrightTimeoutError,
)


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,
            proxy={"server": "http://127.0.0.1:8888"},
        )
        context = await browser.new_context(
            record_har_mode="full",
            record_har_path="/tmp/playwright.har",
        )
        page = await context.new_page()
        await page.goto("https://ccms.clerk.org/")
        print("click")
        await page.locator("#Content1_button_accept").click()
        print(await page.title())
        await page.wait_for_selector("#Content1_CaseNum")
        await context.close()


asyncio.run(main())
Unfortunately the website only works for US IP addresses (I'm using a proxy)
Thanks in advance!
After clicking two radio buttons, the page should post back and display a form. Unfortunately, this is not happening. It works in regular Playwright, and I can't understand why.
This is the broken code:
And this is the pure Playwright working code: