Closed Binit-Dhakal closed 1 year ago
I dug deeper into this and found a similar issue in #100, which appears to have been closed after pull request https://github.com/scrapy-plugins/scrapy-playwright/pull/144/files. However, I don't think the issue is actually resolved. This is the part of the log file where the problem occurs. Maybe this is the cause?
2023-05-15 09:05:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces> (referer: https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces) ['playwright']
2023-05-15 09:05:35 [scrapy-playwright] DEBUG: [Context=default] Request: <POST https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces> (resource type: document, referrer: https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces)
2023-05-15 09:05:35 [scrapy-playwright] DEBUG: [Context=default] Overridden method for Playwright request to https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces: original=POST new=GET
2023-05-15 09:05:36 [scrapy-playwright] DEBUG: [Context=default] Response: <400 https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces> (referrer: None)
I checked the HAR by recording it through a browser context:
PLAYWRIGHT_CONTEXTS = {
    "har_saver": {
        "record_har_path": "pw.har",
    },
}
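To see which HTTP methods were recorded, the HAR file can be read as plain JSON. The snippet below is a minimal sketch that uses only the standard library. The helper name `list_har_requests` and its `url_substring` parameter are made up for illustration and are not part of scrapy-playwright or Playwright.

```python
import json
from pathlib import Path


def list_har_requests(har_path, url_substring=""):
    """Return (method, url, status) for each HAR entry whose URL contains url_substring.

    Hypothetical helper for illustration; it relies only on the standard
    HAR layout: log -> entries -> request/response.
    """
    har = json.loads(Path(har_path).read_text(encoding="utf-8"))
    rows = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        response = entry.get("response", {})
        url = request.get("url", "")
        # An empty url_substring matches every entry.
        if url_substring in url:
            rows.append((request.get("method"), url, response.get("status")))
    return rows
```

For example, `list_har_requests("pw.har", "judgmentSearch.faces")` would list every recorded request to the search page with its method and response status, which makes a POST that was turned into a GET easy to spot.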
The original request should have been a POST, but with scrapy-playwright the request is recorded as a GET. I believe this is the result of this bug. Is there a way to stop scrapy-playwright from modifying the request, or is the override necessary for the library to work?
I would appreciate it if you could point me in the right direction and let me know whether this is the real issue.
I feel like something is wrong in this conditional. Could we change the request only when it comes from a scrapy.Request? Otherwise, is it necessary to change the request method at all? I would love to hear why this decision was made. https://github.com/scrapy-plugins/scrapy-playwright/blob/main/scrapy_playwright/handler.py#L505
Thank you,
The code you mentioned in your comment was updated in #177, which has not been released yet. It will likely solve your issue: I suspect that your POST request is not a navigation request, so it will not trigger the block that overrides the method.
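The condition described in this reply can be modelled roughly as below. This is a simplified sketch for discussion only. It is not the code from handler.py or from #177, and the function name and parameters are hypothetical. It assumes the override applies only when the intercepted request is a navigation request for the same URL as the Scrapy request and the two methods differ.

```python
def should_override_method(
    is_navigation_request,
    playwright_url,
    scrapy_url,
    playwright_method,
    scrapy_method,
):
    """Hypothetical model of the decision, not the library's implementation.

    Returns True only when the intercepted Playwright request is a navigation
    request for the same URL as the Scrapy request and the methods differ.
    """
    # Compare URLs while ignoring a trailing slash (an assumption of this sketch).
    same_url = playwright_url.rstrip("/") == scrapy_url.rstrip("/")
    methods_differ = playwright_method.upper() != scrapy_method.upper()
    return bool(is_navigation_request and same_url and methods_differ)
```

Under this model, the logged case (a POST issued by the page to the same URL that the Scrapy GET request targets) is overridden only if Playwright treats that POST as a navigation request. If it is not a navigation request, the method is left alone.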
Description
I am trying to scrape the website "https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces", but with scrapy_playwright I cannot get any further than the homepage, while all of the same operations work with plain Playwright. If I click on any of the navigation tabs or click search, I get redirected to the page shown in the attached image (the URL is the same as above). The website is not blocking us, as I can make this work with Playwright as shown below.
Steps to Reproduce
Scrapy-Playwright Code
Vanilla Playwright code
Versions
Additional Information
The site seems to only work for American IPs.
If you cannot reproduce the issue or need more information, please let me know. I would greatly appreciate it if you could point me in the right direction from here.
Thank you, Binit