scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

(Probable Playwright Request Overridden issue) scrapy-playwright does not seem to work on the website "njcourts", but Playwright works #199

Closed Binit-Dhakal closed 1 year ago

Binit-Dhakal commented 1 year ago

Description

I am trying to scrape the website "https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces", but I cannot get past the homepage with scrapy-playwright, although every operation works with plain Playwright. If I click any of the navigation tabs or click Search, I am redirected to the page shown in the attached image (the URL stays the same as above). The website is not blocking us, since the same steps work with plain Playwright, as shown below. [image: njcourts_error]

Steps to Reproduce

Scrapy-Playwright Code

import scrapy


class NjcourtsSpider(scrapy.Spider):
    """
    Spider that scrapes njcourts.gov.
    """
    name = 'njcourts2'
    # settings to scrape slowly
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'CONCURRENT_REQUESTS': 1,
        'COOKIES_DEBUG': True,
        'PLAYWRIGHT_PROCESS_REQUEST_HEADERS': None
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces",
            meta={
                'playwright': True,
                "playwright_include_page": True,
            }
        )

    async def parse(self, response):
        page = response.meta['playwright_page']

        judgement_num = page.locator("""
            //a[@onclick="return myfaces.oam.submitForm('judgmentSearchForm','judgmentSearchForm:j_id_jsp_1959880460_15');"]
        """)

        print(await judgement_num.count())  # => 1
        await judgement_num.click()

        await page.wait_for_timeout(10000)  # redirected to the page shown in the image attached above

Vanilla Playwright code

import asyncio

from playwright.async_api import async_playwright


async def main():
    # The context manager starts Playwright and stops it on exit.
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces")

        judgement_num = page.locator("""
            //a[@onclick="return myfaces.oam.submitForm('judgmentSearchForm','judgmentSearchForm:j_id_jsp_1959880460_15');"]
        """)

        print(await judgement_num.count())  # => 1
        await judgement_num.click()  # This works

        await page.wait_for_timeout(10000)
        await browser.close()


asyncio.run(main())

Versions

playwright-python: 1.32.1
scrapy-playwright: 0.0.26
scrapy: 2.7.1

Additional Information

The site seems to only work for American IPs.

If you cannot reproduce the issue or need more information, please let me know. I would really appreciate it if you could point me in the right direction from here.

Thank you, Binit

Binit-Dhakal commented 1 year ago

I dug deeper into the issue and found a similar bug in #100, which seems to have been closed after a new pull request: https://github.com/scrapy-plugins/scrapy-playwright/pull/144/files. However, I think the issue is still not resolved. This is the part of the log file where it happens. Maybe this is the cause?

2023-05-15 09:05:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces> (referer: https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces) ['playwright']
2023-05-15 09:05:35 [scrapy-playwright] DEBUG: [Context=default] Request: <POST https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces> (resource type: document, referrer: https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces)
2023-05-15 09:05:35 [scrapy-playwright] DEBUG: [Context=default] Overridden method for Playwright request to https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces: original=POST new=GET
2023-05-15 09:05:36 [scrapy-playwright] DEBUG: [Context=default] Response: <400 https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces> (referrer: None)
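
In a longer log, the same symptom can be found by searching for the "Overridden method" debug line. The following is a minimal sketch, assuming the message format matches the lines above exactly; `OVERRIDE_RE` and `find_method_overrides` are hypothetical names for illustration, not part of scrapy-playwright.

```python
import re

# Pattern for the "Overridden method" debug line, assuming the format shown in the log above.
OVERRIDE_RE = re.compile(
    r"Overridden method for Playwright request to (?P<url>\S+): "
    r"original=(?P<original>\w+) new=(?P<new>\w+)"
)


def find_method_overrides(log_lines):
    """Return (url, original_method, new_method) for each override line found."""
    found = []
    for line in log_lines:
        match = OVERRIDE_RE.search(line)
        if match:
            found.append((match["url"], match["original"], match["new"]))
    return found


# Two lines copied from the log above.
sample_log = [
    "2023-05-15 09:05:35 [scrapy-playwright] DEBUG: [Context=default] Request: "
    "<POST https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces> "
    "(resource type: document, referrer: "
    "https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces)",
    "2023-05-15 09:05:35 [scrapy-playwright] DEBUG: [Context=default] Overridden method "
    "for Playwright request to "
    "https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces: "
    "original=POST new=GET",
]

overrides = find_method_overrides(sample_log)
for url, original, new in overrides:
    print(f"{original} -> {new}: {url}")
```

Any entry where the original method is POST and the new one is GET points at a form submission that was rewritten before it reached the server.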

Binit-Dhakal commented 1 year ago

I checked the HAR file, recorded through a browser context:

PLAYWRIGHT_CONTEXTS = {
    "har_saver": {
        "record_har_path": "pw.har"
    }
}
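
Once the HAR file has been written (Playwright saves it when the browser context is closed), the recorded methods can be listed with a few lines of Python. This is a minimal sketch that assumes the standard HAR JSON layout (`log.entries[].request` and `log.entries[].response`); the helper names and the sample data are made up for illustration, and the request presumably has to be routed through the "har_saver" context (via the `playwright_context` meta key) for anything to be recorded.

```python
import json
from pathlib import Path


def summarize_har(har):
    """Return (method, url, status) for every entry in a parsed HAR document."""
    return [
        (
            entry["request"]["method"],
            entry["request"]["url"],
            entry["response"]["status"],
        )
        for entry in har["log"]["entries"]
    ]


def summarize_har_file(path):
    """Load a HAR file from disk (for example "pw.har") and summarize it."""
    return summarize_har(json.loads(Path(path).read_text(encoding="utf-8")))


# Hand-written sample in HAR layout, for illustration only (not real capture data).
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"method": "GET", "url": "https://example.com/form"},
                "response": {"status": 200},
            },
            {
                "request": {"method": "GET", "url": "https://example.com/form"},
                "response": {"status": 400},
            },
        ]
    }
}

summary = summarize_har(sample_har)
for method, url, status in summary:
    print(method, status, url)
```

With a real capture, `summarize_har_file("pw.har")` would show whether the form submission went out as POST or GET.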

The original request should have been a POST, but with scrapy-playwright the request is recorded as a GET. This appears to be the effect of this bug. Is there a way to stop scrapy-playwright from modifying the request, or is the modification necessary for the library to work?

I will appreciate it if you can point me in the right direction and let me know if this is the real issue.

I feel like something is wrong in this conditional. Could the request be changed only when it corresponds to the scrapy.Request, or is it necessary to change the request method in other cases too? I would love to hear why this decision was made. https://github.com/scrapy-plugins/scrapy-playwright/blob/main/scrapy_playwright/handler.py#L505

Thank you,

elacuesta commented 1 year ago

The code you mentioned in your comment was updated in #177, which has not been released yet. It will likely solve your issue: I suspect that your POST request is not a navigation request, so it will not trigger the block that overrides the method.
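
The kind of guard described here can be sketched as a small decision function. This is an illustration under stated assumptions, not the code from #177: `method_override` is a hypothetical helper, and in a real route handler the `is_navigation` flag would presumably come from Playwright's `Request.is_navigation_request()`.

```python
from typing import Optional


def method_override(
    scrapy_method: str, playwright_method: str, is_navigation: bool
) -> Optional[str]:
    """Return the method to force on the Playwright request, or None to leave it alone.

    Hypothetical helper: only navigation requests are candidates for an override.
    """
    if not is_navigation:
        return None
    if scrapy_method.upper() == playwright_method.upper():
        return None
    return scrapy_method.upper()


# The original Scrapy request was a GET; the page later sends a POST.
not_navigation = method_override("GET", "POST", is_navigation=False)
navigation = method_override("GET", "POST", is_navigation=True)
print(not_navigation)  # None: the POST is left untouched
print(navigation)      # GET: the method would be overridden
```

Under this sketch, a POST that Playwright does not report as a navigation request keeps its method, which is the outcome suggested above.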

elacuesta commented 1 year ago

#177 was just released as part of v0.0.27.

Closing, feel free to reopen if you continue to experience the behavior.