scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

Inconsistent behavior between scrapy_playwright and playwright when accessing web pages #265

Status: Closed (LoyAngel closed this issue 6 months ago)

LoyAngel commented 6 months ago

Hello,

I'm seeing inconsistent behavior between scrapy_playwright and playwright when accessing web pages. Using playwright directly, I can load the page without any issues. Using scrapy_playwright, the site appears to detect an outdated browser and shows a browser version warning. I would like to understand what difference between the two approaches could cause this. My environment setup and the source code for two separate tests are below:

Environment Setup:

Operating System: Ubuntu 11.04
Python version: 3.9.2
Python packages:
playwright==1.42.0
Twisted==22.10.0
Scrapy==2.9.0
scrapy-playwright==0.0.34

Using Twitter for testing below.

Source Code for Test 1 (using playwright directly):

from playwright.async_api import async_playwright

async def main():
    urls = "https://www.twitter.com"
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(urls)
        await page.screenshot(path="example.png")
        await browser.close()

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

Result: example

Source Code for Test 2 (using scrapy_playwright):

import scrapy

class PwTestSpider(scrapy.Spider):
    name = "pw_test"

    def start_requests(self):
        # GET request
        url = "https://www.twitter.com"
        request_meta = {
            "playwright": True,
            "playwright_include_page": True,
            "playwright_context_kwargs": {},
            "playwright_page_goto_kwargs": {"wait_until": "commit"},
            "handle_httpstatus_all": True
        }
        yield scrapy.Request(url, meta=request_meta, dont_filter=True)

    async def parse(self, response, **kwargs):
        # 'response' contains the page as seen by the browser
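        page = response.meta["playwright_page"]
        await page.screenshot(path="screenshot.png")
        # With playwright_include_page=True the page is not closed automatically, so close it here
        await page.close()
        return {"url": response.url}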

Result: screenshot

I have compared the two test cases and cannot identify any significant differences that could explain this inconsistency. Therefore, I would appreciate any insights or guidance on why this discrepancy is occurring.

Thank you for any help!

elacuesta commented 6 months ago

By default you get Scrapy's user agent and it seems like the site does not like that. You can verify it by requesting https://httpbin.org/headers. See the section about the user agent header in the docs.
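
For anyone landing here later, a minimal settings sketch of the fix described above. It assumes the approach from the user agent section of the scrapy-playwright README as I understand it: setting Scrapy's `USER_AGENT` to `None` so the browser sends its own default user agent. The handler and reactor entries are the usual scrapy-playwright setup and were not shown in the original report.

```python
# settings.py (sketch, not the reporter's actual configuration)

# Route HTTP(S) requests through scrapy-playwright's download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright needs the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Assumption: with no Scrapy user agent set, the browser's own
# User-Agent header is sent instead of Scrapy's default one.
USER_AGENT = None
```

To check the effect, point the spider at https://httpbin.org/headers before and after the change and compare the `User-Agent` value in the response.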

LoyAngel commented 6 months ago

The problem has been successfully resolved. Thank you very, very much!!!

LoyAngel commented 6 months ago

Thank you, buddy, for your help. I really appreciate it.