scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

my URL changes when scrapy.request #267

Closed: zaheerkzz closed this issue 3 months ago

zaheerkzz commented 3 months ago

I have an issue: when I load my URL using scrapy.Request(), the URL changes from https://store.servicenow.com/sn_appstore_store.do#!/store/application/0ea3c3d1db7232006bf8ffa31d96190f

to https://store.servicenow.com/sn_appstore_store.do?_escaped_fragment_=%2Fstore%2Fapplication%2F0ea3c3d1db7232006bf8ffa31d96190f, as this log line shows:

2024-03-26 20:47:49 [scrapy.core.scraper] ERROR: Spider error processing <GET https://store.servicenow.com/sn_appstore_store.do?_escaped_fragment_=%2Fstore%2Fapplication%2F0ea3c3d1db7232006bf8ffa31d96190f> (referer: None)

The #! is replaced by ?_escaped_fragment_=.

Is there any way to stop the URL from being changed? With the rewritten URL, the page data is missing.

This is my code:

custom_settings = {
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    'CONCURRENT_REQUESTS': 1,
    'DOWNLOAD_DELAY': 3,
    'PLAYWRIGHT_BROWSER_TYPE': 'chromium',
    'PLAYWRIGHT_PAGE_METHODS': {
        'default': DisableJavaScript,
    },
    "DOWNLOAD_HANDLERS": {
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    "PLAYWRIGHT_LAUNCH_OPTIONS": {
        "headless": False,
        # Arguments to mimic regular browser settings
        'args': ['--disable-web-security', '--no-sandbox', '--disable-features=IsolateOrigins,site-per-process'],
    },
    'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
    "PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT": 1000000,
}

`def start_requests(self):
    for url in testing_urls:
        yield scrapy.Request(
            url=fragment_url,
            callback=self.parse_item,
            meta={
                "playwright": True,
                "PLAYWRIGHT_URL": url,
                "PLAYWRIGHT_OPTIONS": {
                    "websockets": False,
                },
            },
        )`

Thanks

elacuesta commented 3 months ago

This is not specific to Scrapy Playwright; it is standard URL manipulation done by upstream Scrapy (specifically in scrapy.http.Request._set_url, which invokes scrapy.utils.url.escape_ajax):

$ python
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import scrapy
>>> url = "https://example.org/foo.bar#!/a/b/c"
>>> scrapy.Request(url)
<GET https://example.org/foo.bar?_escaped_fragment_=%2Fa%2Fb%2Fc>
>>> scrapy.utils.url.escape_ajax(url)
'https://example.org/foo.bar?_escaped_fragment_=%2Fa%2Fb%2Fc'
>>>
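
To make the rewrite shown in the session above easier to follow, here is a minimal standard-library sketch of the same transformation. It is a simplified re-implementation for illustration, not Scrapy's actual code: the function names `escape_hashbang` and `unescape_hashbang` are hypothetical, and the inverse helper is an assumption about how one might recover the original `#!` URL from a rewritten one.

```python
from urllib.parse import parse_qsl, urldefrag, urlencode, urlsplit, urlunsplit

PARAM = "_escaped_fragment_"


def escape_hashbang(url: str) -> str:
    """Hypothetical helper: move a '#!' fragment into the query string,
    mimicking the output shown for scrapy.utils.url.escape_ajax."""
    defragged, fragment = urldefrag(url)
    if not fragment.startswith("!"):
        # Ordinary fragments (or no fragment) are left untouched.
        return url
    parts = urlsplit(defragged)
    # Keep existing query parameters, dropping any previous escaped fragment.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != PARAM]
    query.append((PARAM, fragment[1:]))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))


def unescape_hashbang(url: str) -> str:
    """Hypothetical inverse: rebuild the '#!' URL from an escaped one."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    escaped = [v for k, v in query if k == PARAM]
    if not escaped:
        return url
    remaining = [(k, v) for k, v in query if k != PARAM]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(remaining), "!" + escaped[0])
    )


if __name__ == "__main__":
    original = "https://store.servicenow.com/sn_appstore_store.do#!/store/application/0ea3c3d1db7232006bf8ffa31d96190f"
    rewritten = escape_hashbang(original)
    print(rewritten)
    print(unescape_hashbang(rewritten))
```

Applied to the URL from the original report, the first function yields the `?_escaped_fragment_=` form seen in the error log, and the second restores the `#!` form. Whether a restored URL can then be handed to the browser unchanged depends on the download handler and is not covered by this sketch.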