Closed zaheerkzz closed 3 months ago
This is not specific to Scrapy Playwright, it's standard URL manipulation done by upstream Scrapy (specifically in scrapy.http.Request._set_url, invoking scrapy.utils.url.escape_ajax):
$ python
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import scrapy
>>> url = "https://example.org/foo.bar#!/a/b/c"
>>> scrapy.Request(url)
<GET https://example.org/foo.bar?_escaped_fragment_=%2Fa%2Fb%2Fc>
>>> scrapy.utils.url.escape_ajax(url)
'https://example.org/foo.bar?_escaped_fragment_=%2Fa%2Fb%2Fc'
>>>
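If the original `#!` form is needed again later (for logging, or to hand the URL to something other than Scrapy), the rewrite shown above can be reversed with the standard library alone. This is only a minimal sketch: `unescape_ajax` is a hypothetical helper name, not a Scrapy function, and it assumes the `_escaped_fragment_` parameter was produced by the rewrite shown above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def unescape_ajax(url: str) -> str:
    """Hypothetical helper: turn '?_escaped_fragment_=X' back into '#!X'.

    URLs without an `_escaped_fragment_` parameter are returned unchanged.
    """
    parts = urlsplit(url)
    fragment = None
    kept = []
    # parse_qsl percent-decodes values, so '%2Fa%2Fb' becomes '/a/b'.
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key == "_escaped_fragment_" and fragment is None:
            fragment = value
        else:
            kept.append((key, value))
    if fragment is None:
        return url
    # Re-encode the remaining query parameters and restore the '#!' fragment.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "!" + fragment)
    )


print(unescape_ajax("https://example.org/foo.bar?_escaped_fragment_=%2Fa%2Fb%2Fc"))
```

Note that any other query parameters are decoded and re-encoded along the way, so their percent-encoding may differ slightly from the input.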
I have an issue: when I load my URL using
scrapy.Request()
it changes the URL from https://store.servicenow.com/sn_appstore_store.do#!/store/application/0ea3c3d1db7232006bf8ffa31d96190f to the one shown in this log line:
2024-03-26 20:47:49 [scrapy.core.scraper] ERROR: Spider error processing <GET https://store.servicenow.com/sn_appstore_store.do?_escaped_fragment_=%2Fstore%2Fapplication%2F0ea3c3d1db7232006bf8ffa31d96190f> (referer: None)
In other words, ?_escaped_fragment_= replaces #!
Is there any way to stop Scrapy from changing the URL? With the rewritten URL, the page data is missing.
This is my code:
custom_settings = {
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    'CONCURRENT_REQUESTS': 1,
    'DOWNLOAD_DELAY': 3,
    'PLAYWRIGHT_BROWSER_TYPE': 'chromium',
    'PLAYWRIGHT_PAGE_METHODS': {
        'default': DisableJavaScript,
    },
    "DOWNLOAD_HANDLERS": {
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    "PLAYWRIGHT_LAUNCH_OPTIONS": {
        "headless": False,
        # Arguments to mimic regular browser settings
        'args': ['--disable-web-security', '--no-sandbox', '--disable-features=IsolateOrigins,site-per-process'],
    },
    'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
    "PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT": 1000000,
}
Thanks