Open junior-g opened 2 weeks ago
I'm sorry, I cannot reproduce:
# test.py
import scrapy


class TestSpider(scrapy.Spider):
    name = "test"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "headless": False,
        },
        "USER_AGENT": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
        "LOG_LEVEL": "INFO",
    }

    def start_requests(self):
        yield scrapy.Request(
            url="URL",
            meta={
                "playwright": True,
                "playwright_include_page": True,
            },
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.screenshot(path="croma.png")
        await page.close()
        print("Response parsing")
        print(response.xpath("//h1/text()").get())
$ scrapy runspider test.py
...
2024-11-12 16:50:44 [scrapy.core.engine] INFO: Spider opened
2024-11-12 16:50:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-11-12 16:50:44 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-11-12 16:50:44 [scrapy-playwright] INFO: Starting download handler
2024-11-12 16:50:44 [scrapy-playwright] INFO: Starting download handler
2024-11-12 16:50:49 [scrapy-playwright] INFO: Launching browser chromium
2024-11-12 16:50:49 [scrapy-playwright] INFO: Browser chromium launched
Response parsing
iFFALCON Q73 126 cm (50 inch) 4K Ultra HD QLED Google TV with Dolby Audio (2023 model)
2024-11-12 16:50:52 [scrapy.core.engine] INFO: Closing spider (finished)
...
Note that I had to set a custom User-Agent, otherwise I was getting 403 status responses.
Versions used:
$ playwright --version
Version 1.48.0
$ python -c "import scrapy_playwright; print(scrapy_playwright.__version__)"
0.0.42
$ scrapy version -v
Scrapy : 2.11.2
lxml : 5.2.2.0
libxml2 : 2.12.6
cssselect : 1.2.0
parsel : 1.9.1
w3lib : 2.2.1
Twisted : 24.3.0
Python : 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
pyOpenSSL : 24.2.1 (OpenSSL 3.3.1 4 Jun 2024)
cryptography : 43.0.0
Platform : Linux-6.5.0-45-generic-x86_64-with-glibc2.35
@elacuesta thanks for the quick reply. Yes, it works when I run it as a separate project. But there is one more issue: when I set
    "PLAYWRIGHT_LAUNCH_OPTIONS": {
        "headless": True,
    },
it prints None instead of the title. Why is that?
I see. I suppose the site could be detecting and blocking headless browsers; I'm seeing the same behavior with standalone Playwright:
import asyncio
from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(URL)
        await page.screenshot(path="page.png")
        print(await page.locator("//h1").text_content())
        await browser.close()


if __name__ == "__main__":
    asyncio.run(main())
This prints iFFALCON Q73 126 cm (50 inch) 4K Ultra HD QLED Google TV with Dolby Audio (2023 model). However, when I pass headless=True, I get a 403 response with "Access Denied".
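In case it helps with debugging, here is a rough sketch of how one might compare the two modes by inspecting the status that page.goto returns. The names looks_blocked and probe are hypothetical helpers, not part of Playwright or scrapy-playwright, and the "Access Denied" marker is just the text I saw on this site; other sites may block differently.

```python
import asyncio


def looks_blocked(status: int, body: str) -> bool:
    """Hypothetical check: 403 status or an "Access Denied" marker in the page body."""
    return status == 403 or "access denied" in body.lower()


async def probe(url: str, headless: bool) -> bool:
    """Load url once in Chromium and report whether the result looks like a block page."""
    # Imported lazily so looks_blocked() can be used without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=headless)
        try:
            page = await browser.new_page()
            # page.goto returns the main resource response, or None in some cases.
            response = await page.goto(url)
            status = response.status if response is not None else 0
            body = await page.content()
            return looks_blocked(status, body)
        finally:
            await browser.close()
```

Running asyncio.run(probe(url, headless=True)) and asyncio.run(probe(url, headless=False)) against the same address should show whether only the headless run is being rejected.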
I am getting the following error with my basic Scrapy + Playwright setup:
I am following this guide: https://scrapeops.io/python-scrapy-playbook/scrapy-playwright/
Why am I getting this error?
(edited to adjust formatting)