scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

How to block specific requests in scrapy-playwright based on URL patterns? #284

Closed: no2catisme closed 3 months ago

no2catisme commented 3 months ago

I'm using scrapy-playwright and I want to block certain requests based on URL patterns. Specifically, I need to:

1. Block requests that contain 'img' in the URL.
2. Block requests that contain 'Analytics' in the URL.

2024-07-01 23:13:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://img.example.com/next-product/2021/05/26/46c6b943487440f5911d7d47f113654e_20210526094614.jpg?width=600> (resource type: image, referrer: https://shop.example.com/)
2024-07-01 23:13:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://img.example.com/item/202310/11ee77cc7e0b7deb83bc876197299e47.jpg?width=600> (resource type: image, referrer: https://shop.example.com/)
2024-07-01 23:13:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://img.example.com/next-product/2023/03/14/87635b7167914c1899c67c753057f50a_20230314091454.jpg?width=600> (resource type: image, referrer: https://shop.example.com/)
2024-07-01 23:14:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.google-analytics.com/privacy-sandbox/register-conversion?_c=1&cid=616391787.1719843239&dbk=17579890348778814022&dma=0&en=page_view&gtm=45je46q0v874388414za200&npa=0&tid=G-TZEE8GDKJ6&dl=https%3A%2F%2Fshop.29cm.co.kr%3F> (resource type: fetch)

Questions:

Is there a built-in way to block specific requests in scrapy-playwright? I'm not sure how to implement request blocking based on URL patterns in scrapy-playwright.

If not, what's the recommended approach to achieve this? Should I use Playwright's request interception capabilities, and if so, how can I integrate this with scrapy-playwright?

Can you provide a code example of how to implement this in my spider?

Thank you for your help!

elacuesta commented 3 months ago

https://github.com/scrapy-plugins/scrapy-playwright/tree/v0.0.36?tab=readme-ov-file#playwright_abort_request
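
The linked README section covers the `PLAYWRIGHT_ABORT_REQUEST` setting, which takes a predicate that receives a Playwright `Request` object and returns `True` when the request should be aborted. Below is a minimal sketch of how that could be applied to the two patterns from the question. The names `BLOCKED_URL_PARTS` and `should_abort_request` are illustrative and not part of the library, and the case-insensitive match is an assumption made so that 'Analytics' also catches `google-analytics.com` as seen in the log.

```python
# settings.py (the same setting can also go in a spider's custom_settings)

# Substrings to block, taken from the question. Illustrative name, not part
# of scrapy-playwright. Compared in lowercase (an assumption) so that
# "Analytics" also matches "www.google-analytics.com".
BLOCKED_URL_PARTS = ("img", "analytics")


def should_abort_request(request) -> bool:
    """Return True if this Playwright request should be aborted.

    `request` is expected to be a playwright.async_api.Request; only its
    `url` attribute is used here.
    """
    url = request.url.lower()
    return any(part in url for part in BLOCKED_URL_PARTS)


# scrapy-playwright calls this predicate for each request the browser makes.
PLAYWRIGHT_ABORT_REQUEST = should_abort_request
```

Note that a bare substring such as 'img' is broad and will also match any unrelated URL that happens to contain it. Since the log above reports `resource type: image` for those requests, checking `request.resource_type == "image"` inside the predicate may be a more precise alternative.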