Closed diegocostares closed 9 months ago
Hi, I wanted to share a solution approach I've found for the problem I described. While I'm not entirely certain if it's the best method, it seems to work effectively in my context.
To handle multiple asynchronous Playwright pages within a Scrapy spider, I devised a method where I accumulate a list of `scrapy.FormRequest` objects in the `handle_request_logic` method. This process involves collecting all necessary requests per page and then iterating over the list to yield each request to Scrapy's engine. Here's how I implemented it:
```python
async def parse_coupon(self, response):
    profiles = ["PAGE 1", "PAGE 2", "PAGE 3"]
    context = response.meta["playwright_page"].context
    all_requests = []
    for profile_name in profiles:
        page = await context.new_page()
        await page.goto("https://example.com")
        requests = await self.handle_request_logic(page, profile_name)
        all_requests.extend(requests)
        await page.close()
    for request in all_requests:
        yield request

async def handle_request_logic(self, page, profile_name):
    # Page setup and logic
    requests = []
    # ... logic to create scrapy.FormRequest ...
    requests.append(scrapy.FormRequest(...))
    return requests
```
This approach seems to resolve the issues I was facing, particularly the `async_generator` error, and integrates well with Scrapy's processing workflow.
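For reference, that error shows up whenever an async generator object is handed directly to `await` — an `async def` that contains a `yield` is an async generator, not a coroutine, and must be iterated instead. A minimal, Scrapy-free reproduction (hypothetical names):

```python
import asyncio

async def pages():
    # An `async def` containing `yield` is an async generator, not a coroutine.
    yield "PAGE 1"

async def main():
    try:
        await pages()  # wrong: async generators can't be awaited
    except TypeError as err:
        print(err)  # object async_generator can't be used in 'await' expression
    async for page in pages():  # right: iterate with `async for`
        print(page)

asyncio.run(main())
```

Collecting the requests into a plain list and returning (or yielding) them, as above, sidesteps this by never awaiting a generator.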
While this method works for my project, I'm not sure if it's the optimal solution. I'm posting it here in case it helps others facing similar challenges, and I would greatly appreciate any feedback, suggestions, or alternative approaches that might be more efficient or effective.
Thanks for any insights you can provide!
Sounds right. I think the `parse_coupon` callback could be simplified a bit by letting scrapy-playwright handle the page creation without accessing the context directly. You could also just return the request list, no need to iterate over it.
The following might not match your existing code entirely, but could serve to illustrate:
```python
class MySpider(scrapy.Spider):
    def previous_callback(self, response):
        profiles = ["PAGE 1", "PAGE 2", "PAGE 3"]
        for profile_name in profiles:
            yield scrapy.Request(
                url="https://example.com",
                callback=self.parse_coupon,
                meta={"playwright": True, "playwright_include_page": True},
                cb_kwargs={"profile_name": profile_name},
                dont_filter=True,  # needed if the URL is the same for all requests
            )

    async def parse_coupon(self, response, profile_name):
        page = response.meta["playwright_page"]
        requests = await self.handle_request_logic(page, profile_name)
        await page.close()
        return requests

    async def handle_request_logic(self, page, profile_name):
        ...
```
(edit) added the `dont_filter` arg
In your example, wouldn't executing the `previous_callback` function result in only the first profile being run? Since the function isn't asynchronous, wouldn't the `for` loop stop at the first `yield`?
No, the three requests are enqueued sequentially in the `for` loop, but their downloading and processing is handled asynchronously by Scrapy.
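To illustrate with a plain generator (stand-in names, no Scrapy involved): a `yield` inside a `for` loop doesn't end the function; the consumer keeps pulling values until the generator is exhausted, which is exactly what Scrapy does with a callback's return value.

```python
def previous_callback():
    # Stand-in for the spider callback: one "request" per profile.
    for profile_name in ["PAGE 1", "PAGE 2", "PAGE 3"]:
        yield f"request for {profile_name}"

# Scrapy iterates the callback to completion, so all three
# requests get enqueued, not just the first one.
enqueued = list(previous_callback())
print(enqueued)
```

Whether each request then succeeds is a separate question, which is where the dupefilter comes in.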
I understand that this should work ... but in my case that did not work :cry: Thank you very much
One thing that could be causing only the first request to work is Scrapy's duplicate filtering, see https://docs.scrapy.org/en/latest/topics/settings.html#dupefilter-class
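If anyone else hits this: besides passing `dont_filter=True` on the affected requests, the dupefilter can be made visible in the logs. A minimal settings sketch (assuming the default `RFPDupeFilter`):

```python
# settings.py (or the spider's custom_settings)
# Log every request dropped as a duplicate, instead of only
# the first one — makes silently-filtered requests easy to spot.
DUPEFILTER_DEBUG = True
```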
OMG!!! It was that! Thank you so much! ❤️ ❤️ ❤️
Glad to hear, I edited my original comment to prevent future confusion. Should this issue be closed then?
yes, thank you very much for the help
Context
I am developing a scraper using Scrapy and Playwright in Python, where I need to open multiple web pages with different profiles (e.g., "EXTRACTOR 1", "EXTRACTOR 2", "EXTRACTOR 3") to expedite the scraping process. Each page is managed with its own logic and is expected to capture specific responses.
Current Workflow
- I use Playwright's `async_api` to initiate new browser pages in a given context. This involves dynamically opening multiple pages for different profiles, such as "PAGE 1", "PAGE 2", and "PAGE 3".
- Each page navigates to its target URL with the `goto` method and performs profile-specific adjustments.
- Per-page logic lives in a `handle_request_logic` method. This method sets up a request listener using `page.on("request", self.handle_request)` and triggers certain actions on the page, like clicking and waiting for specific elements.
- Within the `handle_request_logic` method, I use Scrapy's `FormRequest` to send modified POST requests based on the page interactions. The goal is to capture and process the responses using Scrapy's callback mechanism.

Problem
I am encountering difficulties integrating Playwright's asynchronous page handling with Scrapy's workflow. Specifically, I am experiencing a `TypeError: object async_generator can't be used in 'await' expression` when trying to manage multiple pages concurrently with Playwright within a Scrapy spider while ensuring that responses captured by `scrapy.FormRequest` are processed correctly.

Question
What is the recommended approach for managing multiple pages concurrently with Playwright within a Scrapy spider, ensuring that the responses captured through `scrapy.FormRequest` are appropriately processed?

Simplified Code Example
I appreciate any guidance or suggestions for improving this workflow and ensuring effective and efficient scraping.