scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

Question on optimizing Scrapy with Playwright for Concurrent Page Handling and Response Capture with async_generator TypeError #256

Closed: diegocostares closed this issue 9 months ago

diegocostares commented 9 months ago

Context

I am developing a scraper using Scrapy and Playwright in Python, where I need to open multiple web pages with different profiles (e.g., "EXTRACTOR 1", "EXTRACTOR 2", "EXTRACTOR 3") to expedite the scraping process. Each page is managed with its own logic and is expected to capture specific responses.

Current Workflow

  1. Within a Scrapy spider, I use Playwright's async_api to open new browser pages in a given context, dynamically creating one page per profile, such as "PAGE 1", "PAGE 2", and "PAGE 3".
  2. For each new page, I navigate to a specific URL with Playwright's goto method and perform profile-specific adjustments.
  3. On each page, I call a handle_request_logic method. This method registers a request listener with page.on("request", self.handle_request) and then triggers actions on the page, such as clicking and waiting for specific elements.
  4. Inside handle_request_logic, I build Scrapy FormRequest objects to send modified POST requests based on the page interactions, with the goal of capturing and processing the responses through Scrapy's callback mechanism (a rough sketch follows this list).
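Roughly, steps 3 and 4 look like the sketch below. This is not my real code: the selectors, the form payload, and the handle_request / parse_api_response names are placeholders, and the listener is inlined as a nested function for brevity.

async def handle_request_logic(self, page, profile_name):
    captured = []

    def handle_request(request):
        # Keep a reference to the POST requests the page fires, so they can be
        # replayed through Scrapy with a modified payload.
        if request.method == "POST":
            captured.append(request)

    page.on("request", handle_request)
    await page.click("#load-data")            # placeholder selector
    await page.wait_for_selector("#results")  # placeholder selector

    for req in captured:
        # Yielding from inside this method is the part that later triggers the
        # async_generator error described in the Problem section.
        yield scrapy.FormRequest(
            url=req.url,
            formdata={"profile": profile_name},   # modified payload
            callback=self.parse_api_response,     # placeholder callback
        )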

Problem

I am having difficulty integrating Playwright's asynchronous page handling with Scrapy's workflow. Specifically, when I manage multiple pages concurrently with Playwright inside a Scrapy spider and try to ensure that the responses captured via scrapy.FormRequest are processed correctly, I get a TypeError: object async_generator can't be used in 'await' expression.
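As far as I can tell, the error itself is raised by Python whenever an async generator (any async def that contains a yield) is awaited, independently of Scrapy or Playwright. A minimal standalone reproduction, reusing my method names purely as placeholders:

import asyncio

async def handle_request_logic():
    # The yield turns this "async def" into an async generator, not a coroutine.
    yield 1

async def parse_coupon():
    # Awaiting an async generator object raises:
    # TypeError: object async_generator can't be used in 'await' expression
    await handle_request_logic()

asyncio.run(parse_coupon())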

Question

What is the recommended approach for managing multiple pages concurrently with Playwright within a Scrapy spider, ensuring that the responses captured through scrapy.FormRequest are appropriately processed?

Simplified Code Example

async def parse_coupon(self, response):
    profiles = ["PAGE 1", "PAGE 2", "PAGE 3"]
    context = response.meta["playwright_page"].context
    for profile_name in profiles:
        page = await context.new_page()
        await page.goto("https://example.com")
        await self.handle_request_logic(page, profile_name) # I hope to obtain items from each page, but I'm not sure how to do it.
        await page.close()

async def handle_request_logic(self, page, profile_name):
    # Page setup and logic
    # ...
    # yield scrapy.FormRequest(...)

I appreciate any guidance or suggestions for improving this workflow and ensuring effective and efficient scraping.

diegocostares commented 9 months ago

Hi, I wanted to share a solution approach I've found for the problem I described. While I'm not entirely certain if it's the best method, it seems to work effectively in my context.

Solution I Found

To handle multiple asynchronous Playwright pages within a Scrapy spider, I accumulate the scrapy.FormRequest objects created by handle_request_logic into a list: each call collects the requests needed for one page and returns them, and the callback then iterates over the combined list and yields each request to Scrapy's engine. Here's how I implemented it:

async def parse_coupon(self, response):
    profiles = ["PAGE 1", "PAGE 2", "PAGE 3"]
    context = response.meta["playwright_page"].context
    all_requests = []
    for profile_name in profiles:
        page = await context.new_page()
        await page.goto("https://example.com")
        requests = await self.handle_request_logic(page, profile_name)
        all_requests.extend(requests)
        await page.close()

    for request in all_requests:
        yield request

async def handle_request_logic(self, page, profile_name):
    # Page setup and logic
    requests = []
    # ... logic to create scrapy.FormRequest ...
    requests.append(scrapy.FormRequest(...))
    return requests

This approach seems to resolve the issues I was facing, particularly the async_generator error, and integrates well with Scrapy's processing workflow.

Seeking Feedback

While this method works for my project, I'm not sure if it's the optimal solution. I'm posting it here in case it helps others facing similar challenges, and I would greatly appreciate any feedback, suggestions, or alternative approaches that might be more efficient or effective.

Thanks for any insights you can provide!

elacuesta commented 9 months ago

Sounds right. I think the parse_coupon callback could be simplified a bit by letting scrapy-playwright handle the page creation without accessing the context directly. You could also just return the request list, no need to iterate over it. The following might not match your existing code entirely, but could serve to illustrate:

class MySpider(scrapy.Spider):
    def previous_callback(self, response):
        profiles = ["PAGE 1", "PAGE 2", "PAGE 3"]
        for profile_name in profiles:
            yield scrapy.Request(
                url="https://example.com",
                callback=self.parse_coupon,
                meta={"playwright": True, "playwright_include_page": True},
                cb_kwargs={"profile_name": profile_name},
                dont_filter=True,  # needed if the URL is the same for all requests
            )

    async def parse_coupon(self, response, profile_name):
        page = response.meta["playwright_page"]
        requests = await self.handle_request_logic(page, profile_name)
        await page.close()
        return requests

    async def handle_request_logic(self, page, profile_name):
        ...

(edit) added dont_filter arg

diegocostares commented 9 months ago

In your example, wouldn't executing previous_callback result in only the first profile being run? Since the for loop isn't asynchronous, wouldn't it just return the value of the first yield?

elacuesta commented 9 months ago

No, the three requests are enqueued sequentially in the for loop, but their downloading and processing is handled asynchronously by Scrapy.
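As a plain-Python illustration (no Scrapy involved), a generator callback produces every yielded value when it is iterated, not just the first one; Scrapy iterates the callback's result and schedules each request it finds:

def previous_callback():
    for profile_name in ["PAGE 1", "PAGE 2", "PAGE 3"]:
        yield f"request for {profile_name}"

# All three values are produced, not only the first:
print(list(previous_callback()))
# ['request for PAGE 1', 'request for PAGE 2', 'request for PAGE 3']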

diegocostares commented 9 months ago

I understand that this should work... but in my case it did not :cry: Thank you very much

elacuesta commented 9 months ago

One thing that could be causing only the first request to work is Scrapy's duplicate filtering, see https://docs.scrapy.org/en/latest/topics/settings.html#dupefilter-class
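If that is the cause, a quick way to confirm it is to make the duplicate filter log every request it drops; the fix is either dont_filter=True on the affected requests (as in the edited example above) or, more bluntly, disabling the filter project-wide. A settings.py sketch:

# settings.py
# Log every filtered duplicate request instead of only the first one,
# to confirm the duplicate filter is what is swallowing the requests.
DUPEFILTER_DEBUG = True

# Disabling filtering for the whole project also works, but is usually too
# broad; prefer dont_filter=True on the specific requests that need it.
# DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter"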

diegocostares commented 9 months ago

OMG!!! It was that! Thank you so much! ❤️ ❤️ ❤️

elacuesta commented 9 months ago

Glad to hear, I edited my original comment to prevent future confusion. Should this issue be closed then?

diegocostares commented 9 months ago

yes, thank you very much for the help