scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

List of available and non available instances for PageCoroutines #60

Closed lime-n closed 2 years ago

lime-n commented 2 years ago

I have a few concerns with the number of instances available to page when using playwright integration with scrapy.

Perhaps I have not yet fully understood the integration (the lack of documentation does not help); however, I have found that the following are not compatible:

  1. waitForNavigation
  2. waitForUrl

AttributeError: 'Page' object has no attribute 'waitForNavigation'

I usually work a lot with filling forms and clicking buttons that redirect me to the next page. There does not seem to be a compatible method to wait long enough for the next page to load. I have tried waitForSelector, but it is not reliable enough.

Here's what I am working with:

import scrapy
from scrapy_playwright.page import PageCoroutine

class DoorSpider(scrapy.Spider):
    name = 'door'
    start_urls = ['https://nextdoor.co.uk/login/']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_coroutines=[
                        PageCoroutine("click", selector=".onetrust-close-btn-handler.onetrust-close-btn-ui.banner-close-button.onetrust-lg.ot-close-icon"),
                        PageCoroutine("waitForNavigation"),
                        PageCoroutine("fill", "#id_email", "my_email"),
                        PageCoroutine("fill", "#id_password", "my_password"),
                        PageCoroutine("waitForNavigation"),
                        PageCoroutine("click", selector="#signin_button"),
                        PageCoroutine("waitForNavigation"),
                        PageCoroutine("screenshot", path="cookies.png", full_page=True),
                    ],
                ),
            )

    def parse(self, response):
        yield {
            'data': response.body
        }

The screenshot shows I am still on the log-in page. I need to wait until the next page loads; I figured waitForUrl would work, since the URL changes after log-in, but scrapy_playwright does not accept it as an argument. What can I use in its place?

lime-n commented 2 years ago

Found a few that worked.

elacuesta commented 2 years ago

You are using camelCase method names, which work in the Node.js version of Playwright. The Python version has used snake_case since v1.8.0a1.

The supported coroutines are listed in the upstream documentation for the Page class.
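The rename is mostly mechanical: each camelCase name from the Node.js API maps to a snake_case name in the Python API (waitForSelector becomes wait_for_selector, waitForUrl becomes wait_for_url, and so on). A rough illustration of the mapping, as a small converter that is not part of either library; note that a few methods differ beyond the casing, so always confirm against the upstream Page documentation:

```python
import re

def to_snake_case(name: str) -> str:
    """Insert an underscore before each interior capital, then lowercase.

    Illustrative only: shows how the Node.js camelCase method names
    correspond to the Python snake_case ones.
    """
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# The method names attempted in the spider above:
for method in ("waitForSelector", "waitForUrl", "waitForNavigation"):
    print(method, "->", to_snake_case(method))
# waitForSelector -> wait_for_selector
# waitForUrl -> wait_for_url
# waitForNavigation -> wait_for_navigation
```

So in the PageCoroutine calls above, "waitForSelector" would become "wait_for_selector", and so on for the other methods, subject to the upstream docs confirming the method exists in the Python API.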