ska-ibees opened this issue 1 year ago
Todo:

S.no | Feature name | Description | Colab Link | Status
---|---|---|---|---
1 | Full JS crawling | JS rendering for content that is dynamically rendered, using the `js_crawl` function | Link | Done
2 | Save screenshots | Capture screenshots for a given URL list using the `save_screenshot` function | Link | Done
4 | Website interactions for crawling | Simulate interactions such as clicks, scrolls, or form submissions | | Pending
5 | Page speed insights | Measure page speed under variables such as network speed, geographical location, and device type | | Pending
Capture screenshots for a given URL list using the save_screenshot function.
Current request:
```python
from scrapy_playwright.page import PageMethod
import advertools as adv

url_list = ["https://www.wikipedia.org", "https://quotes.toscrape.com"]
output_dir = "/content/advertools/output"

meta = {
    "playwright": True,
    "playwright_page_methods": [
        PageMethod("screenshot", path=output_dir, full_page=True,
                   type="jpeg", quality=80),
    ],
}

custom_settings = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    "PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT": 100_000,  # milliseconds, as an int rather than a string
    "PLAYWRIGHT_BROWSER_TYPE": "chromium",
    "PLAYWRIGHT_LAUNCH_OPTIONS": {
        "headless": True,
        "timeout": 20 * 1000,  # 20 seconds
    },
}

adv.save_screenshot(
    url_list=url_list,
    output_file=f"{output_dir}/output.jl",
    meta=meta,
    custom_settings=custom_settings,
)
```

Should we keep it as is, or move `meta` and `custom_settings` into the `save_screenshot` function itself to simplify user input, so that users only pass `url_list` and an output directory? Like this:
```python
import advertools as adv

url_list = ["https://www.wikipedia.org", "https://quotes.toscrape.com"]
output_dir = "/content/advertools/output"

adv.save_screenshot(url_list=url_list, output_file=output_dir)
```

Please suggest.
Feature Proposal: Support for Scraping Dynamic Websites using Playwright
Problem
Presently, advertools does a great job of scraping static websites. However, there is a clear need to add support for scraping dynamic websites that load or modify content with JavaScript.
Proposed Solution
To address this issue, I propose integrating Playwright into advertools. Playwright is a robust, feature-rich, and highly capable library for browser automation. It supports multiple browsers (Chromium, Firefox, and WebKit) and provides a high-level API to control headless (or full) browsers.
Playwright is also a library preferred by the Scrapy project itself. Read more here.
Using Playwright would enable advertools to load dynamic content by executing JavaScript, waiting for specific events, or even user-like interactions before scraping the page. This would vastly extend the reach of advertools and enable it to scrape more complex and modern websites.
Details of the Solution
For this, a new module, plw_spider.py, would be added by cloning the existing spider.py. It would therefore retain all of the existing crawling functionality, plus the Playwright-supported features.
By doing so, the feature stays entirely isolated and remains an optional choice for users. The dependencies for plw_spider would be kept out of the main package and installed manually.
Browser Support: Leverage Playwright's ability to control multiple browsers, thereby offering users a choice of scraping engine.
Dynamic Content Handling: Implement functionality to interact with and scrape dynamically loaded content. This might include executing JavaScript, waiting for AJAX requests to complete, handling pop-ups, or clicking buttons.
Refer to the upstream docs for the Page class to see available methods.
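Concretely, the interaction steps above could be expressed as `PageMethod` entries that plw_spider forwards to Playwright before the page is scraped. The method names mirror Playwright's Page API; the selectors are illustrative, and the stub fallback only exists so the sketch runs where scrapy-playwright is not installed.

```python
try:
    from scrapy_playwright.page import PageMethod
except ImportError:
    # Illustrative stand-in so the sketch is self-contained without the
    # optional dependency; the real PageMethod comes from scrapy-playwright.
    from collections import namedtuple
    PageMethod = namedtuple("PageMethod", ["method", "args"])

meta = {
    "playwright": True,
    "playwright_page_methods": [
        # Wait until JS has rendered the target content.
        PageMethod("wait_for_selector", "div.quote"),
        # Scroll to the bottom to trigger lazy-loaded content.
        PageMethod("evaluate", "window.scrollTo(0, document.body.scrollHeight)"),
        # Simulate a click, e.g. on a "next page" link.
        PageMethod("click", "li.next a"),
        # Let pending AJAX requests settle before scraping.
        PageMethod("wait_for_load_state", "networkidle"),
    ],
}
```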
See it in action in the Google Colab here.
Benefits
By adding support for dynamic website scraping via Playwright, advertools will become a more versatile tool, able to handle a wider range of websites and use cases. This would potentially attract more users to advertools and make it a stronger competitor in the web scraping tool market.
I look forward to further thoughts on this proposal and am ready to commence work on this feature as soon as we get the go-ahead.
Thank you.