ska-ibees / advertools

advertools - online marketing productivity and analysis tools
https://advertools.readthedocs.io
MIT License

Extending advertools to crawl dynamic websites #1

Open ska-ibees opened 1 year ago

ska-ibees commented 1 year ago

Feature Proposal: Support for Scraping Dynamic Websites using Playwright

Problem

Presently, advertools does a great job of scraping static websites. However, it lacks support for dynamic websites, i.e. sites that load or modify their content with JavaScript, which is a large and growing share of the modern web.

Proposed Solution

To address this issue, I propose integrating Playwright into advertools. Playwright is a robust, feature-rich, and highly capable library for browser automation. It supports multiple browsers (Chromium, Firefox, and WebKit) and provides a high-level API to control headless (or full) browsers.

Playwright is also the headless-browser integration recommended by the Scrapy project itself. Read more here.

Using Playwright would enable advertools to load dynamic content by executing JavaScript, waiting for specific events, or even performing user-like interactions before scraping the page. This would vastly extend the reach of advertools and enable it to scrape more complex and modern websites.
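For context, here is a minimal sketch of what opting into browser rendering looks like at the Scrapy level. The setting keys and the `"playwright"` meta flag are scrapy-playwright's actual interface; the names `PLAYWRIGHT_SETTINGS` and `playwright_request_meta` are illustrative, not existing advertools code:

```python
# Sketch: the two pieces scrapy-playwright needs. Routing downloads through
# its download handler (settings), and flagging each request that should be
# rendered in a real browser (request meta).

PLAYWRIGHT_SETTINGS = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    # scrapy-playwright requires Twisted's asyncio reactor.
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}


def playwright_request_meta():
    """Per-request meta that tells the handler to render this page in a browser."""
    return {"playwright": True}
```

A Playwright-aware spider would merge these settings into its crawl settings and attach the meta to each request it wants rendered; requests without the flag fall through to Scrapy's normal (static) downloader.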

Details of the Solution

  1. Integration: Integrate Playwright (via scrapy-playwright) into the existing advertools infrastructure, keeping it an optional dependency so as not to burden users who don't need this feature.

To do this, a new module, plw_spider.py, is created by cloning the existing spider.py, so it retains all of the existing crawling functionality and adds the Playwright-supported features on top.

This keeps the new features entirely isolated, making them an opt-in choice for users. The dependencies of plw_spider will be kept out of the main package and installed separately.

  2. Browser Support: Leverage Playwright's ability to control multiple browsers (Chromium, Firefox, and WebKit), offering users a choice of browser engine.

  3. Dynamic Content Handling: Implement functionality to interact with and scrape dynamically loaded content. This might include executing JavaScript, waiting for AJAX requests to complete, handling pop-ups, or clicking buttons.

Refer to the upstream docs for the Page class to see available methods.

  4. API Design: Ensure the API for interacting with Playwright aligns with the existing advertools API to maintain consistency for users.
  5. Example: The "screenshot" Page method is pre-processed to generate a unique filename for each URL's screenshot. This is done by updating the request meta before initiating the crawl request. See the code here.
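As a concrete illustration of the screenshot-filename pre-processing described above, a unique, filesystem-safe filename can be derived from each URL before its crawl request is issued. This is a sketch only; the helper name `screenshot_path` is hypothetical, not the actual advertools implementation:

```python
from urllib.parse import urlsplit


def screenshot_path(url, output_dir, ext="jpeg"):
    """Derive a filesystem-safe screenshot filename from a URL.

    Simplification: query strings and fragments are ignored, so two URLs
    differing only in their query string would collide.
    """
    parts = urlsplit(url)
    # Join host and path, then flatten path separators into underscores.
    slug = (parts.netloc + parts.path).strip("/").replace("/", "_") or "index"
    return f"{output_dir}/{slug}.{ext}"
```

The crawler would call something like this once per URL and place the result in that request's screenshot entry in `playwright_page_methods`, so each page gets its own file instead of overwriting a shared path.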

See it in action in the Google Colab here.

Benefits

By adding support for dynamic website scraping via Playwright, advertools will become a more versatile tool, able to handle a wider range of websites and use cases. This would potentially attract more users to advertools and make it a stronger competitor in the web scraping tool market.

I look forward to further thoughts on this proposal and am ready to commence work on this feature as soon as we get the go-ahead.

Thank you.

ska-ibees commented 1 year ago

Todo:

| S.no | Feature name | Description | Colab Link | Status |
|------|--------------|-------------|------------|--------|
| 1 | Full JS Crawling | JS rendering for content that is dynamically rendered, using the `js_crawl` function | Link | Done |
| 2 | Save screenshots | Capture screenshots for a given URL list, using the `save_screenshot` function | Link | Done |
| 4 | Website Interactions for Crawling | Simulate interactions like clicks, scrolls, or form submissions | - | Pending |
| 5 | Page Speed Insights | Using variables like network speed, geographical location, device type, etc. | - | Pending |
ska-ibees commented 1 year ago
  1. Save screenshots

Capture screenshots for a given URL list using the save_screenshot function.

Colab Link

Current request:

```python
from scrapy_playwright.page import PageMethod
import advertools as adv

url_list = ["https://www.wikipedia.org", "https://quotes.toscrape.com"]

output_dir = "/content/advertools/output"

meta = {
    "playwright": True,
    "playwright_page_methods": [
        PageMethod("screenshot", path=output_dir, full_page=True,
                   type="jpeg", quality=80),
    ],
}

custom_settings = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    "PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT": "100000",
    "PLAYWRIGHT_BROWSER_TYPE": "chromium",
    "PLAYWRIGHT_LAUNCH_OPTIONS": {
        "headless": True,
        "timeout": 20 * 1000,  # 20 seconds
    },
}

adv.save_screenshot(
    url_list=url_list,
    output_file=f"{output_dir}/output.jl",
    meta=meta,
    custom_settings=custom_settings,
)
```

Should we keep it as is, or move meta and custom_settings into the save_screenshot function itself, so that users only need to pass url_list and an output directory? Like this:
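One way to support both calling styles is to make meta and custom_settings optional parameters that fall back to defaults derived from the output directory. The sketch below is a hypothetical helper, not existing advertools code, and it returns the config pair instead of crawling, just to show the default-merging idea:

```python
def build_screenshot_config(url_list, output_dir, meta=None, custom_settings=None):
    """Return the (meta, custom_settings) pair a save_screenshot call would use.

    Callers who pass nothing get working defaults; power users can still
    supply either dict and have their values take precedence.
    """
    default_meta = {
        "playwright": True,
        "playwright_page_methods": [
            # Placeholder: a real implementation would substitute a unique
            # per-URL filename (derived from each item in url_list) here.
            ("screenshot", {"path": output_dir, "full_page": True,
                            "type": "jpeg", "quality": 80}),
        ],
    }
    default_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "PLAYWRIGHT_BROWSER_TYPE": "chromium",
        "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": True, "timeout": 20 * 1000},
    }
    # User-supplied settings override the defaults key by key.
    default_settings.update(custom_settings or {})
    return (meta or default_meta), default_settings
```

This keeps the two-argument call possible for the common case while leaving the fully explicit call from the first snippet available for advanced users.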

```python
import advertools as adv

url_list = ["https://www.wikipedia.org", "https://quotes.toscrape.com"]

output_dir = "/content/advertools/output"

adv.save_screenshot(
    url_list=url_list,
    output_file=output_dir,
)
```

Please suggest.