stanford-oval / storm

An LLM-powered knowledge curation system that researches a topic and generates a full-length report with citations.
http://storm.genie.stanford.edu
MIT License
10.12k stars 960 forks

[feature request] Selenium integration #95

Closed Tom-Neverwinter closed 2 weeks ago

Tom-Neverwinter commented 1 month ago

Summary

Implement a feature that allows the system to perform automated web searches and scrape relevant content using Selenium and Scrapy.

This is a quick-and-dirty sketch of the idea; it is in no way final or working.

Let's be honest: nobody wants to pay for YouRM or BingSearch when we can do this ourselves.

Motivation

This feature will enable the STORM project to search the web for relevant articles, gather information, and scrape the necessary content without relying on external APIs. This can be useful for building a local knowledge base, gathering data for analysis, or enhancing the functionality of the virtual assistant.

Proposed Implementation

Dependencies Installation:

Install Scrapy and Selenium:

pip install scrapy selenium

WebDriver Setup:

Download the WebDriver for your browser (e.g., ChromeDriver for Chrome) and ensure it is accessible from your system path.

Scrapy Project Setup:

Create a new Scrapy project:

scrapy startproject search_scraper
cd search_scraper
scrapy genspider search_spider example.com
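
To confirm the WebDriver setup above works before wiring the pieces together, a quick smoke test can help. This is only a sketch, assuming Selenium 4.6 or newer (which can locate a matching driver automatically via Selenium Manager when no path is given):

from selenium import webdriver

# Selenium 4.6+ resolves the ChromeDriver binary automatically if none is configured
driver = webdriver.Chrome()
driver.get('https://example.com')
print(driver.title)  # should print "Example Domain" if the setup works
driver.quit()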

Selenium Script for Web Search:

Create a Python script (search.py) to perform web searches and save URLs:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

def search(query):
    # Selenium 4 removed executable_path; the driver path is passed via Service
    driver = webdriver.Chrome(service=Service('path/to/chromedriver'))
    driver.get('https://www.google.com')

    # Selenium 4 removed the find_element_by_* helpers; use the By locator API
    search_box = driver.find_element(By.NAME, 'q')
    search_box.send_keys(query)
    search_box.send_keys(Keys.RETURN)

    time.sleep(2)  # crude wait for the results page to render

    links = driver.find_elements(By.CSS_SELECTOR, 'a')
    urls = [link.get_attribute('href') for link in links
            if link.get_attribute('href') and link.get_attribute('href').startswith('http')]

    driver.quit()
    return urls

if __name__ == '__main__':
    query = 'best practices for optimizing solar panel efficiency'
    urls = search(query)
    with open('urls.txt', 'w') as f:
        for url in urls:
            f.write(f"{url}\n")

Scrapy Spider for Content Scraping:

Modify search_scraper/spiders/search_spider.py to scrape content from the URLs:

import scrapy

class SearchSpider(scrapy.Spider):
    name = 'search_spider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Seed the crawl with the URLs collected by search.py
        with open('urls.txt') as f:
            self.start_urls = [url.strip() for url in f if url.strip()]

    def parse(self, response):
        # Emit one item per page: its URL, <title> text, and raw body HTML
        title = response.css('title::text').get()
        content = response.css('body').get()
        yield {
            'url': response.url,
            'title': title,
            'content': content
        }
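
Note that response.css('body').get() returns the raw HTML of the page body, markup included. If plain text is preferable for a downstream knowledge base, a hypothetical variant of the spider could collect paragraph text instead (a sketch, not part of the original proposal):

import scrapy

class SearchTextSpider(scrapy.Spider):
    # Hypothetical variant of SearchSpider that stores visible text instead of raw HTML
    name = 'search_text_spider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        with open('urls.txt') as f:
            self.start_urls = [url.strip() for url in f if url.strip()]

    def parse(self, response):
        paragraphs = response.css('p::text').getall()
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
            'content': ' '.join(p.strip() for p in paragraphs if p.strip()),
        }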

Running the Scrapy Spider:

Execute the spider to scrape the content and save it to a file:

scrapy crawl search_spider -o results.json
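
Once the crawl finishes, the JSON feed can be loaded back for downstream processing. A minimal sketch (the field names match the spider above):

import json

with open('results.json') as f:
    pages = json.load(f)  # scrapy's -o results.json writes a JSON array of the yielded items

for page in pages:
    print(page['url'], '-', page['title'])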

Benefits

- Automation: Automatically perform web searches and scrape content without manual intervention.
- Cost-Efficient: Avoid the costs associated with using external APIs.
- Customization: Tailor the search and scraping process to specific needs and preferences.

Potential Use Cases

- Building a local knowledge base.
- Gathering data for research and analysis.
- Enhancing the functionality of the virtual assistant.

Yucheng-Jiang commented 1 month ago

Hi, we appreciate your suggestion here. It's a handy option for individual developers to run STORM at minimal cost.

However, we do want to stress that we should respect crawling/scraping policies, and most of the time that is non-trivial. Querying a search engine extensively without going through its API would be problematic and could potentially lead to serious trouble.
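
For the Scrapy side of the proposal, a few settings can at least keep the crawl polite toward the sites being fetched. A minimal sketch of possible additions to search_scraper/settings.py (the values are illustrative, and this does not by itself address search-engine terms of service):

# Hypothetical additions to search_scraper/settings.py
ROBOTSTXT_OBEY = True               # honor robots.txt where the site publishes one
DOWNLOAD_DELAY = 2.0                # pause between requests to the same domain
AUTOTHROTTLE_ENABLED = True         # back off automatically when responses slow down
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # avoid hammering any single host
USER_AGENT = 'search_scraper (+contact: you@example.com)'  # identify the crawler; placeholder contact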

There are many alternative and more affordable search engine API options. We are happy to review and merge efforts along this line.