Unable to use WholeSiteReader on the Cloud

VaradDeshmukh97 commented 7 months ago

Question Validation

[X] I have searched both the documentation and discord for an answer.

Question

I am using WholeSiteReader() for loading websites into LlamaIndex Document objects. On my local system it works fine, but when I run my notebook on the Cloud, it says

ValueError: No chrome executable found on PATH

I understand that the driver needs to be given the path to the Chrome Executable, but I am unable to work it out. Any help will be appreciated. Thanks!

dosubot[bot] commented 7 months ago

Hey @VaradDeshmukh97, great to see you around here again! 🚀 Your adventures in the cloud sound like quite the saga. I'm on it - will get back to you with a strategy to conquer this error shortly! 🌩️

VaradDeshmukh97 commented 7 months ago

When I run this,

from llama_index.readers.web import WholeSiteReader
from selenium import webdriver
def load_websites(prefix, base_url, max_depth=1):
    '''
    Recursively searches and loads a website into LlamaIndex Document objects.
    Args:
        prefix = url of the website
        base_url = url of the page from where to start the scraping
    '''
    driver = webdriver.Chrome()
    scraper = WholeSiteReader(
        prefix=prefix,
        max_depth=max_depth,
        driver=driver
    )
    websites = scraper.load_data(
        base_url=base_url
    )
    return websites

I get the following error:

SessionNotCreatedException: Message: session not created: Chrome failed to start: exited normally.
  (session not created: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /teamspace/studios/this_studio/.cache/selenium/chrome/linux64/123.0.6312.122/chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Stacktrace:
#0 0x55b5d0b0f863 <unknown>
#1 0x55b5d08058c6 <unknown>
#2 0x55b5d0839d34 <unknown>
#3 0x55b5d0835d3d <unknown>
#4 0x55b5d087eaed <unknown>
#5 0x55b5d0872343 <unknown>
#6 0x55b5d0843593 <unknown>
#7 0x55b5d0843f5e <unknown>
#8 0x55b5d0ad384b <unknown>
#9 0x55b5d0ad77a5 <unknown>
#10 0x55b5d0ac1571 <unknown>
#11 0x55b5d0ad8332 <unknown>
#12 0x55b5d0aa687f <unknown>
#13 0x55b5d0afe728 <unknown>
#14 0x55b5d0afe8fb <unknown>
#15 0x55b5d0b0e9b4 <unknown>
#16 0x7ff2fb9b8609 start_thread

dosubot[bot] commented 7 months ago

The error you're encountering suggests a compatibility issue between Chrome and ChromeDriver or a problem with locating the Chrome executable in your cloud environment. Here's how to address these issues:

1. **Check Chrome and ChromeDriver Compatibility:** Ensure the versions of Chrome and ChromeDriver you're using are compatible. Mismatches can lead to the SessionNotCreatedException.
2. **Use Remote WebDriver:** In cloud environments, it's common to use a Remote WebDriver setup. This involves configuring your code to use webdriver.Remote() instead of webdriver.Chrome(), specifying the URL of the Selenium Grid or cloud-based testing service and the desired capabilities.
3. **Specify Chrome Binary Location:** If the Chrome binary is in a non-standard location, you may need to explicitly set its path using ChromeOptions. This is done by setting opt.binary_location = '/path/to/chrome' before initializing the WebDriver.
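As a rough illustration of the binary-location point, the sketch below probes PATH for a Chrome or Chromium executable and, if one is found, pins it on ChromeOptions. The helper names `find_chrome_binary` and `build_options` and the list of candidate executable names are assumptions made for this example, not part of LlamaIndex or Selenium; adjust the names to whatever your cloud image actually installs.

```python
import shutil

# Hypothetical list of executable names to probe; adjust for your image.
CANDIDATE_NAMES = (
    "google-chrome",
    "google-chrome-stable",
    "chromium",
    "chromium-browser",
    "chrome",
)


def find_chrome_binary(names=CANDIDATE_NAMES):
    """Return the first Chrome/Chromium executable found on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


def build_options(binary_path=None):
    """Build ChromeOptions, setting binary_location when a path is known."""
    # Imported lazily so the PATH lookup above works without Selenium installed.
    from selenium import webdriver

    opt = webdriver.ChromeOptions()
    if binary_path:
        opt.binary_location = binary_path
    return opt
```

If `find_chrome_binary()` returns None, Chrome is most likely not installed in the environment (or lives outside PATH), which would be consistent with the "No chrome executable found on PATH" error from the original question.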

Here's a snippet for using a Remote WebDriver, which you should adjust based on your environment:

from selenium import webdriver

def setup_driver(self):
    # Replace with your Selenium Grid hub URL or cloud Selenium endpoint
    selenium_grid_url = "http://your.selenium.grid.url:4444/wd/hub"

    opt = webdriver.ChromeOptions()
    # Specify binary location if necessary
    # opt.binary_location = '/path/to/chrome'

    # Pass the options object directly; recent Selenium 4 releases
    # no longer accept the desired_capabilities argument
    return webdriver.Remote(command_executor=selenium_grid_url, options=opt)

Replace "http://your.selenium.grid.url:4444/wd/hub" with your Selenium Grid hub URL or the endpoint of your cloud-based Selenium service. This setup requires the remote environment to have the correct Chrome and ChromeDriver versions installed. If you're still facing issues, please share more details about your cloud setup, including the versions of Chrome and ChromeDriver and the location of the Chrome executable.
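To gather the version details asked for above, something like the following could help. It is only a sketch: the function names are made up for this example, and it assumes the executables respond to `--version` with text such as "Google Chrome 123.0.6312.122", which is the usual behavior for Chrome and ChromeDriver but worth verifying in your environment.

```python
import re
import shutil
import subprocess


def major_version(version_output):
    """Extract the major version from text like 'Google Chrome 123.0.6312.122'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    return int(match.group(1)) if match else None


def tool_version(executable):
    """Run '<executable> --version' and return its output, or None if not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip()


def versions_match(chrome_output, driver_output):
    """True when both outputs parse and share the same major version."""
    chrome_major = major_version(chrome_output)
    driver_major = major_version(driver_output)
    return chrome_major is not None and chrome_major == driver_major
```

For example, `tool_version("google-chrome")` and `tool_version("chromedriver")` (both executable names are assumptions) can be fed into `versions_match`. Note that when Selenium downloads the browser itself, as the `.cache/selenium/chrome/...` path in the traceback suggests, the binaries may not be on PATH at all, so you may need to pass the full path instead.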

Sources

llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/whole_site/base.py


suryaanshrai commented 4 months ago

Hey @VaradDeshmukh97, did you find any solution or workaround for this issue? I am encountering the same problem and would really appreciate your help!

3sakshij commented 1 month ago

Hi @suryaanshrai, I got the same errors, and this solution works for me. You can edit the setup_driver function and add these options:

def setup_driver(self):
    """
    Sets up the Selenium WebDriver for Chrome.

    Returns:
        WebDriver: An instance of Chrome WebDriver.
    """
    try:
        import chromedriver_autoinstaller
    except ImportError:
        raise ImportError("Please install chromedriver_autoinstaller")

    opt = webdriver.ChromeOptions()
    opt.add_argument("--start-maximized")
    opt.add_argument('--headless')
    opt.add_argument('--no-sandbox')
    opt.add_argument('--disable-dev-shm-usage')
    chromedriver_autoinstaller.install()
    return webdriver.Chrome(options=opt)
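
If you would rather not edit the library source, the same flags can probably be applied from your own code, since the original post already passes a `driver` argument to WholeSiteReader. The sketch below is an untested adaptation of that idea: `HEADLESS_FLAGS` and `make_headless_driver` are names invented for this example, and it relies on Selenium's own driver resolution rather than chromedriver_autoinstaller.

```python
# Flags copied from the comment above.
HEADLESS_FLAGS = [
    "--start-maximized",
    "--headless",
    "--no-sandbox",
    "--disable-dev-shm-usage",
]


def make_headless_driver(flags=HEADLESS_FLAGS):
    """Create a Chrome WebDriver configured with the given command-line flags."""
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver

    opt = webdriver.ChromeOptions()
    for flag in flags:
        opt.add_argument(flag)
    return webdriver.Chrome(options=opt)


def load_websites(prefix, base_url, max_depth=1):
    """Load a website into LlamaIndex Document objects using a headless driver."""
    from llama_index.readers.web import WholeSiteReader

    driver = make_headless_driver()
    scraper = WholeSiteReader(prefix=prefix, max_depth=max_depth, driver=driver)
    return scraper.load_data(base_url=base_url)
```

The `--headless` and `--no-sandbox` flags are the ones most often needed on cloud machines with no display, which matches the "DevToolsActivePort file doesn't exist" failure reported earlier in the thread.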
