langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Make an option in WebBaseLoader to handle dynamic content that is loaded via JavaScript. #4838

Closed shamspias closed 1 year ago

shamspias commented 1 year ago

Feature request

When you request a webpage using a library like requests or aiohttp, you're getting the initial HTML of the page, but any content that's loaded via JavaScript after the page loads will not be included. That's why you might see template tags like (item.price)}} taka instead of the actual values. Those tags are placeholders that get filled in with actual data by JavaScript after the page loads.
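One way to confirm that a page depends on client-side rendering is to scan the fetched HTML for template placeholders that were never filled in. The snippet below is a minimal sketch, not part of WebBaseLoader: the helper name `find_unrendered_placeholders`, the sample HTML strings, and the `{{ ... }}` delimiter style are all assumptions made for illustration, and other frameworks use different delimiters.

```python
import re

# Matches Mustache/Vue/Angular-style placeholders such as "{{ item.price }}".
# The "{{ ... }}" delimiter style is an assumption; adjust it for other frameworks.
PLACEHOLDER_RE = re.compile(r"\{\{\s*(.+?)\s*\}\}")


def find_unrendered_placeholders(html: str) -> list[str]:
    """Return the expressions inside any `{{ ... }}` placeholders left in the HTML."""
    return PLACEHOLDER_RE.findall(html)


# Hypothetical sample markup: what a plain HTTP client might receive,
# versus what a browser shows after JavaScript has run.
static_html = '<span class="price">{{ item.price }} taka</span>'
rendered_html = '<span class="price">1200 taka</span>'

print(find_unrendered_placeholders(static_html))    # ['item.price']
print(find_unrendered_placeholders(rendered_html))  # []
```

If the first call returns a non-empty list for a page fetched with requests or aiohttp, the page most likely needs a JavaScript-capable loader.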

To handle this, you need a library that can execute JavaScript. A commonly used one is Selenium, but it is heavier than requests or aiohttp because it requires running an actual web browser. Is there another option that doesn't need a full web browser, or that can be used in LangChain without a graphical interface, such as a headless-browser tool like pyppeteer (a Python wrapper for Puppeteer)?

Anyway, please solve this issue and add features like this. Thanks in advance.

Motivation

To get dynamic content from a webpage while scraping text from a website or webpage.

Your contribution

On my side, I rewrote the _fetch method of your WebBaseLoader class to use pyppeteer instead of aiohttp. It still doesn't work, but I think it might help a little. Here is my code, where I override the class:

import asyncio
import logging

import pyppeteer
from langchain.document_loaders import WebBaseLoader as BaseWebBaseLoader

# The original snippet used `logger` without defining it.
logger = logging.getLogger(__name__)


class WebBaseLoader(BaseWebBaseLoader):
    """WebBaseLoader variant that fetches pages through headless Chromium (pyppeteer)."""

    async def _fetch(
        self, url: str, selector: str = "body", retries: int = 3, cooldown: int = 2, backoff: float = 1.5
    ) -> str:
        for i in range(retries):
            browser = None
            try:
                browser = await pyppeteer.launch()
                page = await browser.newPage()
                await page.goto(url)
                await page.waitForSelector(selector)  # wait for a specific element to be loaded
                await asyncio.sleep(5)  # wait 5 more seconds before reading the content
                return await page.content()  # full HTML, including dynamically loaded content
            except Exception as e:
                if i == retries - 1:
                    raise
                logger.warning(
                    f"Error fetching {url} with attempt "
                    f"{i + 1}/{retries}: {e}. Retrying..."
                )
                await asyncio.sleep(cooldown * backoff ** i)
            finally:
                # Always close the browser so failed attempts do not leak Chromium processes.
                if browser is not None:
                    await browser.close()
        raise ValueError("retry count exceeded")

Also run these two commands to install pyppeteer and download its Chromium build:

pip install pyppeteer
pyppeteer-install
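
The retry loop in the `_fetch` override above sleeps for `cooldown * backoff ** i` seconds after failed attempt `i`, and re-raises on the last attempt instead of sleeping. As a small worked sketch of that schedule (the helper name `backoff_delays` is hypothetical and exists only for this illustration):

```python
def backoff_delays(retries: int = 3, cooldown: float = 2, backoff: float = 1.5) -> list[float]:
    """Seconds slept after each failed attempt; the final attempt re-raises instead of sleeping."""
    return [cooldown * backoff ** i for i in range(retries - 1)]


print(backoff_delays())           # [2.0, 3.0]
print(backoff_delays(retries=5))  # [2.0, 3.0, 4.5, 6.75]
```

With the defaults, a URL that keeps failing costs about 5 seconds of cooldown on top of the fixed 5-second sleep and the page-load time of each attempt.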

sommohapatra commented 1 year ago

Does SeleniumURLLoader work for you? From my experimentation, it seems to use a headless version to load a webpage.

Ex:

from langchain.document_loaders import SeleniumURLLoader
from langchain.text_splitter import NLTKTextSplitter

def __load_url(url_strings):
    loader = SeleniumURLLoader(urls=url_strings)
    pages = loader.load()
    text_splitter = NLTKTextSplitter(chunk_size=500, chunk_overlap=100)
    docs = text_splitter.split_documents(pages)
    return docs

shamspias commented 1 year ago

It doesn't. I tried it, and I also tried a customized version, but it's still not working :/

dosubot[bot] commented 1 year ago

Hi, @shamspias! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you requested a feature to add an option in WebBaseLoader to handle dynamic content loaded via JavaScript when scraping webpages. You provided a code snippet using pyppeteer as an alternative to Selenium, but it seems that it is not working for you. Sommohapatra suggested using SeleniumURLLoader as a possible solution, but you confirmed that it is not working either.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!