langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

BSHTMLLoader not working for urls #8795

Closed iiitmahesh closed 1 year ago

iiitmahesh commented 1 year ago

System Info

BSHTMLLoader does not work for URLs.

from langchain.document_loaders import BSHTMLLoader

url = "https://www.google.com"

loader = BSHTMLLoader({"url": url})
doc = loader.load()

I also tried this one, and it does not work either:

from langchain.document_loaders import BSHTMLLoader

url = "https://www.google.com"

loader = BSHTMLLoader({"url": url})
doc = loader.load()

Reproduction

from langchain.document_loaders import BSHTMLLoader

url = "https://www.example.com"

loader = BSHTMLLoader(url)
doc = loader.load()

Expected behavior

It should return the HTML data for the URL. I believe BSHTMLLoader doesn't work with URLs and only works with files (.html files).

dosubot[bot] commented 1 year ago

Answer generated by a 🤖

Answer

Hello,

Based on the information you've provided, it seems like you're trying to use the BSHTMLLoader to load URLs, which is not its intended use. The BSHTMLLoader is designed to load HTML documents from a file path, not URLs. Here's the relevant code from the BSHTMLLoader class:

class BSHTMLLoader(BaseLoader):
    """Loader that uses beautiful soup to parse HTML files."""

    def __init__(
        self,
        file_path: str,
        open_encoding: Union[str, None] = None,
        bs_kwargs: Union[dict, None] = None,
        get_text_separator: str = "",
    ) -> None:
        """Initialise with path, and optionally, file encoding to use, and any kwargs
        to pass to the BeautifulSoup object.
        """
        self.file_path = file_path
        self.open_encoding = open_encoding
        if bs_kwargs is None:
            bs_kwargs = {"features": "lxml"}
        self.bs_kwargs = bs_kwargs
        self.get_text_separator = get_text_separator

    def load(self) -> List[Document]:
        """Load HTML document into document objects."""
        from bs4 import BeautifulSoup

        with open(self.file_path, "r", encoding=self.open_encoding) as f:
            soup = BeautifulSoup(f, **self.bs_kwargs)

        text = soup.get_text(self.get_text_separator)

        if soup.title:
            title = str(soup.title.string)
        else:
            title = ""

        metadata: Dict[str, Union[str, None]] = {
            "source": self.file_path,
            "title": title,
        }
        return [Document(page_content=text, metadata=metadata)]

As you can see, the BSHTMLLoader takes a file path as an argument, not a URL. It then opens the file, parses it with BeautifulSoup, and extracts the text content and title.
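To see why the snippets in the report fail, here is a minimal sketch that reproduces only the `open(self.file_path, ...)` step from `load()` with the two argument shapes used above. The helper name `try_open` is hypothetical and not part of LangChain, and the exact exception for the URL string can vary by operating system.

```python
def try_open(path):
    """Return the name of the exception raised by open(path), or None on success."""
    try:
        with open(path, "r"):
            return None
    except (OSError, TypeError) as exc:
        return type(exc).__name__


# A URL string is treated as a local path that does not exist, so open()
# raises an OSError (on POSIX systems, typically its subclass FileNotFoundError).
print(try_open("https://www.example.com"))

# A dict is not a valid path type at all, so open() raises TypeError.
print(try_open({"url": "https://www.example.com"}))
```

In both cases the failure happens before BeautifulSoup is ever called, which is consistent with the loader expecting a local file path.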

If you want to load URLs, you might want to use a different loader. For example, the UnstructuredURLLoader class is designed to load data from URLs. However, it uses the unstructured library to partition the content of the URLs, not BeautifulSoup.

There was a similar issue reported in the LangChain repository: UnstructuredURLLoader can't load data from url. The solution was to install libmagic. Depending on your operating system, you might need to install additional packages.

I hope this helps! If you have any other questions, feel free to ask.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

iiitmahesh commented 1 year ago

I am currently developing a custom load function as an alternative to BSHTMLLoader for handling URLs within my local codebase.
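For anyone landing here, one possible shape for such a function is sketched below. It is not the code from this thread and not a LangChain API: the names `html_to_record` and `load_url` are hypothetical, it uses only the standard library (`urllib` and `html.parser`) rather than BeautifulSoup, and it returns a plain dict that mirrors the `page_content`, `source`, and `title` fields produced by `BSHTMLLoader`. Unlike `soup.get_text()`, this sketch also skips `<script>` and `<style>` contents, which is an assumption about what is wanted.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _TextAndTitleParser(HTMLParser):
    """Collect text nodes and the <title> contents from an HTML string."""

    _SKIPPED = {"script", "style"}  # assumption: their contents are not wanted

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.title_chunks = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self._SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_chunks.append(data)
        # Keep non-empty text that is not inside <script> or <style>.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_record(html, source, separator=" "):
    """Parse an HTML string into a dict shaped like BSHTMLLoader's output."""
    parser = _TextAndTitleParser()
    parser.feed(html)
    parser.close()
    return {
        "page_content": separator.join(parser.chunks),
        "metadata": {
            "source": source,
            "title": "".join(parser.title_chunks).strip(),
        },
    }


def load_url(url, timeout=10.0):
    """Fetch a URL and parse the response body (hypothetical helper)."""
    # Some sites reject the default urllib user agent, so send a generic one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return html_to_record(html, source=url)
```

Calling `load_url("https://www.example.com")` needs network access. The parsing half can be exercised offline by passing any HTML string to `html_to_record`, and the resulting dict can be wrapped in a LangChain `Document` if that type is needed downstream.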