Closed iiitmahesh closed 1 year ago
Answer generated by a 🤖
Hello,
Based on the information you've provided, it seems like you're trying to use the `BSHTMLLoader` to load URLs, which is not its intended use. The `BSHTMLLoader` is designed to load HTML documents from a file path, not URLs. Here's the relevant code from the `BSHTMLLoader` class:
```python
from typing import Dict, List, Union

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


class BSHTMLLoader(BaseLoader):
    """Loader that uses beautiful soup to parse HTML files."""

    def __init__(
        self,
        file_path: str,
        open_encoding: Union[str, None] = None,
        bs_kwargs: Union[dict, None] = None,
        get_text_separator: str = "",
    ) -> None:
        """Initialise with path, and optionally, file encoding to use, and any kwargs
        to pass to the BeautifulSoup object.
        """
        self.file_path = file_path
        self.open_encoding = open_encoding
        if bs_kwargs is None:
            bs_kwargs = {"features": "lxml"}
        self.bs_kwargs = bs_kwargs
        self.get_text_separator = get_text_separator

    def load(self) -> List[Document]:
        """Load HTML document into document objects."""
        from bs4 import BeautifulSoup

        # file_path is passed straight to open(), so it must be a local path.
        with open(self.file_path, "r", encoding=self.open_encoding) as f:
            soup = BeautifulSoup(f, **self.bs_kwargs)

        text = soup.get_text(self.get_text_separator)

        if soup.title:
            title = str(soup.title.string)
        else:
            title = ""

        metadata: Dict[str, Union[str, None]] = {
            "source": self.file_path,
            "title": title,
        }
        return [Document(page_content=text, metadata=metadata)]
```
As you can see, the `BSHTMLLoader` takes a file path as an argument, not a URL. It then opens the file, parses it with BeautifulSoup, and extracts the text content and title.
If you want to load URLs, you might want to use a different loader. For example, the `UnstructuredURLLoader` class is designed to load data from URLs. However, it uses the `unstructured` library to partition the content of the URLs, not BeautifulSoup.
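As a rough illustration of that suggestion, a minimal sketch of calling `UnstructuredURLLoader` might look like the following. The helper name `load_urls` and the example URL are assumptions made for illustration, and the import path is the one used by LangChain releases from around the time of this issue; newer releases may expose the loader from a different package.

```python
from typing import List


def load_urls(urls: List[str]):
    """Load each URL into LangChain Document objects (hypothetical helper)."""
    # Imported lazily so this module can be imported even when langchain
    # and unstructured are not installed.
    from langchain.document_loaders import UnstructuredURLLoader

    loader = UnstructuredURLLoader(urls=urls)
    return loader.load()


# Example call (placeholder URL, requires network access):
# docs = load_urls(["https://example.com"])
# print(docs[0].metadata["source"])
```

This needs both `langchain` and `unstructured` installed, and the text it returns is produced by `unstructured`'s partitioning rather than by BeautifulSoup's `get_text`.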
There was a similar issue reported in the LangChain repository: "UnstructuredURLLoader can't load data from url". The solution was to install `libmagic`. Depending on your operating system, you might need to install additional packages.
I hope this helps! If you have any other questions, feel free to ask.
This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.
I am currently developing a custom load function in my local codebase as an alternative to BSHTMLLoader for handling URLs.
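One possible shape for such a function is sketched below; this is not the code from this comment, only an illustration. To stay free of extra dependencies it swaps in the standard library's `html.parser` for BeautifulSoup and returns a plain dict with the same `page_content` and `metadata` keys that `BSHTMLLoader` puts on a `Document`. The names `html_to_record` and `load_url` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextAndTitle(HTMLParser):
    """Collect text chunks and the <title> string from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        # Keep non-empty text that is not inside <script> or <style>.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_record(html: str, source: str, separator: str = " ") -> dict:
    """Turn an HTML string into a dict shaped like a loader Document."""
    parser = _TextAndTitle()
    parser.feed(html)
    parser.close()
    return {
        "page_content": separator.join(parser.chunks),
        "metadata": {"source": source, "title": parser.title},
    }


def load_url(url: str, timeout: float = 10.0) -> dict:
    """Fetch a URL and parse the response body (assumes an HTML response)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return html_to_record(html, source=url)
```

Keeping the fetch (`load_url`) separate from the parsing (`html_to_record`) means the parsing step can be exercised on an HTML string without network access. In a LangChain pipeline the returned dict would presumably be wrapped in a `Document`.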
System Info
BSHTMLLoader is not working for URLs.
I also tried this one, and it is not working either.
Who can help?
No response
Information
Related Components
Reproduction
Expected behavior
It should return the HTML data of the URL. I believe BSHTMLLoader does not work with URLs and only works with files (.html files).