Closed beethogedeon closed 1 month ago
Hey @beethogedeon, can you provide the URL where you are facing the problem?
For the URL you currently gave (https://example.com/), the problem lies in the extractor. You have used a very basic extractor; the code can be changed to:
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader
from bs4 import BeautifulSoup as Soup

def text_extractor(r_text: str) -> str:
    # Parse the raw HTML and collapse every run of whitespace into a single space.
    soup = Soup(r_text, "html.parser")
    return " ".join(soup.text.split())

url = "https://www.example.com/"
loader = RecursiveUrlLoader(
    url=url,
    max_depth=2,
    extractor=text_extractor,
)
docs = loader.load()
PS: This only solves the problem of extra whitespace.
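In case bs4 is not installed, the same whitespace-collapsing extraction can be sketched with the standard library's html.parser instead of BeautifulSoup (a minimal sketch, not the loader's own mechanism; it also skips script/style contents, which plain soup.text would include):

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._chunks: list[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace, mirroring " ".join(soup.text.split()).
        return " ".join(" ".join(self._chunks).split())


def extract_text(html: str) -> str:
    parser = TextCollector()
    parser.feed(html)
    return parser.text()
```

This `extract_text` could then be passed as the `extractor=` argument in place of `text_extractor`; for example, `extract_text("<p>Hello\n\n  world</p>")` returns `"Hello world"`.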
Checked other resources
Example Code
Error Message and Stack Trace (if applicable)
No response
Description
I'm trying to use the "Recursive URL" document loader from langchain_community.document_loaders.recursive_url_loader to load all pages under a root URL, but CSS and JS links are also being processed.
System Info
System Information
Package Information