langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

"Recursive URL" Document loader load useless documents #21204

Closed beethogedeon closed 1 month ago

beethogedeon commented 5 months ago


Example Code

from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader
from bs4 import BeautifulSoup as Soup

url = "https://www.example.com/"
loader = RecursiveUrlLoader(
    url=url, max_depth=2, extractor=lambda x: Soup(x, "html.parser").text
)
docs = loader.load()

Error Message and Stack Trace (if applicable)

No response

Description

I'm trying to use the "Recursive URL" document loader from langchain_community.document_loaders.recursive_url_loader to load all URLs under a root directory, but links to CSS or JS files are also processed.
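One possible workaround (a sketch, not part of the original report) is to skip links to static assets before they are loaded. The helper below, `is_crawlable`, and its `ASSET_EXTENSIONS` list are hypothetical names for illustration; `RecursiveUrlLoader` does not accept such a predicate directly, but a similar effect may be achievable with its `link_regex` parameter, if your version supports it.

```python
import re

# Hypothetical list of extensions that mark a link as a static asset
# rather than an HTML page worth crawling.
ASSET_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif",
                    ".svg", ".ico", ".woff", ".woff2", ".ttf", ".pdf")

def is_crawlable(url: str) -> bool:
    """Return True if the URL does not point at a static asset."""
    # Drop any query string or fragment before checking the extension.
    path = url.split("?", 1)[0].split("#", 1)[0]
    return not path.lower().endswith(ASSET_EXTENSIONS)
```

A predicate like this could be applied to the loaded documents' source URLs, or translated into a regular expression for the loader's link-filtering option.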

System Info

System Information

OS: Linux
OS Version: #1 SMP Tue Dec 19 13:14:11 UTC 2023
Python Version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]

Package Information

langchain_core: 0.1.48
langchain: 0.1.17
langchain_community: 0.0.36
langsmith: 0.1.52
langchain_cohere: 0.1.4
langchain_text_splitters: 0.0.1

Siddhesh-Agarwal commented 5 months ago

Hey, @beethogedeon can you provide the URL where you are facing the problem?

For the URL you have currently given (https://example.com/), the problem lies in the extractor. You have used a very basic extractor; the code can be changed to:

from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader
from bs4 import BeautifulSoup as Soup

def text_extractor(r_text: str) -> str:
    soup = Soup(r_text, "html.parser")
    return " ".join(soup.text.split())

url = "https://www.example.com/"
loader = RecursiveUrlLoader(
    url=url,
    max_depth=2,
    extractor=text_extractor,
)
docs = loader.load()

Siddhesh-Agarwal commented 5 months ago

PS: This only solves the problem of extra whitespace.
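Since the whitespace fix does not address the original complaint about CSS/JS links, one option (a sketch, assuming each loaded document carries a `content_type` entry in its metadata, which `RecursiveUrlLoader`'s default metadata extractor typically records from the response headers) is to filter the loaded documents after the fact. The function name `keep_html_only` is hypothetical, and plain dicts stand in for `Document` objects here:

```python
# Hypothetical post-load filter: keep only documents whose recorded
# content type looks like HTML. Plain dicts mimic the shape of a
# Document's page_content/metadata for illustration.
def keep_html_only(docs):
    return [
        d for d in docs
        if d.get("metadata", {}).get("content_type", "").startswith("text/html")
    ]

sample = [
    {"page_content": "<html>...</html>",
     "metadata": {"content_type": "text/html; charset=utf-8"}},
    {"page_content": "body { color: red; }",
     "metadata": {"content_type": "text/css"}},
]
filtered = keep_html_only(sample)
```

With real `Document` objects the same idea applies, reading `doc.metadata.get("content_type", "")` instead of dict lookups.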