langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

BeautifulSoup transformer fails to treat links with internal tags the same way #25018

Open krodyrobi opened 4 months ago

krodyrobi commented 4 months ago


Example Code

from langchain_community.document_transformers import BeautifulSoupTransformer
from langchain_core.documents import Document

# A link whose text is wrapped in an inner <span>
text = """<a href="https://google.com/"><span>google</span></a>"""

transformer = BeautifulSoupTransformer()
docs = transformer.transform_documents(
    [Document(page_content=text)],
    tags_to_extract=["p", "li", "div", "a", "span", "h1", "h2", "h3", "h4", "h5", "h6"],
    remove_comments=True,
)

# Expected: google (https://google.com/)
# Actual:   google
print(docs[0].page_content)

Error Message and Stack Trace (if applicable)

No response

Description

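Extracting a plain link such as <a href="https://google.com/">google</a> produces google (https://google.com/). When the link text is wrapped in an interior tag, as in the example above, I expect the same format, but the output is just google: the URL is dropped because of the interior tags.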
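
To make the expected behavior concrete, here is a minimal sketch that uses only the standard library's html.parser. It is not the transformer's implementation and not a proposed patch; the names LinkTextExtractor and extract are hypothetical and exist only for this illustration. It gathers link text at any nesting depth inside an <a> and appends the href once the link closes, so both forms of the link render the same way.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Collect text, rendering each <a href=...> as 'text (href)'.

    Hypothetical helper for illustration only; not part of LangChain.
    """

    def __init__(self):
        super().__init__()
        self.parts = []        # finished output fragments
        self._href = None      # href of the <a> currently open, if any
        self._link_text = []   # text seen inside the open <a>, at any depth

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._link_text = []

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._href is not None:
            # Inside a link: buffer the text regardless of inner tags.
            self._link_text.append(text)
        else:
            self.parts.append(text)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = " ".join(self._link_text)
            self.parts.append(f"{label} ({self._href})")
            self._href = None
            self._link_text = []


def extract(html):
    """Return the text of `html` with links formatted as 'text (href)'."""
    parser = LinkTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


plain = extract('<a href="https://google.com/">google</a>')
nested = extract('<a href="https://google.com/"><span>google</span></a>')
print(plain)
print(nested)
```

Both calls should print the same line, which is the behavior this issue asks the transformer to provide for nested link text.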

System Info

System Information

OS: Windows
OS Version: 10.0.19045
Python Version: 3.12.3 (tags/v3.12.3:f6650f9, Apr 9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)]

Package Information

langchain_core: 0.2.28
langchain: 0.2.12
langchain_community: 0.2.11
langsmith: 0.1.96
langchain_text_splitters: 0.2.2

Packages not installed (Not Necessarily a Problem)

The following packages were not found:

langgraph
langserve

dosubot[bot] commented 1 month ago

Hi, @krodyrobi. I'm helping the LangChain team manage their backlog and am marking this issue as stale.

You reported a bug in the BeautifulSoup transformer where links that contain internal tags are not formatted correctly, so the output omits the expected URLs. As there have been no further comments or activity on this issue, we want to ensure it is still relevant.

Could you please let us know if this issue is still applicable to the latest version of the LangChain repository? If it is, feel free to comment here to keep it open. Otherwise, you can close the issue yourself, or it will be automatically closed in 7 days. Thank you!

krodyrobi commented 1 month ago

Yes, keep it open.

dosubot[bot] commented 1 month ago

@eyurtsev, the user has confirmed that the issue with the BeautifulSoup transformer is still relevant and should remain open. Could you please assist them with this?