langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License
92.62k stars 14.82k forks

Unable to Extract Text along with Embedded Links from PDF for AI-Driven Link Retrieval #26776

Open zonayedriyadh opened 11 hours ago

zonayedriyadh commented 11 hours ago

Example Code

Property Addition

To address the issue of extracting embedded links along with text from PDFs, I modified the PyMuPDFLoader and PyMuPDFParser classes by adding a new property: with_embedded_links: bool.

class PyMuPDFLoader(BasePDFLoader):
    """Load `PDF` files using `PyMuPDF`."""

    def __init__(
        self,
        file_path: str,
        *,
        headers: Optional[Dict] = None,
        extract_images: bool = False,
        with_embedded_links: bool = False,
        **kwargs: Any,
    ) -> None:

And similarly in PyMuPDFParser:

class PyMuPDFParser(BaseBlobParser):
    """Parse `PDF` using `PyMuPDF`."""

    def __init__(
        self,
        text_kwargs: Optional[Mapping[str, Any]] = None,
        extract_images: bool = False,
        with_embedded_links: bool = False,
    ) -> None:
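The two signatures above only show the new keyword argument. For the flag to have any effect, the loader has to forward it to the parser it creates. The sketch below illustrates that wiring with stand-in classes; `SketchLoader` and `SketchParser` are hypothetical names used for illustration, not the actual LangChain classes, and the way the loader holds its parser is an assumption.

```python
from typing import Any, Mapping, Optional


class SketchParser:
    """Stand-in for PyMuPDFParser; only the constructor is modeled."""

    def __init__(
        self,
        text_kwargs: Optional[Mapping[str, Any]] = None,
        extract_images: bool = False,
        with_embedded_links: bool = False,
    ) -> None:
        self.text_kwargs = dict(text_kwargs or {})
        self.extract_images = extract_images
        # Stored so that the page-content helper can branch on it later.
        self.with_embedded_links = with_embedded_links


class SketchLoader:
    """Stand-in for PyMuPDFLoader; it forwards the flag to the parser."""

    def __init__(
        self,
        file_path: str,
        *,
        extract_images: bool = False,
        with_embedded_links: bool = False,
        **kwargs: Any,
    ) -> None:
        self.file_path = file_path
        # Assumption: the loader builds its parser from its own arguments.
        self.parser = SketchParser(
            text_kwargs=kwargs,
            extract_images=extract_images,
            with_embedded_links=with_embedded_links,
        )


loader = SketchLoader("report.pdf", with_embedded_links=True)
```

Because the flag defaults to `False` in both places, existing callers would keep the current text-only behavior.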

Modification for Parsing Text with Links

To handle text extraction along with embedded links, I modified the lazy_parse function in the PyMuPDFParser class. Instead of using:

    page_content=page.get_text(**self.text_kwargs) + self._extract_images_from_page(doc, page)

I implemented:

    page_content=self._get_page_content(page) + self._extract_images_from_page(doc, page)

The updated _get_page_content function is as follows:

    def _get_page_content(self, page) -> str:
        """Return page text, annotating linked spans when with_embedded_links is set."""
        if not self.with_embedded_links:
            return page.get_text(**self.text_kwargs)

        import fitz  # PyMuPDF

        # Get text in dictionary form to analyze the content.
        # Note: text_kwargs is not applied on this code path.
        text_instances = page.get_text("dict")
        # Get all hyperlinks on the page that point to a URI
        links = [link for link in page.get_links() if "uri" in link]
        # Collect span texts, annotated with a hyperlink where one overlaps
        text_with_links = []

        # Iterate through each block of text
        for block in text_instances["blocks"]:
            if "lines" not in block:  # image blocks have no lines
                continue
            for line in block["lines"]:
                for span in line["spans"]:
                    span_rect = fitz.Rect(span["bbox"])
                    span_text = span["text"]

                    # Check if the span overlaps with any hyperlink
                    hyperlink = None
                    for link in links:
                        # link["from"] is the bounding box of the hyperlink area
                        if span_rect.intersects(fitz.Rect(link["from"])):
                            hyperlink = link["uri"]  # Get the hyperlink URL
                            links.remove(link)  # Report each link only once
                            break

                    # Append the text along with the hyperlink (if found)
                    if hyperlink:
                        text_with_links.append(f"{span_text} [URL: {hyperlink}]")
                    else:
                        text_with_links.append(span_text)

        # Combine the extracted text once, after all blocks are processed
        return "\n".join(text_with_links)

This modification allows the extraction of both the text and any embedded links from the PDF. I believe this could be a useful feature and hope it can be considered for inclusion as an option for PDF text extraction with embedded links.
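The matching step can be tried without a PDF or PyMuPDF installed. The following is a minimal sketch that assumes spans and links arrive as plain dictionaries shaped like the "bbox"/"text" and "from"/"uri" entries used above. `rects_intersect` and `annotate_spans` are hypothetical helper names, and the strict-overlap test only approximates `fitz.Rect.intersects`.

```python
from typing import Dict, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def rects_intersect(a: BBox, b: BBox) -> bool:
    """Return True when two axis-aligned rectangles share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def annotate_spans(spans: Sequence[Dict], links: Sequence[Dict]) -> str:
    """Join span texts, appending "[URL: ...]" to spans that overlap a link."""
    # Only links that carry a URI are candidates (internal jumps are skipped).
    remaining = [link for link in links if "uri" in link]
    parts: List[str] = []
    for span in spans:
        match = next(
            (link for link in remaining if rects_intersect(span["bbox"], link["from"])),
            None,
        )
        if match is None:
            parts.append(span["text"])
        else:
            remaining.remove(match)  # each link is reported once
            parts.append(f'{span["text"]} [URL: {match["uri"]}]')
    return "\n".join(parts)


spans = [
    {"text": "See the", "bbox": (10, 10, 60, 22)},
    {"text": "LangChain docs", "bbox": (62, 10, 150, 22)},
]
links = [{"from": (60, 8, 152, 24), "uri": "https://python.langchain.com"}]
annotated = annotate_spans(spans, links)
```

Here only the second span overlaps the link rectangle, so only that span receives the marker. Note that a link covering several spans is attached to the first overlapping span only, which mirrors the behavior of the method above.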

Additionally, it would be beneficial to extend this capability to UnstructuredPDFLoader or UnstructuredFileLoader to support link extraction along with text.

Function Call

Input

Overall Summary

Error Message and Stack Trace (if applicable)

No response

Description

Problem

I am trying to extract text along with embedded links from PDFs so that the AI can provide the links when needed. As far as I can tell, none of the existing PDF loaders supports this, so I implemented the custom modification described above. My solution works for my use case, and I would like it to be reviewed and, if suitable, added to the library to help others who also need to extract both text and embedded links from PDFs.
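For downstream use, the "[URL: ...]" markers produced by the modification can be separated from the text again, for example to keep the links as a list alongside the page content. This is a rough sketch, not part of the proposed change: `split_links` is a hypothetical helper, and it assumes URIs contain neither whitespace nor a closing bracket.

```python
import re
from typing import List, Tuple

# Matches the marker format emitted above, including the space before it.
URL_MARKER = re.compile(r"\s*\[URL: (?P<uri>[^\]\s]+)\]")


def split_links(page_content: str) -> Tuple[str, List[str]]:
    """Return the text without markers, plus the URIs in order of appearance."""
    uris = [m.group("uri") for m in URL_MARKER.finditer(page_content)]
    clean = URL_MARKER.sub("", page_content)
    return clean, uris


text = "Read the docs [URL: https://python.langchain.com]\nfor details"
clean_text, found_uris = split_links(text)
```

Whether the links are better kept inline, as in the proposal, or moved out of the text is a design choice that depends on how the model is expected to cite them.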

What I Need?

What I Have Done

What I Expect

Example

System Info

System Information

OS: Windows
OS Version: 10.0.22631
Python Version: 3.11.7 (tags/v3.11.7:fa7a6f2, Dec 4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)]

Package Information

langchain_core: 0.2.29
langchain: 0.2.11
langchain_community: 0.2.10
langsmith: 0.1.93
langchain_aws: 0.1.16
langchain_google_genai: 1.0.8
langchain_milvus: 0.1.3
langchain_openai: 0.1.17
langchain_postgres: 0.0.9
langchain_text_splitters: 0.2.2
langchainhub: 0.1.20

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.9.5
async-timeout: 4.0.3
boto3: 1.34.149
dataclasses-json: 0.6.7
google-generativeai: 0.7.2
jsonpatch: 1.33
numpy: 1.26.4
openai: 1.37.1
orjson: 3.10.6
packaging: 24.1
pgvector: 0.2.5
pillow: 10.4.0
psycopg: 3.2.1
psycopg-pool: 3.2.2
pydantic: 2.8.2
pymilvus: 2.4.4
PyYAML: 6.0.1
requests: 2.32.3
scipy: 1.14.0
sqlalchemy: 2.0.31
SQLAlchemy: 2.0.31
tenacity: 8.5.0
tiktoken: 0.7.0
types-requests: 2.32.0.20240712
typing-extensions: 4.12.2

keenborder786 commented 3 hours ago

Shouldn't you create a relevant PR for this?