Checked other resources

[X] I added a very descriptive title to this issue.
[X] I searched the LangChain documentation with the integrated search.
[X] I used the GitHub search to find a similar question and didn't find it.
[X] I am sure that this is a bug in LangChain rather than my code.
[X] The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

Property Addition

To address the issue of extracting embedded links along with text from PDFs, I modified the PyMuPDFLoader and PyMuPDFParser classes by adding a new property: with_embedded_links: bool.

class PyMuPDFLoader(BasePDFLoader):
    """Load `PDF` files using `PyMuPDF`."""

    def __init__(
        self,
        file_path: str,
        *,
        headers: Optional[Dict] = None,
        extract_images: bool = False,
        with_embedded_links: bool = False,
        **kwargs: Any,
    ) -> None:

And similarly in PyMuPDFParser:

class PyMuPDFParser(BaseBlobParser):
    """Parse `PDF` using `PyMuPDF`."""

    def __init__(
        self,
        text_kwargs: Optional[Mapping[str, Any]] = None,
        extract_images: bool = False,
        with_embedded_links: bool = False,
    ) -> None:

Modification for Parsing Text with Links

To extract text together with embedded links, I modified the lazy_parse function in the PyMuPDFParser class. Instead of page_content=page.get_text(**self.text_kwargs) + self._extract_images_from_page(doc, page), it now uses page_content=self._get_page_content(page) + self._extract_images_from_page(doc, page).

The updated _get_page_content function is as follows:

    def _get_page_content(self, page) -> str:
        """Return the page text, appending "[URL: ...]" to spans that overlap a link."""
        if not self.with_embedded_links:
            return page.get_text(**self.text_kwargs)

        import fitz

        # Get text in dictionary form to analyze the content
        text_instances = page.get_text("dict")
        # Get the hyperlinks on the page, keeping only those that carry a "uri"
        links = [link for link in page.get_links() if "uri" in link]
        # Prepare a list to store text with hyperlinks
        text_with_links = []

        # Iterate through each block of text
        for block in text_instances["blocks"]:
            # Skip blocks without text lines (for example, image blocks)
            if "lines" not in block:
                continue
            for line in block["lines"]:
                for span in line["spans"]:
                    span_rect = fitz.Rect(span["bbox"])
                    span_text = span["text"]

                    # Check if the span overlaps with any hyperlink
                    hyperlink = None
                    for link in links:
                        # link["from"] is the bounding box of the hyperlink area
                        if span_rect.intersects(fitz.Rect(link["from"])):
                            hyperlink = link["uri"]  # Get the hyperlink URL
                            # Remove the link so its URL is emitted only once
                            links.remove(link)
                            break

                    # Append the text along with the hyperlink (if found)
                    if hyperlink:
                        text_with_links.append(f"{span_text} [URL: {hyperlink}]")
                    else:
                        text_with_links.append(span_text)

        # Combine the extracted text once, after all blocks have been processed
        return "\n".join(text_with_links)
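
For anyone who wants to try the matching logic without a PDF at hand, here is a minimal standalone sketch of the same idea. It assumes plain dictionaries shaped like the output of page.get_text("dict") and page.get_links(), with boxes given as (x0, y0, x1, y1) tuples. The names boxes_intersect and annotate_spans are hypothetical helpers for illustration, not part of PyMuPDF or LangChain. Unlike the function above, this sketch concatenates the spans of one line and puts a newline only between lines, which is a design choice of the sketch rather than behavior from the modification.

```python
from typing import Any, Dict, List, Sequence

Box = Sequence[float]  # (x0, y0, x1, y1)


def boxes_intersect(a: Box, b: Box) -> bool:
    """Return True when two boxes share a non-empty area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def annotate_spans(page_dict: Dict[str, Any], links: List[Dict[str, Any]]) -> str:
    """Join span texts line by line, appending "[URL: ...]" to the first span hit by each link."""
    # Work on a filtered copy so the caller's list is left untouched.
    remaining = [link for link in links if "uri" in link]
    out_lines: List[str] = []
    for block in page_dict.get("blocks", []):
        # Blocks without a "lines" key (assumed here to be image blocks) are skipped.
        for line in block.get("lines", []):
            parts: List[str] = []
            for span in line["spans"]:
                text = span["text"]
                match = next(
                    (lk for lk in remaining if boxes_intersect(span["bbox"], lk["from"])),
                    None,
                )
                if match is not None:
                    remaining.remove(match)  # emit each URL only once
                    text = f"{text} [URL: {match['uri']}]"
                parts.append(text)
            out_lines.append("".join(parts))
    return "\n".join(out_lines)


if __name__ == "__main__":
    page = {
        "blocks": [
            {"lines": [{"spans": [
                {"bbox": (0, 0, 50, 10), "text": "To see FAQs "},
                {"bbox": (50, 0, 90, 10), "text": "click here"},
            ]}]}
        ]
    }
    links = [{"from": (55, 0, 90, 10), "uri": "https://www.faqs_check.com"}]
    print(annotate_spans(page, links))
```

Because each link is consumed by the first span it overlaps, a link that covers several spans is annotated only once, after the first of those spans. Whether the URL should instead follow the last overlapping span is an open question for review.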

This modification allows the extraction of both the text and any embedded links from the PDF. I believe this could be a useful feature and hope it can be considered for inclusion as an option for PDF text extraction with embedded links.

Additionally, it would be beneficial to extend this capability to UnstructuredPDFLoader or UnstructuredFileLoader to support link extraction along with text.

Function Call

PyMuPDFLoader(file_path=path, with_embedded_links=True)

Input

To see FAQs click here

Output of Extraction Text Before Modification

To see FAQs click here

Output of Extraction Text After Modification

To see FAQs click here [URL: https://www.faqs_check.com] (or something like this)

Overall Summary

Added a with_embedded_links: bool property to PyMuPDFLoader and PyMuPDFParser.
Modified the lazy_parse function in PyMuPDFParser to use a custom _get_page_content function.
The _get_page_content(page) function extracts text along with embedded links from a PDF.

Error Message and Stack Trace (if applicable)

No response

Description

Problem

I am trying to extract text along with embedded links from PDFs so that the AI can provide the links when needed. Currently, no existing PDF loader supports this functionality. To solve this, I implemented a custom modification. While my solution works, I believe it should be reviewed and potentially added to help others who also need to extract both text and embedded links from PDFs.

What I Need

The ability to extract text with embedded links from PDFs.

What I Have Done

I made a simple modification to the PyMuPDFLoader and PyMuPDFParser to achieve this functionality.

What I Expect

I would appreciate a review of my code, and if possible, suggestions for a better solution. I also hope that this functionality could be extended not just to PyMuPDFLoader but also to UnstructuredPDFLoader and UnstructuredFileLoader to support link extraction along with text from PDFs.

Example

Suppose a PDF contains the text "To see FAQs click here", where "click here" carries an embedded link. It should be extracted as: To see FAQs click here [URL: https://www.faqs_check.com]
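
To illustrate how downstream code could pick the links back out of text in this format, here is a small sketch. It assumes the "[URL: ...]" marker format from the example above and URLs that contain no whitespace or closing bracket. The name split_links is a hypothetical helper, not an existing LangChain function.

```python
import re
from typing import List, Tuple

# Matches the " [URL: <url>]" marker, including any whitespace right before it.
URL_MARKER = re.compile(r"\s*\[URL: (?P<url>[^\]\s]+)\]")


def split_links(text: str) -> Tuple[str, List[str]]:
    """Return the text with markers removed, plus the URLs in order of appearance."""
    urls = [m.group("url") for m in URL_MARKER.finditer(text)]
    return URL_MARKER.sub("", text), urls


if __name__ == "__main__":
    clean, urls = split_links("To see FAQs click here [URL: https://www.faqs_check.com]")
    print(clean)
    print(urls)
```

Keeping the URLs in a separate list like this would also make it possible to store them in document metadata instead of inline, which might be worth considering as an alternative to changing page_content.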

System Info

System Information

OS: Windows
OS Version: 10.0.22631
Python Version: 3.11.7 (tags/v3.11.7:fa7a6f2, Dec 4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)]

Package Information

langchain_core: 0.2.29
langchain: 0.2.11
langchain_community: 0.2.10
langsmith: 0.1.93
langchain_aws: 0.1.16
langchain_google_genai: 1.0.8
langchain_milvus: 0.1.3
langchain_openai: 0.1.17
langchain_postgres: 0.0.9
langchain_text_splitters: 0.2.2
langchainhub: 0.1.20

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.9.5
async-timeout: 4.0.3
boto3: 1.34.149
dataclasses-json: 0.6.7
google-generativeai: 0.7.2
jsonpatch: 1.33
numpy: 1.26.4
openai: 1.37.1
orjson: 3.10.6
packaging: 24.1
pgvector: 0.2.5
pillow: 10.4.0
psycopg: 3.2.1
psycopg-pool: 3.2.2
pydantic: 2.8.2
pymilvus: 2.4.4
PyYAML: 6.0.1
requests: 2.32.3
scipy: 1.14.0
sqlalchemy: 2.0.31
SQLAlchemy: 2.0.31
tenacity: 8.5.0
tiktoken: 0.7.0
types-requests: 2.32.0.20240712
typing-extensions: 4.12.2

langchain-ai / langchain

Unable to Extract Text along with Embedded Links from PDF for AI-Driven Link Retrieval #26776