[X] I added a very descriptive title to this issue.
[X] I searched the LangChain documentation with the integrated search.
[X] I used the GitHub search to find a similar question and didn't find it.
[X] I am sure that this is a bug in LangChain rather than my code.
[X] The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
Example Code
Property Addition
To address the issue of extracting embedded links along with text from PDFs, I modified the PyMuPDFLoader and PyMuPDFParser classes by adding a new property: with_embedded_links: bool.
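The flag wiring described above could look roughly like the sketch below. It is only an illustration: LoaderSketch and ParserSketch are hypothetical stand-ins, not the real LangChain classes, and the only name taken from this issue is with_embedded_links. The flag defaults to False so existing callers would keep the current behavior.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ParserSketch:
    """Hypothetical stand-in for PyMuPDFParser (not the LangChain class)."""

    text_kwargs: Dict[str, Any] = field(default_factory=dict)
    # Proposed flag from this issue; off by default to stay backward compatible.
    with_embedded_links: bool = False


@dataclass
class LoaderSketch:
    """Hypothetical stand-in for PyMuPDFLoader (not the LangChain class)."""

    file_path: str
    with_embedded_links: bool = False
    text_kwargs: Dict[str, Any] = field(default_factory=dict)

    def make_parser(self) -> ParserSketch:
        # The loader simply forwards the flag to the parser it creates.
        return ParserSketch(
            text_kwargs=dict(self.text_kwargs),
            with_embedded_links=self.with_embedded_links,
        )
```

In this sketch the loader only forwards the flag; the parser is where the behavior actually changes.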
Modification for Parsing Text with Links
To handle text extraction along with embedded links, I modified the lazy_parse function in the PyMuPDFParser class. Instead of using page_content=page.get_text(**self.text_kwargs) + self._extract_images_from_page(doc, page), I implemented page_content=self._get_page_content(page) + self._extract_images_from_page(doc, page).
The updated _get_page_content function is as follows:
    def _get_page_content(self, page) -> str:
        if not self.with_embedded_links:
            return page.get_text(**self.text_kwargs)
        import fitz
        extracted_text: str = ""
        # Get text in dictionary form to analyze the content
        text_instances = page.get_text("dict")
        # Get all hyperlinks on the page
        links = page.get_links()
        # Prepare a list to store text with hyperlinks
        text_with_links = []
        # Iterate through each block of text
        for block in text_instances["blocks"]:
            if 'lines' not in block:
                continue
            for line in block["lines"]:
                for span in line["spans"]:
                    span_bbox = span["bbox"]
                    span_text = span["text"]
                    # Check if the span overlaps with any hyperlink
                    hyperlink = None
                    for link in links[:]:
                        link_bbox = link["from"]  # Get the bounding box of the hyperlink area
                        if fitz.Rect(span_bbox).intersects(fitz.Rect(link_bbox)) and 'uri' in link:
                            hyperlink = link["uri"]  # Get the hyperlink URL
                            links.remove(link)  # Remove the link from the list
                            break
                    # Append the text along with the hyperlink (if found)
                    if hyperlink:
                        text_with_links.append(f"{span_text} [URL: {hyperlink}]")
                    else:
                        text_with_links.append(span_text)
        # Combine the extracted text
        extracted_text = ("\n".join(text_with_links))
        return extracted_text
This modification allows the extraction of both the text and any embedded links from the PDF. I believe this could be a useful feature and hope it can be considered for inclusion as an option for PDF text extraction with embedded links.
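For review purposes, the span-to-link matching can also be tried without opening a PDF. The sketch below is a possible variant, not the code from this issue: it works on plain dicts and tuples shaped like the results of page.get_text("dict") and page.get_links(), uses a simple positive-area overlap test as a stand-in for fitz.Rect.intersects, keeps the spans of one line together, and writes the URL once after the last span covered by a link. The helper names boxes_overlap and annotate_spans are hypothetical.

```python
from typing import Any, Dict, List, Optional, Sequence

Box = Sequence[float]  # (x0, y0, x1, y1)


def boxes_overlap(a: Box, b: Box) -> bool:
    """True when the two rectangles share a region of positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def annotate_spans(page_dict: Dict[str, Any], links: List[Dict[str, Any]]) -> str:
    """Return the page text with ' [URL: ...]' appended after each linked run."""
    out_lines: List[str] = []
    for block in page_dict.get("blocks", []):
        # Blocks without "lines" (image blocks, for example) are skipped.
        for line in block.get("lines", []):
            parts: List[str] = []
            pending: Optional[str] = None  # URI of the link run currently open
            for span in line["spans"]:
                uri = next(
                    (
                        link["uri"]
                        for link in links
                        if "uri" in link and boxes_overlap(span["bbox"], link["from"])
                    ),
                    None,
                )
                # Close the previous run when the link changes or ends.
                if pending and uri != pending:
                    parts.append(f" [URL: {pending}]")
                parts.append(span["text"])
                pending = uri
            if pending:
                parts.append(f" [URL: {pending}]")
            out_lines.append("".join(parts))
    return "\n".join(out_lines)
```

Unlike the version above, this variant does not remove a link after its first match, so a link that covers several spans is reported once at the end of the run rather than only on its first span. Whether that is preferable depends on how the extracted text is consumed downstream.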
Additionally, it would be beneficial to extend this capability to UnstructuredPDFLoader or UnstructuredFileLoader to support link extraction along with text.
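For example:
Input: To see FAQs click here
Output of text extraction before modification: To see FAQs click here
Output of text extraction after modification: To see FAQs click here [URL: https://www.faqs_check.com]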
Overall Summary
Added a with_embedded_links: bool property to PyMuPDFLoader and PyMuPDFParser.
Modified the lazy_parse function in PyMuPDFParser to use a custom _get_page_content function.
The _get_page_content(page) function extracts text along with embedded links from a PDF.
Error Message and Stack Trace (if applicable)
No response
Description
Problem
I am trying to extract text along with embedded links from PDFs so that the AI can provide the links when needed. As far as I can tell, no existing PDF loader in LangChain supports this. To solve this, I implemented a custom modification. While my solution works, I believe it should be reviewed and potentially added to help others who also need to extract both text and embedded links from PDFs.
What I Need?
The ability to extract text with embedded links from PDFs.
What I Have Done
I made a simple modification to the PyMuPDFLoader and PyMuPDFParser to achieve this functionality.
What I Expect
I would appreciate a review of my code, and if possible, suggestions for a better solution. I also hope that this functionality could be extended not just to PyMuPDFLoader but also to UnstructuredPDFLoader and UnstructuredFileLoader to support link extraction along with text from PDFs.
Example
Suppose a PDF contains the text "To see FAQs click here" with an embedded link. It will be extracted as: To see FAQs click here [URL: https://www.faqs_check.com]
System Info
System Information
OS: Windows
OS Version: 10.0.22631
Python Version: 3.11.7 (tags/v3.11.7:fa7a6f2, Dec 4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)]