langchain-ai / langchainjs

🦜🔗 Build context-aware reasoning applications 🦜🔗
https://js.langchain.com/docs/
MIT License

Langchain pdfLoader fails to extract URLs from hyperlinks in PDFs #6146

Open anupray99 opened 1 month ago

anupray99 commented 1 month ago


Example Code

const { PDFLoader } = require('langchain/document_loaders/fs/pdf');
const { RecursiveCharacterTextSplitter } = require('langchain/text_splitter');

const splitter = new RecursiveCharacterTextSplitter({ chunkSize, chunkOverlap });
const loader = new PDFLoader(filePath);
const rawDocuments = await loader.loadAndSplit(splitter);

Error Message and Stack Trace (if applicable)

No response

Description

I am trying to use the langchain library to read PDFs using PDFLoader. My expectation is that it will extract URLs from hyperlinks within the PDF, but it fails to do so.

Despite updating to the latest version of langchain, the issue persists. It seems PDFLoader is unable to recognize or process the URLs from the hyperlinks.
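
A minimal way to observe this (both the sample.pdf path and its contents are hypothetical — assume a file whose hyperlink target is not also written out in the visible text):

import { PDFLoader } from 'langchain/document_loaders/fs/pdf';

const docs = await new PDFLoader('./sample.pdf').load();
// The hyperlink's target URL never appears in the extracted text:
console.log(docs.some((doc) => doc.pageContent.includes('http')));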

System Info

Platform: Windows
Node version: v18.18.0
Yarn version: 1.22.19

dosubot[bot] commented 1 month ago

Hey @anupray99! 👋 I'm here to help you with any issues you're experiencing while using LangChain.js. I can assist with bug fixes, answer questions, and guide you to become a contributor. Let's tackle this PDF hyperlink extraction problem together!

To extract URLs from hyperlinks within a PDF using the LangChain.js library's PDFLoader, you can follow this example:

import { WebPDFLoader } from "@langchain/community/document_loaders/web/pdf";

export const run = async () => {
  const blob = new Blob(); // e.g. from a file input
  const loader = new WebPDFLoader(blob);

  const docs = await loader.load();

  // Each document's extracted text lives on its pageContent field;
  // scan it for URLs with a regular expression.
  const urls = [];
  docs.forEach(doc => {
    const urlRegex = /(https?:\/\/[^\s]+)/g;
    const matches = doc.pageContent.match(urlRegex);
    if (matches) {
      urls.push(...matches);
    }
  });

  console.log({ urls });
};

run();

This script loads the PDF from a Blob, extracts the text, and then uses a regular expression to pull URLs out of that text [1][2].
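
Note that new Blob() above is a placeholder. If you are testing from Node 18+ rather than a browser, one way to build the Blob is to read the PDF from disk (the sample.pdf path is hypothetical):

import { readFile } from 'node:fs/promises';

const data = await readFile('./sample.pdf'); // hypothetical local file
// Node 18+ exposes Blob as a global, so no extra import is needed.
const blob = new Blob([data], { type: 'application/pdf' });
const loader = new WebPDFLoader(blob);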

To continue talking to Dosu, mention @dosu.

anupray99 commented 1 month ago

Tried this approach, still facing the same issue.

dosubot[bot] commented 1 month ago

To extract URLs from hyperlinks within PDFs using the LangChain.js library, you can create a custom extractor function that processes the text content of the PDF to find URLs. Since PDFLoader and WebPDFLoader are not recognizing or processing the URLs, you can use the RecursiveUrlLoader class as a reference to build your own solution.

Here's an example of how you can achieve this:

  1. Load the PDF content.
  2. Extract the text from the PDF.
  3. Use a regular expression to find URLs within the text.

Below is a sample implementation (it uses pdfjs-dist for the page text, since pdf-lib does not expose a text-extraction API):

import { getDocument } from 'pdfjs-dist';
import fetch from 'node-fetch';

async function fetchPdf(url: string): Promise<Uint8Array> {
  const response = await fetch(url);
  const arrayBuffer = await response.arrayBuffer();
  return new Uint8Array(arrayBuffer);
}

async function extractTextFromPdf(pdfData: Uint8Array): Promise<string> {
  const pdfDoc = await getDocument({ data: pdfData }).promise;
  let text = '';

  // Page numbers in pdfjs-dist are 1-based.
  for (let i = 1; i <= pdfDoc.numPages; i += 1) {
    const page = await pdfDoc.getPage(i);
    const content = await page.getTextContent();
    text += content.items.map((item) => ('str' in item ? item.str : '')).join(' ');
  }

  return text;
}

function extractUrls(text: string): string[] {
  const urlRegex = /https?:\/\/[^\s/$.?#].[^\s]*/g;
  return text.match(urlRegex) || [];
}

async function extractUrlsFromPdf(url: string): Promise<string[]> {
  const pdfData = await fetchPdf(url);
  const text = await extractTextFromPdf(pdfData);
  return extractUrls(text);
}

// Example usage
const pdfUrl = 'https://example.com/sample.pdf';
extractUrlsFromPdf(pdfUrl).then(urls => {
  console.log('Extracted URLs:', urls);
});

In this implementation, fetchPdf downloads the PDF, extractTextFromPdf concatenates the text content of every page, and extractUrls applies a regular expression to collect the matches.

This approach lets you extract URLs that appear in a PDF's text even when PDFLoader and WebPDFLoader do not surface them [1].
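
One caveat: a regular expression over the text layer only finds URLs that are written out in the visible text. A hyperlink whose target is stored solely as a link annotation (the clickable region) never enters the extracted text, which matches the behavior reported above. As a sketch of reading the annotations directly with pdfjs-dist (the extractAnnotationUrls name is illustrative, and it reuses fetchPdf from the snippet above):

import { getDocument } from 'pdfjs-dist';

async function extractAnnotationUrls(pdfData: Uint8Array): Promise<string[]> {
  const pdfDoc = await getDocument({ data: pdfData }).promise;
  const urls: string[] = [];

  for (let i = 1; i <= pdfDoc.numPages; i += 1) {
    const page = await pdfDoc.getPage(i);
    const annotations = await page.getAnnotations();
    for (const annotation of annotations) {
      // External hyperlinks are Link annotations carrying a url field.
      if (annotation.subtype === 'Link' && annotation.url) {
        urls.push(annotation.url);
      }
    }
  }

  return urls;
}

// Example usage, reusing fetchPdf() from above:
fetchPdf('https://example.com/sample.pdf')
  .then(extractAnnotationUrls)
  .then((urls) => console.log('Annotation URLs:', urls));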

To continue talking to Dosu, mention @dosu.