Open anupray99 opened 1 month ago
Hey @anupray99! 👋 I'm here to help you with any issues you're experiencing while using LangChain.js. I can assist with bug fixes, answer questions, and guide you to become a contributor. Let's tackle this PDF hyperlink extraction problem together!
To extract URLs from hyperlinks within a PDF using a LangChain.js PDF loader (here, WebPDFLoader), you can follow this example:
import { WebPDFLoader } from "@langchain/community/document_loaders/web/pdf";

export const run = async () => {
  // Placeholder: replace with the actual PDF Blob, e.g. from a file input
  const blob = new Blob();
  const loader = new WebPDFLoader(blob);
  const docs = await loader.load();

  // Scan each document's extracted text for URLs
  const urlRegex = /(https?:\/\/[^\s]+)/g;
  const urls = [];
  docs.forEach((doc) => {
    // LangChain documents expose their text as `pageContent`
    const matches = doc.pageContent.match(urlRegex);
    if (matches) {
      urls.push(...matches);
    }
  });
  console.log({ urls });
};

run();
This script will load the PDF from a Blob, extract the text, and then use a regular expression to find and extract URLs from the text [1][2].
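Note that a regular expression over the extracted text only finds URLs that are printed literally on the page. For that text-only case, a slightly more careful helper might look like the sketch below. The name `extractUrlsFromText`, the de-duplication, and the trailing-punctuation trimming are illustrative assumptions, not part of LangChain.js; the input is assumed to be one string per loaded document.

```typescript
// Hypothetical helper (not part of LangChain.js): collects unique URLs
// that appear literally in the given page texts, in first-seen order.
function extractUrlsFromText(pages: string[]): string[] {
  // Stop at whitespace and at characters that usually wrap a URL.
  const urlRegex = /https?:\/\/[^\s<>"')\]]+/g;
  const seen = new Set<string>();
  for (const text of pages) {
    for (const match of text.match(urlRegex) ?? []) {
      // Drop punctuation that normally belongs to the sentence, not the URL.
      seen.add(match.replace(/[.,;:!?]+$/, ""));
    }
  }
  return Array.from(seen);
}

// Made-up sample text standing in for extracted page content.
const sampleTexts = [
  "See https://example.com/docs. Also (https://example.org/a?b=1).",
  "Repeat: https://example.com/docs",
];
console.log(extractUrlsFromText(sampleTexts));
```

With LangChain documents, the input could be built with something like `docs.map((d) => d.pageContent)`.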
To continue talking to Dosu, mention @dosu.
Tried this approach, still facing the same issue.
To extract URLs from hyperlinks within PDFs using the LangChain.js library, you can create a custom extractor function that processes the text content of the PDF to find URLs. Since PDFLoader and WebPDFLoader are not recognizing or processing the URLs, you can use the RecursiveUrlLoader class as a reference to build your own solution.
Below is a sample implementation:
// pdf-lib has no text-extraction API, so this uses pdfjs-dist instead.
// The import path differs between pdfjs-dist versions; adjust it to yours.
import { getDocument } from 'pdfjs-dist/legacy/build/pdf.mjs';
import fetch from 'node-fetch';

// Download the PDF and return its raw bytes
async function fetchPdf(url: string): Promise<Uint8Array> {
  const response = await fetch(url);
  const arrayBuffer = await response.arrayBuffer();
  return new Uint8Array(arrayBuffer);
}

// Concatenate the text of every page (pdfjs-dist pages are 1-indexed)
async function extractTextFromPdf(pdfData: Uint8Array): Promise<string> {
  const pdfDoc = await getDocument({ data: pdfData }).promise;
  let text = '';
  for (let pageNumber = 1; pageNumber <= pdfDoc.numPages; pageNumber++) {
    const page = await pdfDoc.getPage(pageNumber);
    const pageText = await page.getTextContent();
    // Marked-content items carry no `str`, so treat them as empty
    text += pageText.items.map(item => ('str' in item ? item.str : '')).join(' ') + ' ';
  }
  return text;
}

// Find every http(s) URL in the extracted text
function extractUrls(text: string): string[] {
  const urlRegex = /https?:\/\/[^\s/$.?#].[^\s]*/g;
  return text.match(urlRegex) || [];
}

async function extractUrlsFromPdf(url: string): Promise<string[]> {
  const pdfData = await fetchPdf(url);
  const text = await extractTextFromPdf(pdfData);
  return extractUrls(text);
}

// Example usage
const pdfUrl = 'https://example.com/sample.pdf';
extractUrlsFromPdf(pdfUrl).then(urls => {
  console.log('Extracted URLs:', urls);
});
In this implementation:
- fetchPdf fetches the PDF data from a given URL.
- extractTextFromPdf extracts the text content from the PDF.
- extractUrls uses a regular expression to find URLs within the extracted text.
- extractUrlsFromPdf combines these steps to extract URLs from a PDF given its URL.
This approach extracts URLs that appear in the PDF's text even if PDFLoader and WebPDFLoader are not recognizing or processing them [1].
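If the URLs exist only as clickable link annotations (the visible text reads something like "click here"), no regular expression over the extracted text can recover them, which may be why the text-based approach still fails. The sketch below assumes you parse the PDF with pdfjs-dist yourself and pass in the result of `page.getAnnotations()` for each page. It also assumes that link annotations carry `subtype: "Link"` and a `url` field, which is worth verifying against your pdfjs-dist version. `LinkAnnotation` and `collectLinkUrls` are hypothetical names introduced for illustration.

```typescript
// Minimal shape of the annotation objects this sketch relies on (assumed,
// modeled on what pdfjs-dist's page.getAnnotations() returns).
interface LinkAnnotation {
  subtype: string;
  url?: string;
}

// Hypothetical helper: keeps only external link targets, de-duplicated,
// in first-seen order. `pages` holds one annotation array per PDF page.
function collectLinkUrls(pages: LinkAnnotation[][]): string[] {
  const urls = new Set<string>();
  for (const annotations of pages) {
    for (const annotation of annotations) {
      // Links that jump within the PDF have no `url`; skip them.
      if (annotation.subtype === "Link" && annotation.url) {
        urls.add(annotation.url);
      }
    }
  }
  return Array.from(urls);
}

// Hand-written data standing in for real getAnnotations() output.
const examplePages: LinkAnnotation[][] = [
  [{ subtype: "Link", url: "https://example.com/a" }, { subtype: "Text" }],
  [{ subtype: "Link" }, { subtype: "Link", url: "https://example.com/a" }],
];
console.log(collectLinkUrls(examplePages));
```

In practice the per-page arrays would be gathered by loading the document with pdfjs-dist, iterating over its pages, and awaiting `getAnnotations()` on each one.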
Checked other resources
Example Code
```
const { PDFLoader } = require('langchain/document_loaders/fs/pdf');
const { RecursiveCharacterTextSplitter } = require('langchain/text_splitter');

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize,
  chunkOverlap,
})
const loader = new PDFLoader(filePath);
const rawDocuments = await loader.loadAndSplit(splitter);
```
Error Message and Stack Trace (if applicable)
No response
Description
I am trying to use the langchain library to read PDFs using pdfLoader. I expect it to extract URLs from hyperlinks within the PDF, but it fails to do so.
Despite updating to the latest versions of langchain, the issue persists. It seems pdfLoader is unable to recognize or process the URLs from the hyperlinks.
System Info
Platform: Windows
Node version: v18.18.0
Yarn version: 1.22.19