Open mohitpandeyji opened 1 month ago
@mohitpandeyji Hi there! I'm here to help you with any issues or questions you have. I can also assist you in becoming a contributor.
The `PDFLoader` in LangChain.js might not be reading the content of some PDF files because of the variety and complexity of PDF formats. Here are a few steps you can take to address this issue:

**Override the separator**: By default, the loader parses individual text elements and joins them with a space, which does not work correctly for all PDFs. You can try overriding the separator with an empty string:
```typescript
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

const loader = new PDFLoader("src/document_loaders/example_data/example.pdf", {
  parsedItemSeparator: "",
});

const docs = await loader.load();
```
**Handle excessive newlines**: The `WebPDFLoader` class includes a mechanism to handle excessive newlines by checking the vertical position of text items. This can help in cases where content is not read correctly due to formatting issues:
```typescript
const textItems = [];
let lastY;
for (const item of content.items) {
  if ("str" in item) {
    if (lastY === item.transform[5] || !lastY) {
      textItems.push(item.str);
    } else {
      textItems.push(`\n${item.str}`);
    }
    lastY = item.transform[5];
  }
}
const text = textItems.join(this.parsedItemSeparator);
```
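For experimentation outside the loader, the same heuristic can be lifted into a standalone function. This is a sketch: the `TextItem` shape and the `joinWithNewlines` name are hypothetical stand-ins for the pdfjs text items the real code iterates over.

```typescript
// Hypothetical stand-in for a pdfjs text item: transform[5] holds the
// item's vertical position on the page.
interface TextItem {
  str: string;
  transform: number[];
}

// Join text items, inserting a newline whenever the vertical position
// changes, mirroring the WebPDFLoader logic above.
function joinWithNewlines(items: TextItem[], separator = ""): string {
  const out: string[] = [];
  let lastY: number | undefined;
  for (const item of items) {
    if (lastY === item.transform[5] || lastY === undefined) {
      out.push(item.str);
    } else {
      out.push(`\n${item.str}`);
    }
    lastY = item.transform[5];
  }
  return out.join(separator);
}
```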
**Check dependencies**: Ensure that the `pdf-parse` library is installed and correctly set up, as the `PDFLoader` relies on it for its PDF parsing capabilities. If the library is not installed, you can add it using:
```shell
npm install pdf-parse
```
By implementing these steps, you should be able to improve the PDF content reading capabilities of the `PDFLoader` in LangChain.js [1][2].
To continue talking to Dosu, mention @dosu.
This is because the PDF loader only supports parsing plain text (`textContent`). That is a normal trade-off: parsing scanned documents would require analyzing images and extracting text through OCR, which should not be the responsibility of the PDF loader. It is recommended to use non-scanned PDFs.

The implementation of the PDF loader is based on `pdf-parse`.
Link: pdf-parse GitHub page
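Since an image-only (scanned) PDF typically yields pages with no extractable text, one way to surface this failure mode early is to inspect what the loader returned. This is a minimal sketch; `looksScanned` and the `ParsedPage` shape are hypothetical helpers, not part of LangChain:

```typescript
// Minimal shape of a loaded page; LangChain documents expose pageContent.
interface ParsedPage {
  pageContent: string;
}

// Heuristic: if every page came back empty or whitespace-only, the PDF
// is probably scanned (image-only) and needs OCR rather than pdf-parse.
function looksScanned(pages: ParsedPage[]): boolean {
  return (
    pages.length === 0 ||
    pages.every((p) => p.pageContent.trim().length === 0)
  );
}
```

A caller could use this to raise a clearer error ("this PDF appears to be scanned") instead of a generic empty-docs failure.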
```
Done in 2.1s
mohit@mohit-hp-pavilion-laptop:~/Documents/llm-document-search$ npx pnpm prepare:data

> langchain-pinecone@0.0.1 prepare:data /home/mohit/Documents/llm-document-search
> tsx -r dotenv/config ./scripts/alectify-ingest.ts

node:internal/modules/esm/resolve:299
    return new ERR_PACKAGE_PATH_NOT_EXPORTED(
           ^

Error [ERR_PACKAGE_PATH_NOT_EXPORTED]: Package subpath './document_loaders/fs/pdf' is not defined by "exports" in /home/mohit/Documents/llm-document-search/node_modules/.pnpm/node_modules/@langchain/community/package.json
    at exportsNotFound (node:internal/modules/esm/resolve:299:10)
    at packageExportsResolve (node:internal/modules/esm/resolve:646:9)
    at resolveExports (node:internal/modules/cjs/loader:595:36)
    at Module._findPath (node:internal/modules/cjs/loader:685:31)
    at Module._resolveFilename (node:internal/modules/cjs/loader:1172:27)
    at a._resolveFilename (/home/mohit/Documents/llm-document-search/node_modules/.pnpm/tsx@4.7.0/node_modules/tsx/dist/cjs/index.cjs:1:1729)
    at Module._load (node:internal/modules/cjs/loader:1012:27)
    at Module.require (node:internal/modules/cjs/loader:1271:19)
    at require (node:internal/modules/helpers:123:16)
    at /home/mohit/Documents/llm-document-search/lib/pdf-loader.ts:2:1033 {
  code: 'ERR_PACKAGE_PATH_NOT_EXPORTED'
}

Node.js v22.2.0
 ELIFECYCLE  Command failed with exit code 1.
```

@dosu I made the changes according to your first suggestion, but I am getting the above issue.
The error you're encountering, `ERR_PACKAGE_PATH_NOT_EXPORTED`, indicates that the subpath `./document_loaders/fs/pdf` is not defined in the `exports` field of the `@langchain/community` package's `package.json`.

To resolve this, import the `PDFLoader` from the correct path as defined in the package's `exports`. Based on the latest information, use the following import statement:
```typescript
import { PDFLoader } from "@langchain/community/document_loaders/web/pdf";
```
Here's how you can update your code:
```typescript
import { PDFLoader } from "@langchain/community/document_loaders/web/pdf";
import { CSVLoader } from "langchain/document_loaders/fs/csv";
import { DocxLoader } from "langchain/document_loaders/fs/docx";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { Document } from "@langchain/core/documents";

function getLoader(file: Blob, fileName: string) {
  if (fileName.endsWith(".pdf")) {
    return new PDFLoader(file);
  } else if (fileName.endsWith(".csv")) {
    return new CSVLoader(file);
  } else if (fileName.endsWith(".docx")) {
    return new DocxLoader(file);
  }
}

export async function getChunkedDocsFromPdf(
  file: Blob,
  fileName: string,
): Promise<Document<Record<string, any>>[]> {
  try {
    const loader = getLoader(file, fileName);
    if (!loader) {
      throw new Error(`No loader available for file: ${fileName}`);
    }
    const docs = await loader.load();
    if (!docs || docs.length === 0) {
      throw new Error("No documents were loaded from the PDF.");
    }
    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 100,
    });
    // splitDocuments is async, so its result must be awaited
    return await textSplitter.splitDocuments(docs);
  } catch (e) {
    console.log(e);
    console.log("Failed to load pdf");
    throw e;
  }
}
```
This should resolve the `ERR_PACKAGE_PATH_NOT_EXPORTED` error by using the correct export path for the `PDFLoader` [1].
Maybe you could try using the `UnstructuredLoader` to tackle your question.
@mohitpandeyji From my side, it highly depends on the response you get from the server. In my case this was mostly sufficient, since LangChain just wants that blob:
```typescript
export enum MimeTypes {
  pdf = 'application/pdf',
  oct = 'binary/octet-stream',
  txt = 'text/plain',
  utext = 'text/plain; charset=utf-8',
  pptx = 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
  docx = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  xlsx = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
  csv = 'text/csv',
}

export enum FileExtensions {
  pdf = '.pdf',
  txt = '.txt',
  docx = '.docx',
  pptx = '.pptx',
  csv = '.csv',
  xlsx = '.xlsx',
}
```
```typescript
export type DocLoaders =
  | typeof DocxLoader
  | typeof PPTXLoader
  | typeof TextLoader
  | typeof WebPDFLoader
  | typeof CSVLoader;

export async function getDocData(
  loader: DocLoaders,
  data: ArrayBuffer | Buffer | Error,
  url: string,
  type: MimeTypes
): Promise<string> {
  try {
    if (data instanceof Buffer || data instanceof ArrayBuffer) {
      const blob = new Blob([data], { type });
      const docs = await processDoc(blob, loader);
      // map() returns a new array; capture it instead of discarding it
      const pages = docs.map(({ pageContent }: Document, idx: number) => ({
        pageContent,
        metadata: {
          page: idx,
          url,
        },
      }));
      return JSON.stringify(pages);
    }
    return 'error fetching page content';
  } catch (err) {
    return 'error fetching page content';
  }
}

export async function processDoc(blob: Blob, Loader: DocLoaders) {
  const localLoader = new Loader(blob);
  const doc = await localLoader.load();
  return doc;
}
```
```typescript
export async function getDocsByLoader({
  url,
  fileExt,
  data,
}: {
  url: string;
  fileExt: string;
  data: Buffer | ArrayBuffer | Error;
}) {
  if (fileExt === FileExtensions.txt) {
    return await getDocData(TextLoader, data, url, MimeTypes.txt);
  } else if (fileExt === FileExtensions.pdf) {
    return await getDocData(WebPDFLoader, data, url, MimeTypes.pdf);
  } else if (fileExt === FileExtensions.docx) {
    return await getDocData(DocxLoader, data, url, MimeTypes.docx);
  } else if (fileExt === FileExtensions.pptx) {
    return await getDocData(PPTXLoader, data, url, MimeTypes.pptx);
  } else if (fileExt === FileExtensions.csv) {
    return await getDocData(CSVLoader, data, url, MimeTypes.csv);
  } else if (fileExt === FileExtensions.xlsx) {
    const workbook = Xlsx.read(data, { type: 'buffer' });
    let mergedCsvData = `${getFileName(url)},`;
    let isFirstSheet = true;
    // In order to merge all sheets into one CSV, we need to add the sheet name to each row
    workbook.SheetNames.forEach(sheetName => {
      const worksheet = workbook.Sheets[sheetName];
      const csvData = Xlsx.utils.sheet_to_csv(worksheet);
      // Add the sheet name to each row of the current sheet
      const rows = csvData
        .split('\n')
        .map((row, index) => {
          if (index === 0 && isFirstSheet) {
            return row; // Keep the header as is
          }
          return sheetName + ',' + row;
        })
        .join('\n');
      mergedCsvData += rows + '\n';
      isFirstSheet = false;
    });
    return await getDocData(CSVLoader, Buffer.from(mergedCsvData), url, MimeTypes.csv);
  }
  return 'Nothing found';
}
```
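The sheet-merging step above can be isolated into a pure helper, which makes the row-prefixing behaviour easy to test on its own (a sketch; `prefixSheetRows` is a hypothetical name, not part of the code above):

```typescript
// Prefix every row of a sheet's CSV text with the sheet name, keeping
// the very first header row of the first sheet untouched, mirroring the
// xlsx branch above.
function prefixSheetRows(
  csvData: string,
  sheetName: string,
  isFirstSheet: boolean
): string {
  return csvData
    .split("\n")
    .map((row, index) =>
      index === 0 && isFirstSheet ? row : `${sheetName},${row}`
    )
    .join("\n");
}
```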
So, for example, when using the Microsoft Graph API I had to do this conversion first, because otherwise the loader was eating my data:
```typescript
async function readableStreamToArrayBuffer(stream: ReadableStream): Promise<ArrayBuffer> {
  const response = new Response(stream as BodyInit);
  const arrayBuffer = await response.arrayBuffer();
  return arrayBuffer;
}
```
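The conversion can be sanity-checked by round-tripping a string through an in-memory stream (the helper is restated here so the snippet is self-contained; assumes Node 18+, where `Blob` and `Response` are globals):

```typescript
// Drain a ReadableStream into an ArrayBuffer via the Fetch Response API.
async function readableStreamToArrayBuffer(
  stream: ReadableStream
): Promise<ArrayBuffer> {
  const response = new Response(stream as BodyInit);
  return await response.arrayBuffer();
}

// Round-trip: five ASCII bytes in, five bytes out.
async function demo(): Promise<number> {
  const stream = new Blob(["hello"]).stream();
  const buf = await readableStreamToArrayBuffer(stream);
  return buf.byteLength;
}
```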
Checked other resources
Example Code
```typescript
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { CSVLoader } from 'langchain/document_loaders/fs/csv';
import { DocxLoader } from 'langchain/document_loaders/fs/docx';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

function getLoader(file: Blob, fileName: string) {
  if (fileName.endsWith('.pdf')) {
    const loader = new PDFLoader(file);
    return loader;
  } else if (fileName.endsWith('.csv')) {
    const loader = new CSVLoader(file);
    return loader;
  } else if (fileName.endsWith('.docx')) {
    const loader = new DocxLoader(file);
    return loader;
  }
}

export async function getChunkedDocsFromPdf(
  file: Blob,
  fileName: string,
  //@ts-expect-error
): Promise<Document<Record<string, any>>[]> {
  try {
    const loader = getLoader(file, fileName);
  } catch (e) {
    console.log(e);
    console.log('Failed to load pdf');
  }
}
```
Error Message and Stack Trace (if applicable)
```
Preparing chunks from dcf810de-7a0f-4096-ae85-e088b1bd9372/6a32a55d-142c-4490-96c0-00c54eeb6eba/DOCUMENT_UPLOAD/1718207214485-alectify_1249_26184.pdf_32295.pdf_37128.pdf
Error: No documents were loaded from the PDF.
    at getChunkedDocsFromPdf (/home/mohit/Documents/llm-document-search/lib/pdf-loader.ts:2:1646)
    at async ingestPdf (/home/mohit/Documents/llm-document-search/lib/ingest.ts:2:1242)
    at async /home/mohit/Documents/llm-document-search/scripts/alectify-ingest.ts:2:548
Failed to load pdf
Init client script failed TypeError: Cannot read properties of undefined (reading 'length')
    at ingestPdf (/home/mohit/Documents/llm-document-search/lib/ingest.ts:2:1334)
    at async /home/mohit/Documents/llm-document-search/scripts/alectify-ingest.ts:2:548
Error processing dcf810de-7a0f-4096-ae85-e088b1bd9372/6a32a55d-142c-4490-96c0-00c54eeb6eba/DOCUMENT_UPLOAD/1718207214485-alectify_1249_26184.pdf_32295.pdf_37128.pdf: Error: TypeError: Cannot read properties of undefined (reading 'length')
    at ingestPdf (/home/mohit/Documents/llm-document-search/lib/ingest.ts:2:1594)
    at async /home/mohit/Documents/llm-document-search/scripts/alectify-ingest.ts:2:548
```
Description
The `PDFLoader` is not reading the content of some PDF files. Attaching the file for which I am not getting data: 1718207214485-alectify_1249_26184.pdf_32295.pdf_37128.pdf

I am getting `docs` as an empty array for this file, but it works for other files.
System Info
```shell
npx pnpm info langchain
```

```
langchain@0.2.12 | MIT | deps: 16 | versions: 284
Typescript bindings for langchain
https://github.com/langchain-ai/langchainjs/tree/main/langchain/

keywords: llm, ai, gpt3, chain, prompt, prompt engineering, chatgpt, machine learning, ml, openai, embeddings, vectorstores

dist
.tarball: https://registry.npmjs.org/langchain/-/langchain-0.2.12.tgz
.shasum: 3fac0b9519a070689b6dd679d5854abc57824dcf
.integrity: sha512-ZHtJrHUpridZ7IQu7N/wAQ6iMAAO7VLzkupHqKP79S6p+alrPbn1BjRnh+PeGm92YiY5DafTCuvchmujxx7bCQ==
.unpackedSize: 5.4 MB

dependencies:
@langchain/core: >=0.2.11 <0.3.0
@langchain/openai: >=0.1.0 <0.3.0
@langchain/textsplitters: ~0.0.0
binary-extensions: ^2.2.0
js-tiktoken: ^1.0.12
js-yaml: ^4.1.0
jsonpointer: ^5.0.1
langchainhub: ~0.0.8
langsmith: ~0.1.30
ml-distance: ^4.0.0
openapi-types: ^12.1.3
p-retry: 4
uuid: ^10.0.0
yaml: ^2.2.1
zod-to-json-schema: ^3.22.3
zod: ^3.22.4

maintainers:

dist-tags:
latest: 0.2.12
next: 0.2.3-rc.0

published 6 days ago by basproul braceasproul@gmail.com
```