langchain-ai / langchainjs

🦜🔗 Build context-aware reasoning applications 🦜🔗
https://js.langchain.com/docs/
MIT License

pdfloader not reading content of some pdf files #6376

Open mohitpandeyji opened 1 month ago

mohitpandeyji commented 1 month ago


Example Code

import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { CSVLoader } from 'langchain/document_loaders/fs/csv';
import { DocxLoader } from 'langchain/document_loaders/fs/docx';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

function getLoader(file: Blob, fileName: string) {
  if (fileName.endsWith('.pdf')) {
    const loader = new PDFLoader(file);
    return loader;
  } else if (fileName.endsWith('.csv')) {
    const loader = new CSVLoader(file);
    return loader;
  } else if (fileName.endsWith('.docx')) {
    const loader = new DocxLoader(file);
    return loader;
  }
}

export async function getChunkedDocsFromPdf(
  file: Blob,
  fileName: string,
  //@ts-expect-error
): Promise<Document<Record<string, any>>[]> {
  try {
    const loader = getLoader(file, fileName);

    const docs = await loader.loadAndSplit();
    // const docs = await loader.load();

    if (!docs || docs.length === 0) {
      throw new Error('No documents were loaded from the PDF.');
    }

    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 100,
    });

    const chunkedDocs = textSplitter.splitDocuments(docs);

    return chunkedDocs;
  } catch (e) {
    console.log(e);
    console.log('Failed to load pdf');
  }
}

Error Message and Stack Trace (if applicable)

Preparing chunks from dcf810de-7a0f-4096-ae85-e088b1bd9372/6a32a55d-142c-4490-96c0-00c54eeb6eba/DOCUMENT_UPLOAD/1718207214485-alectify_1249_26184.pdf_32295.pdf_37128.pdf
Error: No documents were loaded from the PDF.
    at getChunkedDocsFromPdf (/home/mohit/Documents/llm-document-search/lib/pdf-loader.ts:2:1646)
    at async ingestPdf (/home/mohit/Documents/llm-document-search/lib/ingest.ts:2:1242)
    at async /home/mohit/Documents/llm-document-search/scripts/alectify-ingest.ts:2:548
Failed to load pdf
Init client script failed TypeError: Cannot read properties of undefined (reading 'length')
    at ingestPdf (/home/mohit/Documents/llm-document-search/lib/ingest.ts:2:1334)
    at async /home/mohit/Documents/llm-document-search/scripts/alectify-ingest.ts:2:548
Error processing dcf810de-7a0f-4096-ae85-e088b1bd9372/6a32a55d-142c-4490-96c0-00c54eeb6eba/DOCUMENT_UPLOAD/1718207214485-alectify_1249_26184.pdf_32295.pdf_37128.pdf: Error: TypeError: Cannot read properties of undefined (reading 'length')
    at ingestPdf (/home/mohit/Documents/llm-document-search/lib/ingest.ts:2:1594)
    at async /home/mohit/Documents/llm-document-search/scripts/alectify-ingest.ts:2:548

Description

PDFLoader is not reading the content of some PDF files. Attaching the file for which I am not getting data: 1718207214485-alectify_1249_26184.pdf_32295.pdf_37128.pdf

const docs = await loader.loadAndSplit();

I am getting docs as an empty array for the above file, but it works for other files.

System Info

npx pnpm info langchain

langchain@0.2.12 | MIT | deps: 16 | versions: 284
Typescript bindings for langchain
https://github.com/langchain-ai/langchainjs/tree/main/langchain/

keywords: llm, ai, gpt3, chain, prompt, prompt engineering, chatgpt, machine learning, ml, openai, embeddings, vectorstores

dist
.tarball: https://registry.npmjs.org/langchain/-/langchain-0.2.12.tgz
.shasum: 3fac0b9519a070689b6dd679d5854abc57824dcf
.integrity: sha512-ZHtJrHUpridZ7IQu7N/wAQ6iMAAO7VLzkupHqKP79S6p+alrPbn1BjRnh+PeGm92YiY5DafTCuvchmujxx7bCQ==
.unpackedSize: 5.4 MB

dependencies:
@langchain/core: >=0.2.11 <0.3.0
@langchain/openai: >=0.1.0 <0.3.0
@langchain/textsplitters: ~0.0.0
binary-extensions: ^2.2.0
js-tiktoken: ^1.0.12
js-yaml: ^4.1.0
jsonpointer: ^5.0.1
langchainhub: ~0.0.8
langsmith: ~0.1.30
ml-distance: ^4.0.0
openapi-types: ^12.1.3
p-retry: 4
uuid: ^10.0.0
yaml: ^2.2.1
zod-to-json-schema: ^3.22.3
zod: ^3.22.4

maintainers:

dist-tags: latest: 0.2.12 next: 0.2.3-rc.0

published 6 days ago by basproul braceasproul@gmail.com

dosubot[bot] commented 1 month ago

@mohitpandeyji Hi there! I'm here to help you with any issues or questions you have. I can also assist you in becoming a contributor.

The PDFLoader in LangChain.js might not be reading the content of some PDF files due to the variety and complexity of PDF formats. Here are a few steps you can take to address this issue:

  1. Override the Separator: The loader parses individual text elements and joins them together with a space by default, which might not work correctly for all PDFs. You can try overriding the separator with an empty string:

    import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
    
    const loader = new PDFLoader("src/document_loaders/example_data/example.pdf", {
      parsedItemSeparator: "",
    });
    
    const docs = await loader.load();
  2. Handle Excessive Newlines: The WebPDFLoader class includes a mechanism to handle excessive newlines by checking the vertical position of text items. This can help in cases where the content is not being read correctly due to formatting issues:

    const textItems = [];
    let lastY;
    for (const item of content.items) {
      if ("str" in item) {
        if (lastY === item.transform[5] || !lastY) {
          textItems.push(item.str);
        } else {
          textItems.push(`\n${item.str}`);
        }
        lastY = item.transform[5];
      }
    }
    const text = textItems.join(this.parsedItemSeparator);
  3. Check Dependencies: Ensure that the pdf-parse library is installed and correctly set up, as it is used by the PDFLoader for its PDF parsing capabilities. If the library is not installed, you can add it using:

    npm install pdf-parse

By implementing these steps, you should be able to improve the PDF content reading capabilities of the PDFLoader in LangChain.js [1][2].

To continue talking to Dosu, mention @dosu.

jeasonnow commented 1 month ago

https://github.com/langchain-ai/langchainjs/blob/48771908966919a3ba150ceb2b8e0f2228257722/libs/langchain-community/src/document_loaders/fs/pdf.ts#L57-L91

This is because the pdf loader only supports parsing plain text (textContent), which is a normal trade-off, as parsing scanned documents would require analyzing images and extracting text through OCR, which should not be the responsibility of the pdf loader. It is recommended to use non-scanned PDFs.
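
For example, a minimal sketch for detecting this case up front (it reuses the PDFLoader import from the issue's example code; an empty or whitespace-only result strongly suggests a scanned PDF):

import { PDFLoader } from 'langchain/document_loaders/fs/pdf';

// Returns true when no text layer can be extracted, which usually means
// the pages are scanned images that would need OCR.
export async function isLikelyScannedPdf(file: Blob): Promise<boolean> {
  const loader = new PDFLoader(file);
  const docs = await loader.load();
  return docs.length === 0 || docs.every((d) => d.pageContent.trim() === '');
}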

jeasonnow commented 1 month ago

The implementation of pdf loader is based on pdf-parse. Link: pdf-parse GitHub page
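
A quick way to see what pdf-parse itself extracts from a problem file (a standalone sketch; 'problem.pdf' is a placeholder path and pdf-parse must be installed):

import * as fs from 'node:fs';
import pdf from 'pdf-parse';

async function main() {
  const buffer = fs.readFileSync('problem.pdf');
  const parsed = await pdf(buffer);
  console.log('pages:', parsed.numpages);
  console.log('extracted text length:', parsed.text.trim().length);
  // A length of 0 means pdf-parse found no text layer, i.e. likely a scan.
}

main();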

mohitpandeyji commented 1 month ago

Done in 2.1s
mohit@mohit-hp-pavilion-laptop:~/Documents/llm-document-search$ npx pnpm prepare:data

langchain-pinecone@0.0.1 prepare:data /home/mohit/Documents/llm-document-search
tsx -r dotenv/config ./scripts/alectify-ingest.ts

node:internal/modules/esm/resolve:299
        return new ERR_PACKAGE_PATH_NOT_EXPORTED(
               ^

Error [ERR_PACKAGE_PATH_NOT_EXPORTED]: Package subpath './document_loaders/fs/pdf' is not defined by "exports" in /home/mohit/Documents/llm-document-search/node_modules/.pnpm/node_modules/@langchain/community/package.json
    at exportsNotFound (node:internal/modules/esm/resolve:299:10)
    at packageExportsResolve (node:internal/modules/esm/resolve:646:9)
    at resolveExports (node:internal/modules/cjs/loader:595:36)
    at Module._findPath (node:internal/modules/cjs/loader:685:31)
    at Module._resolveFilename (node:internal/modules/cjs/loader:1172:27)
    at a._resolveFilename (/home/mohit/Documents/llm-document-search/node_modules/.pnpm/tsx@4.7.0/node_modules/tsx/dist/cjs/index.cjs:1:1729)
    at Module._load (node:internal/modules/cjs/loader:1012:27)
    at Module.require (node:internal/modules/cjs/loader:1271:19)
    at require (node:internal/modules/helpers:123:16)
    at /home/mohit/Documents/llm-document-search/lib/pdf-loader.ts:2:1033 {
  code: 'ERR_PACKAGE_PATH_NOT_EXPORTED'
}

Node.js v22.2.0
 ELIFECYCLE  Command failed with exit code 1.

@dosu According to your first suggestion I made the changes, but it is giving the above issue.

dosubot[bot] commented 1 month ago

The error you're encountering, ERR_PACKAGE_PATH_NOT_EXPORTED, indicates that the subpath ./document_loaders/fs/pdf is not defined in the exports field of the @langchain/community package's package.json.

To resolve this, you should import the PDFLoader from the correct path as defined in the package's exports. Based on the latest information, you should use the following import statement:

import { PDFLoader } from "@langchain/community/document_loaders/web/pdf";

Here's how you can update your code:

import { PDFLoader } from "@langchain/community/document_loaders/web/pdf";
import { CSVLoader } from 'langchain/document_loaders/fs/csv';
import { DocxLoader } from 'langchain/document_loaders/fs/docx';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

function getLoader(file: Blob, fileName: string) {
  if (fileName.endsWith('.pdf')) {
    const loader = new PDFLoader(file);
    return loader;
  } else if (fileName.endsWith('.csv')) {
    const loader = new CSVLoader(file);
    return loader;
  } else if (fileName.endsWith('.docx')) {
    const loader = new DocxLoader(file);
    return loader;
  }
}

export async function getChunkedDocsFromPdf(
  file: Blob,
  fileName: string,
  //@ts-expect-error
): Promise<Document<Record<string, any>>[]> {
  try {
    const loader = getLoader(file, fileName);

    const docs = await loader.loadAndSplit();
    // const docs = await loader.load();

    if (!docs || docs.length === 0) {
      throw new Error('No documents were loaded from the PDF.');
    }

    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 100,
    });

    const chunkedDocs = textSplitter.splitDocuments(docs);

    return chunkedDocs;
  } catch (e) {
    console.log(e);
    console.log('Failed to load pdf');
  }
}

This should resolve the ERR_PACKAGE_PATH_NOT_EXPORTED error by using the correct export path for the PDFLoader [1].

To continue talking to Dosu, mention @dosu.

jeasonnow commented 1 month ago

Maybe you could try using the UnstructuredLoader to handle this. @mohitpandeyji
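
For context, a minimal sketch of what that might look like (the file path and apiUrl are placeholders; the UnstructuredLoader needs a running Unstructured API instance):

import { UnstructuredLoader } from '@langchain/community/document_loaders/fs/unstructured';

// The Unstructured API can OCR scanned PDFs, which the plain-text pdf
// loader deliberately does not do.
const loader = new UnstructuredLoader('path/to/scanned.pdf', {
  apiUrl: 'http://localhost:8000/general/v0/general', // assumed local deployment
});
const docs = await loader.load();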

TimSusa commented 2 weeks ago

From my side, it highly depends on the server response you get. In my case this was mostly sufficient, since LangChain just wants a Blob:

// Imports assumed for the snippet below (loader paths per @langchain/community
// and the xlsx package; adjust to your own versions):
import { CSVLoader } from '@langchain/community/document_loaders/fs/csv';
import { DocxLoader } from '@langchain/community/document_loaders/fs/docx';
import { PPTXLoader } from '@langchain/community/document_loaders/fs/pptx';
import { TextLoader } from 'langchain/document_loaders/fs/text';
import { WebPDFLoader } from '@langchain/community/document_loaders/web/pdf';
import type { Document } from '@langchain/core/documents';
import * as Xlsx from 'xlsx';

// `getFileName` comes from my codebase; a plausible stand-in (an assumption):
const getFileName = (url: string): string => url.split('/').pop() ?? url;

export enum MimeTypes {
  pdf = 'application/pdf',
  oct = 'binary/octet-stream',
  txt = 'text/plain',
  utext = 'text/plain; charset=utf-8',
  pptx = 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
  docx = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  xlsx = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
  csv = 'text/csv',
}

export enum FileExtensions {
  pdf = '.pdf',
  txt = '.txt',
  docx = '.docx',
  pptx = '.pptx',
  csv = '.csv',
  xlsx = '.xlsx',
}

export type DocLoaders =
  | typeof DocxLoader
  | typeof PPTXLoader
  | typeof TextLoader
  | typeof WebPDFLoader
  | typeof CSVLoader;

export async function getDocData(
  loader: DocLoaders,
  data: ArrayBuffer | Buffer | Error,
  url: string,
  type: MimeTypes
): Promise<string> {
  try {
    if (data instanceof Buffer || data instanceof ArrayBuffer) {
      const blob = new Blob([data], { type });
      const docs = await processDoc(blob, loader);

      // Attach the page index and source URL to each document's metadata,
      // then serialize the enriched documents.
      const enriched = docs.map(({ pageContent }: Document, idx) => {
        return {
          pageContent,
          metadata: {
            page: idx,
            url,
          },
        };
      });

      return JSON.stringify(enriched);
    }
    return 'error fetching page content';
  } catch {
    return 'error fetching page content';
  }
}

export async function processDoc(blob: Blob, Loader: DocLoaders) {
  const localLoader = new Loader(blob);
  const doc = await localLoader.load();
  return doc;
}

export async function getDocsByLoader({
  url,
  fileExt,
  data,
}: {
  url: string;
  fileExt: string;
  data: Buffer | ArrayBuffer | Error;
}) {
  if (fileExt === FileExtensions.txt) {
    return await getDocData(TextLoader, data, url, MimeTypes.txt);
  } else if (fileExt === FileExtensions.pdf) {
    return await getDocData(WebPDFLoader, data, url, MimeTypes.pdf);
  } else if (fileExt === FileExtensions.docx) {
    return await getDocData(DocxLoader, data, url, MimeTypes.docx);
  } else if (fileExt === FileExtensions.pptx) {
    return await getDocData(PPTXLoader, data, url, MimeTypes.pptx);
  } else if (fileExt === FileExtensions.csv) {
    return await getDocData(CSVLoader, data, url, MimeTypes.csv);
  } else if (fileExt === FileExtensions.xlsx) {
    const workbook = Xlsx.read(data, { type: 'buffer' });

    let mergedCsvData = `${getFileName(url)},`;
    let isFirstSheet = true;

    // In order to merge all sheets into one CSV, we need to add the sheet name to each row
    workbook.SheetNames.forEach(sheetName => {
      const worksheet = workbook.Sheets[sheetName];
      const csvData = Xlsx.utils.sheet_to_csv(worksheet);

      // Add "Sheet Name" to each row of the current sheet
      const rows = csvData
        .split('\n')
        .map((row, index) => {
          if (index === 0 && isFirstSheet) {
            return row; // Keep the header as is
          }
          return sheetName + ',' + row;
        })
        .join('\n');

      mergedCsvData += rows + '\n';
      isFirstSheet = false;
    });

    return await getDocData(CSVLoader, Buffer.from(mergedCsvData), url, MimeTypes.csv);
  }
  return 'Nothing found';
}

For example, when using the Microsoft Graph API, I had to do this before the loader would accept my data:

async function readableStreamToArrayBuffer(stream: ReadableStream): Promise<ArrayBuffer> {
  const response = new Response(stream as BodyInit);
  const arrayBuffer = await response.arrayBuffer();
  return arrayBuffer;
}
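
As a hypothetical usage sketch tying this together (the Graph download URL is a placeholder):

// Any endpoint that returns a ReadableStream body works here.
const res = await fetch('https://graph.microsoft.com/v1.0/.../content');
const data = await readableStreamToArrayBuffer(res.body as ReadableStream);
const docsJson = await getDocsByLoader({ url: res.url, fileExt: '.pdf', data });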