PDF import fails. - Githubissues

cidrugHug8 commented 1 year ago

Hi, the following error appeared in the Docker log.

loading pdf
Warning: Indexing all PDF objects
Error
    at InvalidPDFExceptionClosure (/app/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:452:35)
    at Object.<anonymous> (/app/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:455:2)
    at __w_pdfjs_require__ (/app/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:45:30)
    at Object.<anonymous> (/app/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:7939:23)
    at __w_pdfjs_require__ (/app/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:45:30)
    at /app/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:88:18
    at /app/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:91:10
    at webpackUniversalModuleDefinition (/app/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:18:20)
    at Object.<anonymous> (/app/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:25:3)
    at Module._compile (node:internal/modules/cjs/loader:1254:14) {
  message: 'Invalid PDF structure'
}

n4ze3m commented 1 year ago

Hi there, I'm sorry for the PDF loading issue you encountered. Could you please confirm whether you used a protected PDF file? This information will help me better understand the problem and provide you with the appropriate solution. Thank you!

n4ze3m commented 1 year ago

UPDATE: I tested with a password-protected PDF file, and it failed to process. I will figure out how to resolve this issue.

cidrugHug8 commented 1 year ago

Thank you for your quick reply. The PDF file is not encrypted. The pdfinfo output is as follows.

$ pdfinfo  Engineer\ Reference.pdf 
Title:           My Document
Subject:         
Keywords:        
Author:          yamada
Producer:        madbuild
CreationDate:    Wed Jun 22 11:02:51 2022 JST
ModDate:         Wed Jun 22 11:02:51 2022 JST
Custom Metadata: no
Metadata Stream: no
Tagged:          no
UserProperties:  no
Suspects:        no
Form:            none
JavaScript:      no
Pages:           175
Encrypted:       no
Page size:       595.28 x 841.89 pts (A4)
Page rot:        0
File size:       25883386 bytes
Optimized:       no
PDF version:     1.4
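For anyone who wants to run the same check in a script, here is a minimal sketch that reads the "Key: value" lines of pdfinfo output like the listing above. The helper names `parse_pdfinfo` and `is_encrypted` are made up for illustration, and the sketch assumes the `Encrypted` value starts with "yes" or "no".

```python
def parse_pdfinfo(output: str) -> dict:
    """Turn pdfinfo's 'Key: value' lines into a dictionary of strings."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def is_encrypted(info: dict) -> bool:
    """Assumption: the Encrypted field begins with 'yes' for protected files."""
    return info.get("Encrypted", "").lower().startswith("yes")


# Sample text copied from the pdfinfo listing in this thread.
sample = """Title:           My Document
Subject:         
CreationDate:    Wed Jun 22 11:02:51 2022 JST
Pages:           175
Encrypted:       no
File size:       25883386 bytes
"""

info = parse_pdfinfo(sample)
print(is_encrypted(info), int(info["Pages"]))  # False 175
```

In practice the text would come from running pdfinfo on the file, for example via `subprocess.run`, rather than from a hard-coded string.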

noureldinz3r0 commented 1 year ago

Same here, I get this error: PDF.js v2.9.359 (build: e667c8cbc) Message: Invalid PDF structure.

It doesn't happen with all PDFs, only when the PDF is fairly large.

n4ze3m commented 1 year ago

> same i get this error: PDF.js v2.9.359 (build: e667c8cbc) Message: Invalid PDF structure.
>
> but not all pdfs, if the pdf is a bit large this happens

Yes, I am currently using pdf-parse as a document loader, but it cannot handle large files. So, I am trying to set up a custom loader that will split large PDFs into smaller ones and then feed them to pdf-parse.
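The split-then-parse idea can be sketched roughly as follows. This is not the dialoqbase loader (which is not shown in this thread), only an illustration in Python: `load_in_batches` and the `parse_batch` callback are hypothetical names, and the callback stands in for whatever parser handles one small page range.

```python
from typing import Callable, List, Tuple


def load_in_batches(
    total_pages: int,
    batch_size: int,
    parse_batch: Callable[[int, int], str],
) -> Tuple[str, List[Tuple[int, int]]]:
    """Parse pages in small batches; return joined text and failed ranges."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    parts: List[str] = []
    failed: List[Tuple[int, int]] = []
    for start in range(0, total_pages, batch_size):
        end = min(start + batch_size, total_pages)  # end is exclusive
        try:
            parts.append(parse_batch(start, end))
        except Exception:
            # Record the failing range instead of aborting the whole import.
            failed.append((start, end))
    return "\n".join(parts), failed


# Stand-in parser for demonstration; a real one would extract page text.
def fake_parse(start: int, end: int) -> str:
    return f"text of pages {start + 1}-{end}"


text, failed = load_in_batches(60, 25, fake_parse)
print(text)
print(failed)
```

Recording failed ranges is a design choice of this sketch: one unreadable batch does not stop the rest of the document from being imported.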

MY221B commented 1 year ago

I have the same problem, and splitting the large file into multiple 25-page sections works.

For anyone with the same issue, a workaround for now is to split the PDF into separate files with PyPDF2 in Python. Install the library, save the script below as pdf_splitter.py, and run it:

pip install PyPDF2
python pdf_splitter.py

import PyPDF2

pdf_path = "path/to/your/pdf.pdf"
pages_per_file = 25

# Keep the source file open while pages are copied; the reader loads pages lazily.
with open(pdf_path, "rb") as file:
    reader = PyPDF2.PdfReader(file)
    total_pages = len(reader.pages)

    file_number = 1
    page_count = 0
    writer = PyPDF2.PdfWriter()

    for page_number in range(total_pages):
        writer.add_page(reader.pages[page_number])
        page_count += 1

        # Write a chunk when it is full or when the last page has been added
        if page_count == pages_per_file or page_number == total_pages - 1:
            output_filename = f"output_file_{file_number}.pdf"
            with open(output_filename, "wb") as output_file:
                writer.write(output_file)

            # Reset the page count and create a new writer for the next file
            page_count = 0
            file_number += 1
            writer = PyPDF2.PdfWriter()
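
To preview how many output files the splitter will produce before running it, the page ranges can be computed on their own. A small sketch, where `chunk_ranges` is a hypothetical helper and the numbers are the 175 pages and 25 pages per file mentioned in this thread:

```python
def chunk_ranges(total_pages, pages_per_file):
    """Return (start, end) page index pairs, end exclusive, one per output file."""
    if pages_per_file <= 0:
        raise ValueError("pages_per_file must be positive")
    return [
        (start, min(start + pages_per_file, total_pages))
        for start in range(0, total_pages, pages_per_file)
    ]


ranges = chunk_ranges(175, 25)
print(len(ranges))  # 7
print(ranges[-1])   # (150, 175)
```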


n4ze3m commented 1 year ago

Hey guys, I have created a custom PDF loader in v0.0.12. I hope it resolves the issue with large PDF files. Please try the latest version and let me know.

Note that the PDF loader still can't load protected PDF files.

l4time commented 1 year ago

I'm able to upload PDFs with thousands of pages with v0.0.12. That fixed it for me.

n4ze3m commented 1 year ago

Closing this issue based on the comment above. Feel free to reopen if the problem still exists. Thank you.

n4ze3m / dialoqbase

PDF import fails. #4
